Manuel Blázquez Ochando

Full Professor, Faculty of Documentation Sciences

Department of Library and Information Science. Universidad Complutense de Madrid


1. Introduction: Media Perception and the Necessary Nuance

On April 9, 2026, the newspaper ABC published an article titled "Academic Journals Fill Up with «Hallucinations» of AI: «They Have Cited Works of Mine That Do Not Exist»", authored by Beatriz L. Echazarreta. The journalistic piece recounts the experience of Professor José Antonio Sanahuja from the Complutense University of Madrid, who was perplexed to discover that an article published in a journal endorsed by the Spanish Foundation for Science and Technology (FECYT) contained citations to works attributed to him that had never been written. The case, which ended with the retraction of the article after verifying that twenty-six bibliographic references could not be confirmed, is paradigmatic of an emerging phenomenon that has raised alarms within the international academic community and has been debated since the first publicly accessible GPT models appeared in 2023.

It is advisable to begin this analysis by acknowledging that the ABC article correctly identifies a real and concerning symptom. Indeed, there exists a segment of researchers who, lacking adequate methodological training, delegate to generative artificial intelligence systems—typically accessible through conversational interfaces such as ChatGPT—the task of drafting entire sections of their academic work, including the delicate process of constructing the bibliography. The outcome of this practice, as the newspaper rightly documents, is the proliferation of what is colloquially termed “hallucinations”: bibliographic references that appear plausible in form but are entirely fictitious in content, thereby contaminating the ecosystem of scientific communication and eroding trust in knowledge validation mechanisms.

However, where journalistic analysis stops—at the anecdote, legitimate and newsworthy as it may be—scientific-documentary analysis must necessarily advance toward the etiology of the phenomenon and, more importantly, toward a rigorous characterization of the methodologies that enable the use of these technologies in a manner that is not only harmless but genuinely productive for the advancement of knowledge. The ABC report, in its necessary journalistic brevity, falls into a reductionism that warrants qualification: it presents the tool as inherently flawed, as a “distiller of inventions that does not know how to say no,” in the words of Professor Sanahuja cited by the media, rather than focusing the critique on the absence of documentary methodology underlying cases of misuse.

My purpose in the following pages is by no means to launch a crusade against a medium of communication that, after all, fulfills its social function of alerting the public to an emerging problem. Rather, I aim to complement this necessarily superficial perspective with a deep immersion into the reality of scientific work with artificial intelligence—a reality that unfolds through methodological pathways radically distinct from those imagined by the news—and, by extension, public perception. The question that structures this article is not whether AI hallucinates or does not hallucinate—it certainly does when used without adequate oversight—but rather this far more substantive one: How does artificial intelligence operate when subjected to the methods of Science?

To address this question, it will be necessary to undertake a journey that begins with the very anatomy of the hallucinatory phenomenon, continues with a detailed exposition of the methodological foundations developed by Documentation Science to control automated text generation—from retrieval-augmented generation (RAG) systems to the orchestration of multiple agents, each with its own role prompting—, and concludes with a reflection on the role of the human researcher within this new technological ecosystem. The reader will discover that between the media image of the user confronting a conversational black box and the reality of the contemporary documentary laboratory lies a methodological abyss that merits surgical-level exploration.

2. The Anatomy of Hallucination: How the User Interface and Naive Questions Induce Error

Before delving into the complex intricacies of the scientific methodology applied to generative AI, it is essential to precisely understand the mechanism underlying the phenomenon so accurately described in the ABC newspaper article. For hallucination—or, if one prefers the more technically precise term coined by Ramón Salaverría in the same journalistic piece, fabulation—does not constitute a random failure of "intelligence" on the part of the model, but rather the logical and, to some extent, predictable consequence of an inadequate interaction design.

To understand it, we must temporarily shed the anthropomorphic metaphor that so poorly serves public comprehension of these technologies. A large language model (LLM), such as those underlying ChatGPT, Claude, or Gemini, does not "think," does not "know," and certainly does not "remember" in the human sense of these terms. It is, fundamentally, an extremely sophisticated sequence prediction system—predicting tokens, or word fragments—trained on vast quantities of text to maximize the statistical likelihood of its outputs. In other words: through processing billions of documents, the model has learned which words tend to appear together, which syntactic structures are probable in each context, and, crucially for our case, which textual patterns constitute a bibliographic reference.

When a researcher—say, a PhD candidate under pressure to meet deadlines—opens a conversational interface and types a prompt such as: "Give me ten key bibliographic references on theories of International Relations in the 21st century", they are activating a very specific mechanism. The model does not access any bibliographic database, nor does it consult Scopus, OpenAlex, Dialnet, or Google Scholar. What it does, simply, is activate its token prediction engine to generate a textual sequence that statistically resembles responses to similar requests found in its training corpus.

The result is a string of characters that has the appearance of an academic reference: a plausible author name (a common combination of first and last name in the field’s literature), followed by a title that vaguely evokes the requested topic, followed by the name of a recognizable journal and a coherent publication year. The model has succeeded in its fundamental task: it has produced a plausible text. The problem, of course, is that plausibility does not equate to truth, and the system lacks any mechanism to distinguish between the two.

The ABC article on the phenomenon of “hallucination” in AI models incorrectly attributes to OpenAI a study that actually belongs to Linardon (2025). The information disseminated originates from a publication in Nature, which addresses the phenomenon but cites Linardon’s study as the primary source. In this work, Linardon employs OpenAI’s ChatGPT 4o to conduct a synthetic test of reference generation, demonstrating how standard training and evaluation procedures tend to “reward the fabrication of a conjecture more than the recognition of one’s own ignorance”. Operationally, the model has been optimized to provide any response rather than admit it lacks information. Silence and abstention are not behaviors rewarded by the reinforcement learning process. Thus, when confronted with a request for bibliographic references beyond its factual capabilities, the model chooses the only strategy its architecture permits: generating a formally correct yet materially false answer.

It is crucial to highlight here the categorical error committed by the average user who formulates such a prompt: they are using AI as a search engine, not as a controlled generator. They are demanding of the system a function—the precise retrieval of factual information—for which it has not been designed or trained (at least not yet, although everything is a matter of time). The language model does not index the web in real time, does not maintain a minute-by-minute updated and verified knowledge base, and cannot distinguish between authoritative sources and spurious content (beyond the selection and scope of its training data and its ingestion bias). Its domain is plausible text generation, not propositional truth.

This distinction is absolutely central to everything that will follow in the subsequent lines. Because the scientific methodology we will develop in the following sections does not aim to "correct" a supposed deficiency of the model—the model performs exactly what it was designed to do—but rather to completely reconfigure the operational context in which the model is deployed. It is not a matter of asking the system to stop hallucinating, but of constructing a methodological framework that renders hallucination structurally impossible.

3. The True Scientific Use (I): The Construction of the RAG Corpus

3.1. From the Open Web to the Repository of Validated Fragments

We thus arrive at the turning point that clearly separates the negligent praxis so justly criticized by ABC from the rigorous methodology that characterizes work in laboratories of Documentation Science applied to generative AI. The first pillar upon which this methodology rests is none other than Retrieval-Augmented Generation (RAG).

The principle underlying RAG is conceptually simple, although its technical implementation entails considerable complexity. Essentially, it involves reversing the informational flow characteristic of the naive querying described in the previous section. Instead of allowing the language model to draw from its vast, heterogeneous, and unverified training corpus—with all the associated burdens of inaccuracies, biases, and obsolescence—the RAG system deliberately restricts the search space to a curated, validated, and controlled document repository managed by the researcher himself. It is essential to pause and examine each of the phases constituting this process, for it is within them that lies the qualitative distinction separating occasional instrumental use from genuine scientific application.

3.1.1. Researcher’s Pre-selection: The Construction of the Specialized Corpus

The process begins with a fundamental methodological decision that rests exclusively with the human researcher: the selection of the documentary corpus that will serve as the knowledge foundation for the Digital Twin. Far from the image of a user posing an open-ended question to the vastness of the Internet, the documentalist scientist assumes here a role of maximum responsibility in curation.

In practice, this phase involves the systematic collection of a set of academic documents—typically in PDF format—that the researcher deems relevant to the specific domain framing their work. The volume of this corpus may range from a hundred to several thousand documents, depending on the scope of the field and the research objectives. What matters crucially is not the quantity, but the verified quality of the selected sources.

These documents invariably originate from established academic databases and have therefore passed the quality filters imposed by these platforms. Frequently, the corpus also includes the researcher’s prior work and that of their research group, thereby ensuring epistemological continuity with established lines of investigation.

The essential point that must be made absolutely clear is the following: the scientist does not ask the AI model about the world at large; rather, they ask about a subdomain of the world that they themselves have previously delimited and validated. This operation of closing off the domain of knowledge constitutes the first and most fundamental line of defense against bibliographic hallucination. If the system can only “see” what the researcher has introduced into its knowledge base, it is structurally impossible for it to generate a reference to a non-existent article, simply because that article does not exist within the informational space accessible to the system.

3.1.2. Chunking and Vectorization: Transforming Documents into Mathematical Vectors

Once the documentary corpus has been collected, the technical processing phase begins, enabling semantic information retrieval. PDF documents, with all their typographic and structural richness, are not directly processable by the attention mechanisms of a large language model (LLM). It is necessary to subject them to a two-step transformation process: chunking, or fragmentation, and vectorization, or embedding.

Chunking consists of dividing each document into manageable text fragments—typically between 500 and 1,500 tokens, depending on the system configuration—while preserving natural semantic units (paragraphs, subsections) as far as possible. This operation is not merely mechanical: inadequate segmentation can disrupt argumentative coherence and hinder precise information retrieval. Therefore, more advanced systems employ semantic chunking algorithms that respect the natural boundaries of academic discourse.
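As an illustration, the fragmentation step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: token counts are approximated by whitespace word counts (a production system would use the tokenizer of its embedding model), and a single paragraph longer than the limit is left whole.

```python
def chunk_document(text: str, max_tokens: int = 500) -> list[str]:
    """Split a document into chunks of at most max_tokens, keeping
    paragraphs (natural semantic units) intact whenever possible."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in text.split("\n\n"):
        n = len(paragraph.split())  # crude token estimate (word count)
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))  # close the current chunk
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A semantic chunker would additionally merge or split on discourse boundaries; the paragraph heuristic above is only the simplest approximation of that idea.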

Each of these fragments—or chunks—is then subjected to an embedding or vectorization process. Using a specialized model—distinct from the main generative model—each text fragment is converted into a mathematical representation: a high-dimensional vector (typically 768, 1,024, or 1,536 dimensions) that encodes its semantic content. The fundamental property of these vectors is that their proximity in the vector space corresponds to the semantic similarity of the texts they represent. Fragments addressing analogous topics, even if employing different vocabulary, will occupy nearby positions in this mathematical space.
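The proximity property just described can be made concrete with cosine similarity, the standard measure of closeness between embedding vectors. The four-dimensional vectors below are invented purely for illustration; real embedding models produce vectors of hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Closeness of two vectors: near 1.0 for semantically similar
    fragments, near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (illustrative only): two fragments on citation practices
# and one unrelated fragment.
v_citation_analysis = [0.9, 0.8, 0.1, 0.0]
v_reference_mgmt = [0.8, 0.9, 0.2, 0.1]
v_unrelated = [0.0, 0.1, 0.9, 0.8]

# Thematically close fragments score higher than unrelated ones.
assert (cosine_similarity(v_citation_analysis, v_reference_mgmt)
        > cosine_similarity(v_citation_analysis, v_unrelated))
```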

These vectors, along with the metadata associated with each fragment (source document, position in the text, authors, publication year, DOI), are stored in a specialized vector database optimized for high-speed similarity searches. It is this database, and not the generative model, that assumes responsibility for factual information retrieval.

3.1.3. Semantic Retrieval: How the Digital Twin Accesses Knowledge

When the researcher poses a query to their Digital Twin, the system executes a three-step retrieval process:

  1. The user's query is vectorized using the same embedding model that was used to process the document corpus.
  2. The resulting vector is compared with the millions of vectors stored in the vector database, identifying those fragments (chunks) whose semantic content is closest to that of the query.
  3. The system retrieves the N most relevant fragments—typically between five and twenty, depending on the configuration—and injects them, along with their metadata, into the prompt that will ultimately be sent to the generative model.
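The retrieval steps above can be sketched end to end. Everything here is a self-contained stand-in: the `embed` function is a deterministic hashed bag-of-words stub replacing a real embedding model, the in-memory list replaces a real vector database, and the sample chunks are invented for illustration.

```python
import math
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    """Stub embedder: hashed bag-of-words, normalized to unit length.
    A real pipeline would call a dedicated embedding model here."""
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[sum(ord(ch) for ch in word) % dim] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Ingestion time: vectorize corpus chunks and keep their metadata.
corpus = [
    {"text": "Retrieval-augmented generation grounds answers in sources.",
     "source": "Doc A, p. 3"},
    {"text": "Citation traceability links each claim to its chunk.",
     "source": "Doc B, p. 12"},
    {"text": "Vector databases store embeddings for similarity search.",
     "source": "Doc C, p. 7"},
]
for chunk in corpus:
    chunk["vector"] = embed(chunk["text"])

def retrieve(query: str, n: int = 2) -> list[dict]:
    # Step 1: vectorize the query with the SAME embedding model.
    q = embed(query)
    # Step 2: compare against stored vectors (brute force here; real
    # vector databases use approximate nearest-neighbour indexes).
    scored = sorted(
        corpus,
        key=lambda c: sum(a * b for a, b in zip(q, c["vector"])),
        reverse=True,
    )
    # Step 3: return the N most relevant chunks, metadata included.
    return scored[:n]

# The retrieved fragments are injected into the generative model's prompt.
hits = retrieve("how does similarity search over embeddings work?")
prompt = "Answer using ONLY these sources:\n" + "\n".join(
    f"[{h['source']}] {h['text']}" for h in hits)
```

Because the vectors are normalized, the dot product used for ranking is equivalent to cosine similarity.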

It is only at this point that the language model comes into action. Its task is no longer to imagine a plausible response based on statistical patterns learned during training, but to synthesize, summarize, rephrase, or expand the content of the retrieved fragments. The model now operates as a sophisticated editor working with explicitly provided sources, not as an oracle drawing knowledge from its inscrutable parametric memory. The implications of this redesign of the workflow are profound and deserve to be emphasized with complete clarity:

  1. Structural Impossibility of Bibliographic Hallucination: The system can only cite sources that are present in the document corpus uploaded by the researcher. There is no room for the invention of nonexistent references, simply because the model is not generating references ex nihilo, but rather reproducing—with the appropriate format—the metadata associated with the retrieved fragments.
  2. Absolute Traceability: Every statement generated by the system can be traced back to its original chunk and, through it, to the academic document serving as its source. The researcher maintains full control over the genealogy of the produced knowledge at all times.
  3. Dynamic updating: The document corpus can be enriched at any time with new publications, ensuring that the system always operates with the most recent state of the art without requiring costly model retraining processes.

In summary, the RAG methodology entails a radical transformation of the epistemic status of generative AI. Rather than being a generator of plausible but potentially fictitious text, the system becomes a tool for evidence-based research assistance grounded in verifiable documentary sources. The contrast with the practice described in the ABC article could not be more stark.

4. The True Scientific Use (II): The Expert Committee as Orchestrated AI Agents in a Complex Pipeline

4.1. Beyond the Solitary User: Orchestration of Multiple Agents

If the RAG system constitutes the foundation upon which the scientific use of generative AI is built, the second methodological pillar concerns the manner in which the language model is summoned to perform its task. Here too, rigorous practice diverges drastically from the widespread yet erroneous image of a solitary user typing questions into a conversational interface and uncritically accepting the responses obtained.

In the Laboratory of Documentation Science Applied to AI, the language model is not treated as a singular interlocutor, but rather as a platform upon which multiple specialized agents are deployed, each configured via role prompting to perform a specific function within an orchestrated textual production pipeline.

It is important to clarify here the concept of agent, as a full understanding of this concept is a prerequisite for appreciating the sophistication of the approach. An AI agent, in the context under consideration, is not a virtual person nor an anthropomorphic simulation. Rather, it is, more prosaically, an instance of a language model to which a specific functional role, a set of operational instructions, and, in many cases, a specific subset of the document corpus on which to work have been assigned through a carefully designed system prompt.

The difference between querying a "naked" model and deploying an orchestrated committee of agents is analogous to asking a first-year student to draft a doctoral thesis versus submitting the same text to a panel composed of specialists in methodology, academic writing, bibliographic review, and editing. In the first case, the outcome will inevitably be limited by the capabilities—and biases—of the single agent involved. In the second, the structured interaction of multiple critical perspectives substantially enhances the quality of the final product.

4.2. The Orchestrated Pipeline: Sequence of Agents and Functions

I will now describe the typical sequence of agents constituting an AI-assisted academic production pipeline within the context of documentary research. The reader will observe that each agent assumes a limited responsibility, and the output of each phase serves as the input for the subsequent phase.

4.2.1. The "Ideator" Agent: Generation of Hypotheses and Argumentative Frameworks

The first phase of the pipeline corresponds to the "Ideator" Agent, an instance of the model configured via role prompting with the following instructions:

"You are a specialist in generating research hypotheses in the field of [specific domain]. Your task is to propose [N] original argumentative lines grounded in the state of the art reflected in the provided documentary corpus. You must identify gaps in the existing literature, unexplored connections between theoretical traditions, and opportunities for conceptual advancement. Present each proposal as a hierarchical schema comprising a main thesis and supporting arguments."

The Ideator Agent operates exclusively on the RAG corpus previously loaded by the researcher. Its function is not to invent hypotheses in a vacuum, but to detect patterns, tensions, and opportunities within the selected body of literature. To this end, the system launches multiple parallel queries to the agent, slightly varying the temperature parameters—which control the degree of "creativity" or variability in responses—to generate a diverse range of proposals. The output of this phase is a set of N argumentative schemas—typically between ten and thirty—that will be evaluated in subsequent stages of the pipeline.
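The fan-out of parallel, temperature-varied queries can be sketched as follows. `call_model` is a hypothetical stand-in for a real LLM API call: it is stubbed here so the orchestration logic runs on its own, and it uses the temperature value only to label its output.

```python
import random

IDEATOR_ROLE = ("You are a specialist in generating research hypotheses. "
                "Propose one argumentative line grounded in the RAG corpus.")

def call_model(system_prompt: str, temperature: float, seed: int) -> str:
    """Hypothetical LLM call, stubbed with a seeded random choice so the
    example is runnable without any external API."""
    rng = random.Random(seed)
    angle = rng.choice(
        ["gap analysis", "cross-tradition link", "method transfer"])
    return f"[T={temperature:.1f}] Proposal based on {angle}"

def ideate(n_proposals: int = 12) -> list[str]:
    """Launch parallelizable queries, sweeping temperature from
    conservative (0.2) to exploratory (1.0) to diversify proposals."""
    proposals = []
    for i in range(n_proposals):
        temperature = 0.2 + 0.8 * i / max(n_proposals - 1, 1)
        proposals.append(call_model(IDEATOR_ROLE, temperature, seed=i))
    return proposals

schemas = ideate()  # typically 10-30 argumentative schemas
```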

4.2.2. The Drafting Agent: Iterative Textual Development over the RAG Corpus

Once the argumentative schema to be developed has been selected—selection that may be performed by the human researcher or, in more advanced configurations, by an evaluative meta-agent—the Drafting Agent is activated. The configuration of this agent via role prompting is particularly meticulous; let us examine a simple example:

"You are an academic editor specialized in [specific domain] with extensive experience publishing in high-impact journals [Specify journal if applicable]. Your task is to textually develop the provided argumentative outline, strictly adhering to the sources contained within the provided RAG corpus. You must employ a formal, precise, and rigorous academic style. Every substantive claim must be explicitly supported by a reference to the corresponding document in the corpus. Do not introduce any information not present in the RAG-retrieved fragments."

The Drafting Agent’s work is by no means a single operation. The pipeline is designed to execute N iterations of progressive refinement. In each iteration, the generated text is re-injected into the model alongside refinement instructions that may include:

  1. Enhancement of cohesion between paragraphs.
  2. Enrichment of argumentative density.
  3. Terminological precision.
  4. Adherence to the target journal’s style.
  5. Verification of consistency between claims and cited sources.

This iterative process may extend over dozens of cycles, automatically supervised by the orchestration system, which monitors the evolution of quality metrics to determine when a satisfactory threshold of marginal improvement has been reached.
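The stopping rule based on marginal improvement can be sketched as below. `refine` and `score` are stand-ins (assumptions) for the Drafting Agent call and the orchestrator's quality metric; the stub metric converges with diminishing returns so the loop's behavior is visible.

```python
def refine(text: str) -> str:
    """Stub for one Drafting Agent refinement pass."""
    return text + " [refined]"

def score(text: str) -> float:
    """Stub quality metric with diminishing returns (0-10 scale)."""
    n = text.count("[refined]")
    return 10.0 * (1 - 0.5 ** (n + 1))

def iterative_drafting(draft: str, min_gain: float = 0.2,
                       max_iters: int = 50) -> tuple[str, float]:
    """Refine until the marginal quality gain drops below min_gain."""
    current, current_score = draft, score(draft)
    for _ in range(max_iters):
        candidate = refine(current)
        candidate_score = score(candidate)
        if candidate_score - current_score < min_gain:
            break  # satisfactory threshold of marginal improvement reached
        current, current_score = candidate, candidate_score
    return current, current_score

final_draft, quality = iterative_drafting("Initial draft of the section.")
```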

4.2.3. Reviewer Agent: Evaluation According to Q1 Journal Standards

The text produced in the iterative drafting phase is not yet considered a final product. It must undergo scrutiny by a Review Agent, whose role prompting configuration explicitly incorporates the evaluation forms made available by journals such as Nature, Science, or Scientometrics to their reviewers:

"You are an anonymous reviewer for [specific Q1 journal]. You must evaluate the provided manuscript using the following criteria on a scale of 1 to 10: (a) Originality and novelty of the contribution; (b) Methodological rigor; (c) Appropriateness and recency of the bibliography; (d) Clarity of exposition and argumentative structure; (e) Relevance to the field. For each criterion, you must provide a detailed justification and, where applicable, concrete suggestions for improvement."

The Review Agent does not issue a binary judgment (approved/rejected), but rather a weighted evaluation that is integrated into the automated ranking system we will describe in the following section. Additionally, the agent generates a detailed report of strengths and weaknesses, which is reintroduced into the pipeline for another round of corrective iterations by the Drafting Agent. This writing-evaluation-correction cycle may be repeated multiple times until quality metrics exceed the thresholds pre-established by the researcher.
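At its core, a weighted evaluation of this kind reduces to a weighted average over the review criteria. The weights below are illustrative assumptions, not values prescribed by any journal.

```python
# Illustrative weights over the five review criteria (assumed values).
CRITERIA_WEIGHTS = {
    "originality": 0.25,
    "methodological_rigor": 0.25,
    "bibliography": 0.20,
    "clarity": 0.15,
    "relevance": 0.15,
}

def weighted_review(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (scale 1-10) into one weighted mark."""
    assert set(scores) == set(CRITERIA_WEIGHTS), "all criteria required"
    return sum(scores[c] * w for c, w in CRITERIA_WEIGHTS.items())

manuscript_scores = {"originality": 8, "methodological_rigor": 7,
                     "bibliography": 9, "clarity": 8, "relevance": 9}
mark = weighted_review(manuscript_scores)  # approximately 8.1 out of 10
```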

4.2.4. Citation Agent: Extraction of References from RAG Fragments

One of the most critical functions—and the one whose misuse gives rise precisely to the cases reported in the ABC article—is the management of the bibliographic apparatus. For this task, the pipeline incorporates a Citation Agent specifically designed for this purpose. Its configuration is particularly restrictive:

"You are a specialist in bibliographic reference management. Your task is to: (1) Identify all statements in the text that require bibliographic support. (2) For each of them, retrieve from the RAG corpus the source document from which the information originates. (3) Generate a citation in the specified format [APA 7th edition / Chicago / Vancouver / IEEE] using exclusively the metadata contained in the RAG fragment. (4) Compile the final reference list, eliminating duplicates and verifying formal consistency."

It is essential to note that the Citation Agent does not generate new references, but rather extracts and formats the metadata from RAG fragments that have been used as sources during the writing phase. Traceability is absolute: each citation in the text can be unambiguously linked to the chunk from the corpus it originates from, and through it, to the original academic document.
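The extract-and-format operation can be sketched as follows. The metadata fields and the sample record are hypothetical placeholders, and the formatter only approximates APA style; a real agent would apply the full style rules.

```python
def format_reference(meta: dict) -> str:
    """Build a roughly APA-style reference using ONLY the metadata
    attached to a RAG chunk; nothing is generated ex nihilo."""
    return (f"{meta['authors']} ({meta['year']}). {meta['title']}. "
            f"{meta['journal']}. https://doi.org/{meta['doi']}")

def compile_references(used_chunks: list[dict]) -> list[str]:
    """Deduplicate and sort the final reference list."""
    unique = {format_reference(c["metadata"]) for c in used_chunks}
    return sorted(unique)

# Hypothetical chunk metadata (placeholder values, not a real article).
used = [
    {"metadata": {"authors": "Doe, J.", "year": 2024,
                  "title": "Vector search in scholarly corpora",
                  "journal": "Journal of Documentation Studies",
                  "doi": "10.0000/example.1"}},
]
references = compile_references(used + used)  # the duplicate collapses
```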

4.2.5. Citation Validator Agent: Cross-verification and Quality Control

The pipeline finally incorporates a Citation Validator Agent that acts as a specific quality control instance for the bibliographic apparatus. Its function is to systematically verify, through cross-checking, that all citations appearing in the text correspond effectively to existing documents in the RAG corpus and that the metadata are complete and accurate.

The configuration of this agent includes instructions of the following type:

"You must verify, one by one, all bibliographic references in the manuscript. For each in-text citation: (a) Locate the RAG fragment serving as its source. (b) Confirm that the metadata (author, year, title, publication) match exactly. (c) Verify that the DOI or equivalent identifier is correct and resolves to the original source. (d) Generate a discrepancy report if any inconsistencies are found. If any reference cannot be traced back to the RAG corpus, mark it as 'NOT VERIFIED' for human review."

This agent constitutes the final automated barrier against bibliographic hallucination. If an imprecise or fictitious reference has slipped through due to an error in earlier phases of the pipeline, the Validator Agent will detect it and flag it for human intervention.
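The cross-check itself is conceptually simple: every citation key in the manuscript must resolve to a document present in the RAG corpus. The keys below are illustrative; a real validator would also compare metadata fields and resolve DOIs.

```python
def validate_citations(citations: list[str],
                       corpus_keys: set[str]) -> dict[str, str]:
    """Flag every citation that cannot be traced back to the RAG corpus
    as 'NOT VERIFIED' for human review."""
    return {cite: ("OK" if cite in corpus_keys else "NOT VERIFIED")
            for cite in citations}

corpus_keys = {"garcia2023", "smith2024"}  # keys of ingested documents
report = validate_citations(["garcia2023", "lopez2022"], corpus_keys)
# 'lopez2022' is not traceable to the corpus, so it is flagged for review.
```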

4.3. Orchestration as the Key to Methodological Rigor

The detailed description of the agents comprising the pipeline should not obscure a fundamental aspect: the value of the system lies not so much in each individual agent as in the structured orchestration of their interactions. It is the ordered sequence of generation, evaluation, correction, and validation that ensures the quality of the final output.

In practice, this orchestration is implemented through specialized frameworks—LangChain, LlamaIndex, Semantic Kernel, or custom developments—that manage the flow of information between agents, the persistence of intermediate states, the logging of quality metrics, and the full traceability of the process.

The result is a complex system, certainly, but also transparent and auditable. Every decision, every text transformation, and every bibliographic reference can be traced back to its origin. Nothing is left to the opacity of an inscrutable black box.

5. The Generation and Weighted Evaluation Pipeline: Automated Ranking

5.1. Beyond Single Generation: The Paradigm of Selection Among Multiple Candidates

A fundamental principle applied to generative AI states that a text is not generated; multiple texts are generated, and the optimal one is selected through objective criteria. This principle, which may seem counterintuitive to those accustomed to the human writing paradigm—where rewriting is costly and one tends to work on a single draft—constitutes one of the pillars of the scientific methodology in this field. The underlying logic is clear: given that the computational cost of generating multiple versions of a text is "relatively low," and given that small variations in generation parameters (temperature, top-p, system prompt) can produce significantly different results, it is methodologically optimal to explore the space of possible texts and subsequently apply rigorous selection criteria.

5.2. Parallel Generation with Controlled Parameter Variation

The pipeline is configured to launch, in each phase susceptible to variation, multiple parallel generation processes. These processes may differ in several aspects:

  1. Model temperature: Lower values (0.1–0.3) produce more deterministic and conservative text; higher values (0.7–1.0) favor variability and lexical creativity.
  2. System prompt configuration: Minor variations in role prompting can emphasize different aspects (e.g., methodological rigor versus conceptual originality).
  3. RAG Chunk Selection: Minor variations in the semantic retrieval algorithm can return slightly different sets of chunks, introducing nuances in the documentary grounding.

For each phase of the pipeline—ideation, initial drafting, iterative refinement—the system typically generates between five and twenty candidate versions. These versions are stored alongside metadata from their generation process (parameters used, retrieved RAG chunks, processing time) for subsequent evaluation.
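The generate-and-select step can be sketched as a best-of-N loop: produce candidates under varied parameters, keep the generation metadata alongside each one, and select the highest-scoring version. `generate` and `evaluate` are stubs (assumptions) for the model call and the rubric-based evaluation.

```python
def generate(temperature: float) -> str:
    """Stub for a model call with a given temperature setting."""
    return f"candidate written at temperature {temperature}"

def evaluate(text: str) -> float:
    """Stub for the rubric-based score a reviewer agent would return."""
    return 5.0 + (len(text) % 10) / 10

def best_of_n(temperatures: list[float]) -> dict:
    """Generate one candidate per parameter setting, store its metadata,
    score every candidate, and keep the best."""
    candidates = [{"text": generate(t), "temperature": t}
                  for t in temperatures]
    for c in candidates:
        c["score"] = evaluate(c["text"])
    return max(candidates, key=lambda c: c["score"])

winner = best_of_n([0.1, 0.3, 0.7, 1.0])
```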

5.3. Rubric-Based Weighting: Objective Evaluation of Textual Quality

The Expert Committee—the AI agents described in the previous section—does not limit itself to issuing qualitative judgments based on impressions or opinions. Each agent applies a predefined evaluation rubric that breaks down text quality into measurable dimensions and assigns numerical scores, integer or decimal, on standardized scales. A typical rubric for evaluating an academic article might include the dimensions shown in Table 1.


DIMENSION / SUB-DIMENSION

OPERATIONAL DESCRIPTION AND MEASURABLE INDICATORS

SCALE / WEIGHT / EVALUATING AGENT

DIMENSION 1: DOCUMENTARY FIDELITY

(Total Weight: 35%)

Overall Scope: Degree to which the generated content faithfully adheres to the information contained within the retrieved RAG chunks, without introducing information unsupported by the corpus.

Aggregate assessment of the three sub-dimensions

Agents: Citation Validator + Q1 Reviewer + Ideator + Orchestration

1.1. Source-Text Correspondence

Description: Degree to which the generated content faithfully matches the information contained in the retrieved RAG chunks.

Indicators: Percentage of statements traceable to a specific chunk; absence of information not supported by the corpus.

Scale: 0-10Weight: 20%Agent: Citation Validator + Q1 Reviewer

1.2. Paraphrase Accuracy

Description: Accuracy with which the original content is reformulated without introducing semantic distortions.

Indicators: Controlled semantic similarity index (non-literal, yet non-divergent).

Scale: 0-10Weight: 10%Agent: Q1 Standard Reviewer

1.3. Corpus Coverage

Description: Proportion of the relevant RAG corpus that has been effectively mobilized in the text generation.

Indicators: Number of source documents cited / Total relevant documents in the corpus.

Scale: 0-10Weight: 5%Agent: Ideator + Orchestration

DIMENSION 2: METHODOLOGICAL RIGOR

(Total Weight: 25%)

Overall Scope: Logical soundness of inferences and conclusions, consistency with the stated research design, and adequate treatment of alternative perspectives.

Aggregate assessment of the three sub-dimensions

Agents: Methodological Specialist Reviewer + Q1 Reviewer

2.1. Alignment with Research Design

Description: Coherence between the generated text and the stated methodological design (qualitative, quantitative, mixed, theoretical).

Indicators: Explicit presence of a methodological statement; internal consistency of the approach.

Scale: 0-10Weight: 10%Agent: Methodological Specialist Reviewer

2.2. Argument Validity

Description: Logical soundness of the presented inferences and conclusions.

Indicators: Absence of detectable logical fallacies; explicit argumentative structure.

Scale: 0-10Weight: 10%Agent: Q1 Standard Reviewer

2.3. Treatment of Counterarguments

Description: Inclusion and reasoned refutation of alternative perspectives present in the RAG corpus.

Indicators: Number of counterarguments identified and addressed / Total divergent perspectives in the corpus.

Scale: 0-10 | Weight: 5% | Agent: Q1 Standard Reviewer

DIMENSION 3: BIBLIOGRAPHIC APPARATUS QUALITY

(Total Weight: 25%)

Overall Scope: Unequivocal traceability of each citation to the RAG corpus, formal correctness of the required format, and relevance of the mobilized sources.

Aggregate assessment of the three sub-dimensions

Agents: Citator + Citation Validator + Reviewer + RAG System

3.1. Citation Traceability

Description: Ability to unequivocally link each citation to a document within the RAG corpus and, through it, to the original source.

Indicators: Percentage of citations with verified traceability (DOI, handle, or path to source chunk).

Scale: 0-10 | Weight: 15% | Agent: Citator + Citation Validator

3.2. Formal Reference Accuracy

Description: Strict adherence to the required citation style (APA 7th, Chicago, Vancouver, IEEE).

Indicators: Percentage of references that pass automated format validation.

Scale: 0-10 | Weight: 5% | Agent: Citator

3.3. Relevance of Cited Sources

Description: Relevance and authority of the corpus documents mobilized to support each claim.

Indicators: Average impact index of cited sources (according to corpus metrics); verified thematic appropriateness.

Scale: 0-10 | Weight: 5% | Agent: Reviewer + RAG System

DIMENSION 4: COHERENCE AND ARGUMENTATIVE STRUCTURE

(Total Weight: 13%)

Overall Scope: Clarity of the main thesis, logical progression of discourse with explicit transitions, and proportional balance among sections.

Aggregate assessment of the three sub-dimensions

Agents: Writer + Q1 Reviewer + Orchestration

4.1. Main Thesis Clarity

Description: Presence of an explicit, recognizable thesis statement, appropriately positioned within the text structure.

Indicators: Automated thesis identification; normalized position within IMRaD or equivalent structure.

Scale: 0-10 | Weight: 5% | Agent: Writer (final iteration) + Reviewer

4.2. Logical Discourse Progression

Description: Adequate sequencing of ideas, with explicit transitions and a clear argumentative hierarchy.

Indicators: Textual cohesion index (measured via analysis of connectors and anaphoric references).

Scale: 0-10 | Weight: 5% | Agent: Q1 Standard Reviewer

4.3. Structural Balance

Description: Adequate proportionality among text sections (introduction, development, conclusions).

Indicators: Deviation from reference structural ratios for the document type and disciplinary field.

Scale: 0-10 | Weight: 3% | Agent: Writer + Orchestration

DIMENSION 5: ORIGINALITY AND NOVELTY

(Total Weight: 20%)

Overall Scope: Ability to identify gaps in the literature, propose conceptual contributions not explicitly present in the corpus, and establish non-obvious connections among documents.

Aggregate assessment of the three sub-dimensions

Agents: Ideator + Reviewer + RAG System

5.1. Identification of Literature Gaps

Description: Capacity of the text to explicitly point out knowledge gaps within the analyzed RAG corpus.

Indicators: Number of identified and justified gaps; contrast with the documented state of the art.

Scale: 0-10 | Weight: 8% | Agent: Ideator + Reviewer

5.2. Conceptual Contribution

Description: Degree to which the text proposes constructs, typologies, models, or hypotheses not explicitly present in the corpus.

Indicators: Controlled lexical novelty; semantic distance from RAG chunks (measured with creativity thresholds).

Scale: 0-10 | Weight: 7% | Agent: Ideator + Reviewer

5.3. Integrative Synthesis

Description: Ability to establish non-obvious connections among documents or traditions within the RAG corpus.

Indicators: Number of explicit inter-document connections; density of cross-citations among disparate sources.

Scale: 0-10 | Weight: 5% | Agent: Ideator + RAG System

DIMENSION 6: EXPOSITORY QUALITY AND ACADEMIC STYLE

(Total Weight: 11%)

Overall Scope: Terminological precision, academic readability, and adherence to the rhetorical conventions of the specific discursive genre.

Aggregate assessment of the three sub-dimensions

Agents: Writer + Q1 Reviewer

6.1. Terminological Precision

Description: Correct and consistent use of specialized vocabulary specific to the disciplinary field.

Indicators: Terminological consistency index; alignment with domain thesaurus.

Scale: 0-10 | Weight: 5% | Agent: Writer + Reviewer

6.2. Academic Readability

Description: Fluency of discourse without compromising conceptual rigor.

Indicators: Readability index adapted for academic texts (e.g., modified Flesch-Szigriszt).

Scale: 0-10 | Weight: 3% | Agent: Writer (refinement iterations)

6.3. Discursive Genre Appropriateness

Description: Conformity with the rhetorical conventions of the specific academic genre (article, review, theoretical essay).

Indicators: Presence of canonical rhetorical moves of the genre (CARS or other models).

Scale: 0-10 | Weight: 3% | Agent: Q1 Standard Reviewer

DIMENSION 7: ROBUSTNESS AGAINST BIAS

(Total Weight: 11%)

Overall Scope: Detection of confirmation bias, evaluative neutrality of the academic tone, and representativeness of the corpus employed.

Aggregate assessment of the three sub-dimensions

Agents: Anti-Bias Validator + Reviewer + Human Researcher

7.1. Confirmation Bias Detection

Description: Verification that the text does not selectively ignore contrary evidence present in the RAG corpus.

Indicators: Proportion of RAG chunks contradicting the thesis that have been addressed in the text.

Scale: 0-10 | Weight: 5% | Agent: Validator (anti-bias configuration)

7.2. Evaluative Neutrality

Description: Appropriateness of tone according to standards of academic objectivity, avoiding hyperbole or unsubstantiated judgments.

Indicators: Sentiment analysis and detection of unjustified evaluative language.

Scale: 0-10 | Weight: 3% | Agent: Reviewer + Linguistic Analysis Tools

7.3. Corpus Representativeness

Description: Verification that the RAG corpus employed does not introduce selection biases that distort the conclusions.

Indicators: Diversity of sources (authors, theoretical traditions, publication years) in the effectively utilized chunks.

Scale: 0-10 | Weight: 3% | Agent: Human Researcher + Orchestration

DIMENSION 8: TRACEABILITY AND REPRODUCIBILITY

(Total Weight: 10%)

Overall Scope: Automated documentation of all pipeline phases, capacity to regenerate equivalent results, and process auditability.

Aggregate assessment of the three sub-dimensions

Agents: Orchestration + Human Researcher

8.1. Generation Pipeline Logging

Description: Automated documentation of all process phases: prompts used, temperature parameters, iterations performed.

Indicators: Completeness of the execution log; presence of all required metadata.

Scale: 0-10 | Weight: 5% | Agent: Orchestration (automated)

8.2. Result Reproducibility

Description: Capacity to regenerate a substantially equivalent text from the same corpus and configuration.

Indicators: Similarity index between independent runs with identical parameters.

Scale: 0-10 | Weight: 3% | Agent: Orchestration (replication tests)

8.3. Process Auditability

Description: Availability of the necessary information for a third party to verify the methodology employed.

Indicators: Presence of an explicit methodological statement; access to the pipeline log (with ethical restrictions).

Scale: 0-10 | Weight: 2% | Agent: Human Researcher (transparency statement)

SUMMARY OF WEIGHTS

Dimension 1: Documentary Fidelity (35%) + Dimension 2: Methodological Rigor (25%) + Dimension 3: Bibliographic Apparatus Quality (25%) + Dimension 4: Coherence and Argumentative Structure (13%) + Dimension 5: Originality and Novelty (20%) + Dimension 6: Expository Quality and Academic Style (11%) + Dimension 7: Robustness Against Bias (11%) + Dimension 8: Traceability and Reproducibility (10%)

MAXIMUM GLOBAL SCORE: 150 points


Normalization: Divide by 15 to obtain a 0-10 scale

Table 1. Example of the textual and content quality dimensions that can be evaluated

Each candidate version is evaluated by multiple agents (typically, Reviewer Agents configured with different editorial standards), and the scores are aggregated using a weighted average that reflects the relative importance of each dimension for the specific objective of the project.
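The arithmetic behind this aggregation can be sketched in a few lines. The weights are those of Table 1 (percentage points summing to 150, then divided by 15 for the 0-10 scale); the dictionary keys and function names are illustrative, not a fixed API.

```python
# Minimal sketch of the score aggregation implied by Table 1: each
# sub-dimension score (0-10) is weighted by its percentage points, summed
# to a 150-point total, and divided by 15 to return to a 0-10 scale.

WEIGHTS = {  # sub-dimension -> weight in percentage points (they sum to 150)
    "1.1": 20, "1.2": 10, "1.3": 5,    # documentary fidelity
    "2.1": 10, "2.2": 10, "2.3": 5,    # methodological rigor
    "3.1": 15, "3.2": 5,  "3.3": 5,    # bibliographic apparatus
    "4.1": 5,  "4.2": 5,  "4.3": 3,    # coherence and structure
    "5.1": 8,  "5.2": 7,  "5.3": 5,    # originality and novelty
    "6.1": 5,  "6.2": 3,  "6.3": 3,    # expository quality
    "7.1": 5,  "7.2": 3,  "7.3": 3,    # robustness against bias
    "8.1": 5,  "8.2": 3,  "8.3": 2,    # traceability and reproducibility
}

def global_score(scores: dict[str, float]) -> float:
    """Weighted sum of 0-10 sub-dimension scores; maximum is 150 points."""
    return sum(scores[k] * w / 10 for k, w in WEIGHTS.items())

def normalized_score(scores: dict[str, float]) -> float:
    """Normalize the 150-point maximum to the 0-10 scale (divide by 15)."""
    return global_score(scores) / 15
```

A version scoring 10 on every sub-dimension thus reaches exactly 150 points, or a normalized 10.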

5.4. The Automated Ranking as a Decision-Making Tool

The outcome of this multi-agent evaluation process is a weighted ranking of the candidate versions, which the system presents to the human researcher in the form of a structured report. This report may include, among other aspects:

  1. The overall score for each version on a 0–10 scale.
  2. A detailed breakdown by dimensions, enabling identification of specific strengths and weaknesses.
  3. A synthesis of the qualitative comments generated by the evaluating agents.
  4. An automated recommendation regarding the optimal version, accompanied by a statistical confidence level.
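The structured report described above can be sketched as a plain data aggregation. The class names, report fields, and the score-margin used as a crude confidence proxy are illustrative assumptions, not the system's actual output format.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    name: str
    dimension_scores: dict[str, float]          # dimension -> 0-10 score
    comments: list[str] = field(default_factory=list)

# Dimension weights in percentage points, as in Table 1 (sum = 150).
DIM_WEIGHTS = {"fidelity": 35, "rigor": 25, "bibliography": 25,
               "structure": 13, "originality": 20, "style": 11,
               "bias": 11, "traceability": 10}

def overall(c: Candidate) -> float:
    """Weighted 0-10 score across the eight dimensions."""
    total = sum(c.dimension_scores[d] * w for d, w in DIM_WEIGHTS.items())
    return total / sum(DIM_WEIGHTS.values())

def ranking_report(candidates: list[Candidate]) -> dict:
    """Rank candidate versions and assemble the structured report."""
    ranked = sorted(candidates, key=overall, reverse=True)
    best = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else ranked[0]
    return {
        "ranking": [(c.name, round(overall(c), 2)) for c in ranked],
        "breakdown": {c.name: c.dimension_scores for c in ranked},
        "comments": {c.name: c.comments for c in ranked},
        "recommended": best.name,
        "margin": round(overall(best) - overall(runner_up), 2),
    }
```

The `margin` field stands in for the statistical confidence level: the wider the gap between the top two candidates, the safer the automated recommendation.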

It is at this point that the transformation of the researcher’s role becomes most evident. The scientist no longer corrects a mediocre draft; rather, they select among several automatically generated, high-quality drafts. Their task is no longer the tedious correction of formal errors or the struggle with awkward expression, but the exercise of expert judgment to choose the version that best aligns with their vision of the scientific contribution.

6. Human Oversight as the Final and Indispensable Filter

6.1. From “Copy and Paste” to “Informed Approval”

At this stage, one might wonder whether the sophistication of the described pipeline reduces the human researcher to a mere spectator of an automated process. Nothing could be further from the truth. The complexity of the agent and automated evaluation system does not diminish the scientist’s responsibility; it reconfigures and, in a sense, elevates it.

It must be unequivocally clear: the orchestrated pipeline does not produce articles ready for publication without human intervention (although, in my view and based on what I am observing in the laboratory, very little remains before it does). It produces high-quality candidates that require final validation by an expert. The appropriate metaphor is not that of an autopilot replacing the aircraft commander, but that of a sophisticated assistance system that allows the pilot to focus on strategic decisions while the machine manages routine operations.

6.2. The Quality Bottleneck: Critical Reading and Expert Validation

The pipeline is designed so that the production flow converges on a deliberate bottleneck: the moment when the human researcher confronts the candidate text and exercises expert judgment. This phase is not a bureaucratic formality; it is the critical juncture where artificial intelligence passes the torch to human intelligence.

The tasks incumbent upon the researcher during this phase include:

  1. Comprehensive reading of the candidate text: This is not a superficial reading. The scientist must verify that the argument presented is sound, that conceptual connections are relevant, and that the proposed contribution is genuinely novel within the context of the field.
  2. Verification of internal logic: The researcher must ensure that the text contains no contradictions, unjustified logical leaps, or excessive simplifications that, although formally correct, betray the complexity of the phenomenon under study.
  3. Comparison with tacit knowledge: There are dimensions of scientific knowledge that are not—and cannot be—captured in the documentary corpus. The researcher brings their familiarity with informal debates, terminological nuances not explicitly addressed in the literature, and emerging trends not yet consolidated in formal publications.
  4. Ethical and deontological validation: The scientist is ultimately responsible for ensuring that the generated content complies with the ethical standards of academic research, including proper attribution of authorship, absence of plagiarism, and respect for intellectual property.
  5. Decision to approve or request additional iterations: If the candidate text does not meet the researcher’s criteria, they may request additional rounds of refinement, specifying to the system the particular aspects requiring improvement.
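The approve-or-iterate decision in task 5 above amounts to a simple control loop. As a sketch under assumed interfaces, `review` stands in for the researcher's judgment (returning a list of problems, empty on approval) and `refine` for the pipeline's refinement step; both names are placeholders.

```python
# Sketch of the final human checkpoint: the researcher either approves the
# candidate or returns it with targeted revision requests.

def human_validation_loop(candidate, review, refine, max_rounds=3):
    """review(text) -> list of issues; refine(text, issues) -> revised text."""
    for _ in range(max_rounds):
        issues = review(candidate)
        if not issues:
            return candidate              # informed approval
        candidate = refine(candidate, issues)
    raise RuntimeError("quality criteria not met after the allowed rounds")
```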

6.3. Radical Transparency: The Imperative to Declare the Digital Twin as an Ethical Obligation

A crucial aspect of the methodology I am describing is radical transparency regarding the use of AI tools. In every post on my website mblazquez.es and in every document generated with the assistance of the Digital Twin, I include, within the categories section, an explicit statement declaring that assistance. This practice is not merely an exercise in intellectual honesty, though it is that too; it is a methodological and deontological imperative that pursues several objectives:

  1. Avoid confusion: The reader must always know which part of the process has been assisted by AI and under what methodological conditions the text was produced.
  2. Ensure traceability: The declaration of the use of the Digital Twin is accompanied, to the extent possible, by information regarding the employed RAG corpus and pipeline configuration, enabling auditability of the process.
  3. Distinguish rigorous from negligent use: By explicitly stating the methodology employed, a clear line is drawn between the scientific application of generative AI and the improvised patchwork that the ABC article justifiably criticizes.
  4. Normalize the tool: Transparency helps dispel the perception of AI as a "cheating shortcut" and reinforces its recognition as a legitimate research support tool, analogous—though qualitatively distinct—to statistical software or bibliographic managers.
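Such a declaration can also be rendered machine-readable, which serves the traceability objective directly. The field names and values below are a hypothetical schema sketched for illustration, not a published metadata standard.

```python
import json

# Hypothetical machine-readable transparency declaration; every field name
# and value here is illustrative, not a published standard.
declaration = {
    "ai_assisted": True,
    "system": "Digital Twin (orchestrated multi-agent pipeline)",
    "rag_corpus": {
        "sources": ["Scopus", "Web of Science", "arXiv"],
        "processing": "chunking and vectorization",
    },
    "pipeline": {
        "agents": ["Ideator", "Writer", "Reviewer", "Citator", "Validator"],
        "log_available": "on request, subject to ethical restrictions",
    },
    "human_validation": "full critical reading and final approval",
}

print(json.dumps(declaration, indent=2))
```

Embedding such a block alongside a published text would let a third party see, at a glance, which corpus and pipeline configuration produced it.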

7. Final Synthesis: The Gap Between the Occasional User’s Hallucination and Scientifically Generated Knowledge

7.1. Two Scenarios, Two Methodological Universes

It is now time to recapitulate the journey undertaken and to formulate, with the utmost expository clarity, the fundamental contrast that underpins this article. I will present it through two clearly distinct scenarios.

Scenario A: The One Described in the ABC Article

  1. User: A researcher without specific methodological training in AI.
  2. Tool: A general-purpose conversational interface (ChatGPT or similar).
  3. Prompt: "Give me ten references on [topic]" or "Write an article on [topic]".
  4. Knowledge base: The model's training corpus, heterogeneous, unverified, and opaque.
  5. Process: Single-generation output, without quality control iterations.
  6. Validation: Nonexistent or limited to a superficial review by the user.
  7. Result: Formally plausible text that may contain bibliographic hallucinations, factual inaccuracies, and inadvertent biases.
  8. Consequence: Academic failure, retraction of articles, erosion of trust in the scientific communication system.

This is the scenario that, with full legitimacy, has triggered media alarms. But it is not the only possible scenario. It is not even the scenario in which we operate when applying scientific methodology to the use of these tools.

Scenario B: The Documentation Sciences Perspective

  1. User: Researcher trained in documentary methodology applied to AI.
  2. Tool: Orchestrated pipeline of specialized agents with an integrated RAG system.
  3. Knowledge Base: A curated document corpus assembled by the researcher, composed of verified academic literature (Scopus, WoS, arXiv) and processed via chunking and vectorization.
  4. Process:

- Generation of N hypotheses by the Ideator Agent.

- Iterative textual development (M iterations) by the Editor Agent.

- Evaluation by the Reviewer Agent according to Q1 journal standards.

- Automated revision cycles until quality thresholds are exceeded.

- Citation via Citation Agent with traceability to RAG fragments.

- Reference validation via Validation Agent.

- Weighted ranking of candidate versions.

  5. Human validation: Critical reading, verification of internal logic, comparison with tacit knowledge, final approval.
  6. Transparency: Explicit declaration of the use of the Digital Twin and the methodology employed.
  7. Outcome: A rigorous academic text with complete documentary traceability, grounded in verified sources and validated by criteria analogous to peer review.
  8. Consequence: Verified scientific knowledge, controlled acceleration of academic production, and liberation of the researcher for higher-value tasks.
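The Scenario B production flow can be condensed into a pipeline skeleton. Every agent is passed in as a plain function; the names, signatures, and quality threshold are assumptions for illustration, and a real implementation would back them with an orchestration framework and a vectorized RAG corpus.

```python
# Skeleton of the Scenario B flow: N hypotheses, iterative drafting with
# automated revision cycles, citation, validation, and weighted ranking.

def run_pipeline(topic, corpus, ideator, writer, reviewer, citator, validator,
                 n_hypotheses=3, max_iterations=5, threshold=8.0):
    candidates = []
    for hypothesis in ideator(topic, corpus, n=n_hypotheses):   # N hypotheses
        draft = writer(hypothesis, corpus)                      # first draft
        for _ in range(max_iterations):                         # revision cycles
            score, feedback = reviewer(draft)                   # Q1 standards
            if score >= threshold:
                break
            draft = writer(hypothesis, corpus, feedback=feedback)
        draft = citator(draft, corpus)            # citations traceable to chunks
        if validator(draft, corpus):              # reject unverifiable references
            candidates.append((reviewer(draft)[0], draft))
    # Ranking of candidate versions, best first, for human validation.
    return sorted(candidates, key=lambda pair: pair[0], reverse=True)
```

The ranked output is exactly what the human researcher receives in step 5: not a finished article, but a shortlist of validated candidates awaiting expert judgment.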

7.2. AI does not replace the documentary researcher; it demands a much more sophisticated documentary researcher

The conclusion emerging from this analysis is as clear as it is challenging. Contrary to certain apocalyptic narratives predicting the obsolescence of the human researcher, the rigorous application of generative AI to academic production does not diminish the demand for scientific competence, but elevates it to a new level of sophistication. The researcher seeking to employ these tools with methodological rigor must acquire competencies that far exceed those traditionally required in doctoral training:

  1. Documentation Science Competencies: Understanding of information retrieval principles, management of documentary corpora, and knowledge organization systems.
  2. Prompt Engineering Competencies: Ability to design system prompts that appropriately configure the behavior of AI agents, anticipating biases and optimizing response quality.
  3. Pipeline Orchestration Competencies: Knowledge of frameworks that enable chaining agents, managing information flows, and automatically evaluating result quality.
  4. Validation and Monitoring Competencies: Development of robust criteria for the critical evaluation of automatically generated texts, beyond mere formal correctness.
  5. Ethical and Deontological Competencies: Understanding of the implications of AI use in academic production and adoption of radical transparency practices.

Far from the image of the passive user who delegates intellectual responsibility to a black box, the documentary scientist working with generative AI more closely resembles an orchestra conductor coordinating a complex textual production machinery, or a chief editor overseeing a team of specialized writers and reviewers.

7.3. Beyond the Media Anecdote

The ABC article with which we began this journey fulfills an undeniable social function: alerting readers to the risks of the uncritical, instrumental use of tools whose inner workings they do not understand. In this sense, science journalism provides a valuable service to the academic community and to society at large.

However, analysis cannot stop at the anecdote. It falls to those of us working at the intersection of Documentation Science and Artificial Intelligence to provide the methodological counterpoint that allows us to transcend alarmism and construct a nuanced discourse on the role of these technologies in knowledge production.

Generative AI is here to stay. Its integration into academic workflows is now an irreversible reality, regardless of calls for caution or even moratoria. The pertinent question is not whether we will employ these tools, but how we will employ them. And the answer to this question depends entirely on the methodological rigor with which we approach their implementation.

Between the hallucination of the casual user and scientifically generated knowledge lies a methodological chasm that this article has sought to map for its readers. It now falls upon the academic community, and especially the faculties of Documentation, to review curricula and undertake the reforms necessary to lead the training of current and future information professionals and researchers, so that they may navigate this territory with rigor. The alternative, allowing negligent practice to become the norm, is unacceptable both for the integrity of science and for the credibility of our institutions.

References

  1. Echazarreta, B. L. (2026, April 9). Academic journals fill up with AI “hallucinations”: “They have cited works of mine that do not exist.” ABC. https://www.abc.es/sociedad/revistas-academicas-llena-20260409025937-nt.html
  2. LangChain. (2026). LangChain. https://www.langchain.com
  3. Linardon, J., Jarman, H. K., McClure, Z., Anderson, C., Liu, C., & Messer, M. (2025). Influence of topic familiarity and prompt specificity on citation fabrication in mental health research using large language models: Experimental study. JMIR Mental Health, 12, e80371. https://doi.org/10.2196/80371
  4. LlamaIndex. (2026). LlamaIndex. https://www.llamaindex.ai
  5. Microsoft. (n.d.). Semantic Kernel (Version 1.0.1). GitHub. https://github.com/microsoft/semantic-kernel
  6. Naddaf, M., & Quill, E. (2026). Hallucinated citations are polluting the scientific literature. What can be done? Nature, 652(8108), 26–29. https://doi.org/10.1038/d41586-026-00969-z