Systematic reviews (SRs) constitute the methodological standard for evidence synthesis in disciplines such as medicine, psychology, education, and the social sciences. Their validity depends on comprehensive searching, transparent study selection, critical quality assessment, and integrated synthesis of findings. However, the exponential growth of scientific output—over two million articles annually across databases such as PubMed, Scopus, and Web of Science—has increased the operational burden, prolonging timelines and exposing the process to errors caused by fatigue and cognitive biases. Any optimization of the workflow through automated technologies must therefore preserve epistemological rigor and methodological integrity.

Large language models (LLMs) and semantic retrieval systems offer capabilities to handle repetitive tasks in the SR process. However, their implementation cannot be reduced to replacing human agents with algorithms. Experience in document automation shows that opaque systems—lacking traceability or human oversight—generate epistemological risks: biases inherent in training data, false positives in classification, erroneous data extraction, and decontextualized synthesis. The solution is not full automation, but a hybrid architecture in which AI acts as an operational assistant while the researcher retains final responsibility for critical decisions.

This collaborative model is not novel in principle. Since the inception of documentary computing, keyword-based retrieval systems and automatic classifiers have always required human validation. What has changed is the scale and sophistication: models such as GPT, Llama, or Claude can process texts in multiple languages with accuracy surpassing the human average in bounded critical-reading tasks. However, their knowledge is limited by the training window, their reasoning is probabilistic, and they lack intentionality and contextual awareness. The validity of an SR does not depend on the degree of automation, but on the location of the human control boundary.

Recent studies in computer science and information science indicate that, even in technical domains where models outperform humans in precision, final interpretation still requires expert validation. Likewise, computer vision models achieve high efficacy in detecting AI-generated images, but their generalization depends on the quality of the training set and on human-defined decision thresholds. In all cases, optimal performance is achieved when AI acts as a high-sensitivity filter and the human as a verifier of precision and meaning. This dynamic applies directly to the systematic review process.

A hybrid workflow structured in three phases—search, screening, and synthesis—is proposed, where AI assumes the operational burden under human supervision, validation, and decision-making. Each phase is analyzed considering the current technical capabilities of models, their epistemological limitations, and the protocols necessary to ensure reproducibility, transparency, and scientific integrity. The goal is not to replace expert judgment, but to enhance it through asymmetric collaboration: AI handles repetitive tasks; humans handle interpretive ones.

The Search Phase: From Boolean Operators to Semantic Retrieval with Human Feedback

Systematic search is the critical first stage of an SR, aiming to identify all relevant studies and exclude irrelevant ones according to criteria pre-defined in the protocol. Traditionally, it relies on strategies built from Boolean operators (AND, OR, NOT), controlled vocabularies (MeSH, Emtree), and the researcher’s experience in formulating complex queries. These strategies have clear limits: they depend on terminological precision, ignore semantic relationships, and require multiple iterations to balance sensitivity and specificity.

Contextual representation models enable searches based on natural language (e.g., “What is the effect of cognitive behavioral therapy on adolescents with social anxiety?”), overcoming the constraints of Boolean operators. Platforms such as Rayyan, Covidence, or DistillerSR integrate semantic embeddings that identify synonyms, hyponyms, and conceptual relationships. A query such as “antidepressants in children” can retrieve studies using “pharmacological treatment for pediatric depression” without requiring the explicit inclusion of every terminological variant.

The key advantage lies in learning from human relevance feedback. Rather than executing a static query, the researcher begins with an initial query, reviews the first results, and labels articles as relevant or irrelevant; the model dynamically adjusts the ranking based on these signals. Studies show that this iterative approach improves precision compared to static searches, even with less sophisticated models.
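
A minimal sketch of this feedback loop, using cosine similarity over document embeddings and a Rocchio-style query update (a classical relevance-feedback technique); the random vectors stand in for embeddings that a real encoder would produce, and the labelled indices are illustrative:

import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder output: 8 candidate documents and one query, 384-d.
doc_vecs = rng.normal(size=(8, 384))
query_vec = rng.normal(size=384)

def cosine_rank(query, docs):
    """Rank documents by cosine similarity to the query vector."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))

def rocchio_update(query, docs, relevant, irrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Shift the query toward human-labelled relevant documents and away from irrelevant ones."""
    new_query = alpha * query
    if relevant:
        new_query += beta * docs[relevant].mean(axis=0)
    if irrelevant:
        new_query -= gamma * docs[irrelevant].mean(axis=0)
    return new_query

ranking = cosine_rank(query_vec, doc_vecs)
# The reviewer labels two of the retrieved items; the indices are illustrative.
query_vec = rocchio_update(query_vec, doc_vecs,
                           relevant=[int(ranking[0])], irrelevant=[int(ranking[-1])])
ranking = cosine_rank(query_vec, doc_vecs)  # re-ranked after feedback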

However, models may confuse homonyms, misinterpret intentions, or prioritize articles by citation count rather than methodological relevance. For example, a query on “effects of exercise on depression” might favor observational studies with large samples, excluding rigorous randomized clinical trials that are less cited. Moreover, models may reflect biases present in the existing literature: if most studies originate from high-income countries, research from the Global South may be systematically underrepresented. To mitigate these risks, a three-level validation protocol is recommended:

  1. Level 1: Definition of conceptual scope. Before the search, the team must develop a semantic map of key concepts, including synonyms, abbreviations, and terms in multiple languages (see the sketch after this list). This map guides the model’s query generation.
  2. Level 2: Iteration with controlled feedback. At least three rounds of search with human feedback are conducted, each using a distinct set of articles marked as relevant. All decisions must be documented for subsequent audit.
  3. Level 3: Exhaustiveness verification. Following the automated search, a complementary manual search is conducted in specialized databases (LILACS, SciELO, Dialnet) and thesis repositories. If key articles have been omitted by the system, the strategy is recalibrated.
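
The Level 1 semantic map can be maintained as a simple, version-controlled data structure that both guides query generation and documents the conceptual scope. A minimal sketch in Python (the concepts and terms are illustrative):

# Illustrative semantic map for a question on CBT and adolescent social anxiety;
# versioned alongside the protocol and used to generate query variants.
semantic_map = {
    "cognitive_behavioral_therapy": {
        "synonyms": ["CBT", "cognitive behaviour therapy", "cognitive-behavioral intervention"],
        "es": ["terapia cognitivo-conductual", "TCC"],
        "pt": ["terapia cognitivo-comportamental"],
        "controlled_terms": ["Cognitive Behavioral Therapy"],  # e.g., MeSH
    },
    "social_anxiety": {
        "synonyms": ["social phobia", "social anxiety disorder"],
        "es": ["ansiedad social", "fobia social"],
        "pt": ["ansiedade social"],
        "controlled_terms": ["Phobia, Social"],  # e.g., MeSH
    },
}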

The system must not operate as a black box. Each retrieved article must be accompanied by an explanation of its relevance: Which terms matched? What semantic relationship was established with the research question? Traceability is essential for methodological transparency and replicability. Platforms that integrate explainable AI (XAI) techniques into information retrieval offer a reference model.

The process documentation must be comprehensive: each query, iteration, and adjustment must be recorded in a standardized format (JSON-LD or BibTeX with extended metadata) and archived alongside the results. This complies with PRISMA standards and enables future audits or reuse of the strategy.
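
A sketch of what one such log entry could look like, serialized with the Python standard library; the field names are an illustrative minimum rather than a fixed schema:

import json
from datetime import datetime, timezone

# One record per query iteration, appended to an audit file (JSON Lines).
query_log_entry = {
    "query": "cognitive behavioral therapy AND adolescents AND social anxiety",
    "iteration": 2,
    "modified_by": "reviewer-01",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "feedback": {"relevant": ["doi:10.1234/example-1"],      # illustrative identifiers
                 "irrelevant": ["doi:10.1234/example-2"]},
}

with open("search_log.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(query_log_entry, ensure_ascii=False) + "\n")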

Title and Abstract Screening: From Binary Classification to Contextual Inference

After article retrieval, the screening phase begins. Traditionally, this stage requires independent reading of titles and abstracts by two reviewers, with inter-reviewer agreement (Cohen’s kappa) typically ranging between 0.6 and 0.8, which implies discrepancies that must be resolved by a third reviewer. AI can assume this task with superior efficiency, but only if it is designed as an aid, not a replacement.

Supervised classification models (e.g., BERT or SciBERT) trained on human-labeled datasets achieve accuracies exceeding 90% in binary tasks. However, accuracy alone is insufficient. Relevance must be assessed with logical coherence and contextual consistency. An article may mention “clinical trial” yet be a case study without a control group; a keyword-based model would incorrectly classify it as relevant.

The hybrid proposal implements a two-level screening process. At the first level, the AI classifies articles according to the protocol criteria, issuing not only a binary label but also a confidence score and an explanation based on influential tokens (e.g., “‘randomized’ detected in abstract,” “population: older adults”). At the second level, human reviewers evaluate only the articles with intermediate confidence scores (0.6–0.85) or those contradicting disciplinary expectations. According to pilot studies in public health, this strategy reduces workload by up to 70% without compromising sensitivity.
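
The triage rule itself is straightforward to encode. A sketch, assuming the confidence band described above; the thresholds would be fixed in the review protocol before screening begins:

AUTO_DECIDE_LOW = 0.6    # below this, the AI's label is accepted without review
AUTO_DECIDE_HIGH = 0.85  # above this, the AI's label is accepted without review

def triage(article_id: str, ai_label: str, confidence: float) -> str:
    """Route a screened article: accept the AI label or queue it for human review."""
    if confidence < AUTO_DECIDE_LOW or confidence > AUTO_DECIDE_HIGH:
        return f"{article_id}: accept AI label '{ai_label}' (confidence {confidence:.2f})"
    return f"{article_id}: human review required (confidence {confidence:.2f})"

print(triage("doi:10.1234/example-1", "include", 0.91))  # accepted automatically
print(triage("doi:10.1234/example-2", "include", 0.72))  # routed to a reviewer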

The model must be adapted through active learning: each human correction feeds the system as new training data, progressively improving its discrimination without full retraining (a code sketch of this incremental update follows the list below). The AI evolves alongside the team. Critical risks include the internalization of biases present in the training data (e.g., dismissal of non-indexed journals), failures in specific linguistic or cultural contexts (e.g., colloquial language in Brazilian studies), and the inability to assess ethical implications: an article may be methodologically sound yet violate fundamental ethical principles. To mitigate these risks, the following is recommended:

  1. Define explicit ethical exclusion criteria. For example: “exclude studies without informed consent” or “exclude interventions in vulnerable populations without ethical safeguards.” These criteria must be programmed as immutable logical rules.
  2. Implement cultural bias review. The team must periodically review articles rejected by the AI to identify systematic patterns (e.g., recurrent exclusion of texts in Spanish or Portuguese).
  3. Document all exclusion decisions. Each excluded article must record the reason: automated classification, human decision based on methodological criteria, or duplication.
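
Returning to the active-learning loop described above, a minimal sketch of the incremental update using scikit-learn’s SGDClassifier, whose partial_fit method updates the model without retraining from scratch; the random vectors stand in for abstract embeddings, and all labels are illustrative:

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)

# Stand-ins for abstract embeddings (e.g., from SciBERT) with seed labels
# produced by the human reviewers on an initial training set.
X_seed = rng.normal(size=(40, 64))
y_seed = rng.integers(0, 2, size=40)

clf = SGDClassifier(loss="log_loss", random_state=0)
clf.partial_fit(X_seed, y_seed, classes=np.array([0, 1]))

# Each batch of human corrections updates the classifier incrementally.
X_corrections = rng.normal(size=(5, 64))
y_corrections = np.array([1, 0, 1, 1, 0])  # reviewer-validated labels
clf.partial_fit(X_corrections, y_corrections)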

The system must generate an automated screening report detailing the number of articles evaluated, how many were classified by the AI, how many were reviewed by humans, and the discrepancies identified and how they were resolved. This report must be integrated into the SR registry for external audits.

Data Extraction: From Manual Reading to Structured Inference with Consensus Verification

Following study inclusion, data extraction begins: the systematic collection of information on methodology, population, intervention, outcomes, and quality. Traditionally, this is performed using standardized forms (Cochrane, PRISMA) completed manually, which is slow, prone to transcription errors, and susceptible to inter-reviewer variability.

LLM-based information extraction (IE) models can identify key entities: sample size, type of intervention, quantitative outcomes (OR, RR, 95% CI), and measurement instruments. For example, given the abstract “In a randomized trial of 120 participants, the intervention group showed a mean reduction of 4.2 points (95% CI: -6.1 to -2.3) on the Hamilton Anxiety Scale,” a well-trained model can extract the following (a code sketch follows the list):

  1. Study design: randomized trial
  2. Sample size: 120
  3. Intervention: the intervention group (treatment not named in the abstract)
  4. Outcome: reduction on the Hamilton Anxiety Scale
  5. Effect: -4.2 (mean change)
  6. 95% CI: [-6.1, -2.3]
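
A sketch of how this extraction step could be scripted against a local model served by Ollama (the stack suggested in the implementation section below); the prompt, field names, and model tag are assumptions for illustration, and every returned value still requires human verification:

import json
import requests

ABSTRACT = ("In a randomized trial of 120 participants, the intervention group "
            "showed a mean reduction of 4.2 points (95% CI: -6.1 to -2.3) "
            "on the Hamilton Anxiety Scale.")

PROMPT = ("Extract from the abstract below a JSON object with the fields: "
          "design, sample_size, outcome_measure, effect, ci_95. "
          "Use null for anything not stated.\n\nAbstract: " + ABSTRACT)

# Ollama serves local models on port 11434; "format": "json" constrains the output.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:8b", "prompt": PROMPT, "format": "json", "stream": False},
    timeout=120,
)
extracted = json.loads(resp.json()["response"])
print(extracted)  # e.g., {"design": "randomized trial", "sample_size": 120, ...}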

Platforms such as RobotReviewer, ASReview, or Rayyan’s extraction module achieve accuracies above 85% in well-structured clinical studies, but their performance declines in the social sciences and in disciplines with complex narratives. The issue is not merely technical accuracy, but semantic interpretation. What does “a reduction of 4.2 points” mean? Is the scale 0–10 or 0–100? Is the change absolute or relative? AI does not comprehend numerical context or measurement scale. Therefore, automated extraction must be followed by mandatory human verification.

A “consensus-based extraction” model is proposed: AI presents data in a structured interface; the human reviewer validates or corrects it. If two reviewers disagree, the system flags the field as “disputed” and requires resolution by a third party. This approach reduces extraction time by more than 60% without compromising quality.

Additionally, logical verification must be implemented: if the AI extracts that “the intervention group had n=10 and the control group n=5,” but the original text states that both groups were equal, the system must issue an alert. This requires models to comprehend logical relationships within the text (e.g., “70% were assigned to group A, implying that 30% were in group B”).
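
Rules of this kind can be encoded as deterministic checks on the extracted fields, independent of the language model. A minimal sketch for the group-allocation example:

def check_allocation(n_group_a: int, n_group_b: int,
                     n_total: int | None = None,
                     groups_reported_equal: bool = False) -> list[str]:
    """Flag extracted group sizes that contradict statements in the source text."""
    alerts = []
    if n_total is not None and n_group_a + n_group_b != n_total:
        alerts.append(f"Group sizes {n_group_a}+{n_group_b} do not sum to N={n_total}")
    if groups_reported_equal and n_group_a != n_group_b:
        alerts.append(f"Text reports equal groups, but extraction gives {n_group_a} vs {n_group_b}")
    return alerts

# The original text states both groups were equal, but extraction yielded n=10 vs n=5:
print(check_allocation(10, 5, groups_reported_equal=True))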

AI can also detect inconsistencies across studies. If three studies report RR=1.2 and one reports RR=3.5, the system should suggest manual review of the discrepant study to verify whether the discrepancy stems from an extraction error or a genuine finding. This capability is particularly valuable in reviews with high heterogeneity. To ensure integrity, every extracted datum must be linked to its textual source. The system must generate an “evidence map” showing the exact article fragment used for each extraction, exportable in formats such as BRAT or standoff annotation, and stored alongside the structured data.
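
One simple way to operationalize this screen is a robust outlier check on the log scale of the effect estimates; the median absolute deviation used here is one of several reasonable choices, and the values are illustrative:

import numpy as np

def flag_discrepant(effects: dict[str, float], z_cut: float = 3.0) -> list[str]:
    """Flag studies whose log-effect deviates strongly from the median (robust z-score)."""
    names = list(effects)
    log_rr = np.log([effects[n] for n in names])
    med = np.median(log_rr)
    mad = np.median(np.abs(log_rr - med))
    if mad == 0:
        mad = 1e-9  # degenerate case: all non-outlying values are identical
    robust_z = 0.6745 * (log_rr - med) / mad
    return [n for n, z in zip(names, robust_z) if abs(z) > z_cut]

rr = {"study_1": 1.2, "study_2": 1.2, "study_3": 1.2, "study_4": 3.5}
print(flag_discrepant(rr))  # ['study_4'] -> queue for manual re-extraction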

In qualitative or mixed-methods reviews, AI can identify recurring themes through semantic clustering. For instance, if multiple studies mention “fear of stigma,” “difficulty accessing services,” and “lack of family support,” the AI can group them under an emergent theme such as “social barriers to treatment.” However, interpretation—naming, meaning, hierarchy—must be performed by the research team. The AI suggests; the human decides.
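
A sketch of such clustering; in production the fragments would be encoded with a semantic embedding model, while TF-IDF is used here only to keep the example self-contained, and the fragments themselves are illustrative:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Extracted qualitative fragments (illustrative).
fragments = [
    "fear of stigma when seeking help",
    "participants feared being stigmatized by peers",
    "difficulty accessing mental health services",
    "long waiting lists limited access to services",
    "lack of family support during treatment",
    "families did not support continued therapy",
]

X = TfidfVectorizer().fit_transform(fragments)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for cluster in sorted(set(labels)):
    members = [f for f, lab in zip(fragments, labels) if lab == cluster]
    print(f"Candidate theme {cluster}: {members}")  # naming the theme remains a human task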

Narrative and Quantitative Synthesis: AI as a Draft Writer, Not an Author of Conclusions

Synthesis integrates evidence to answer the research question. In quantitative reviews, it involves meta-analysis; in qualitative reviews, it entails thematic or narrative synthesis. AI can assist in both, but its role must be strictly auxiliary. In meta-analysis, models can automate data preparation: calculating pooled effects, generating forest plots, detecting publication bias through funnel plots, and exploring heterogeneity through subgroup analyses. Environments such as R (e.g., the metafor package), RevMan, or JASP allow integration of data extracted by AI into statistical workflows. However, interpretation (What does a small but significant effect mean? Is it clinically relevant? Is publication bias real or an artifact of the design of the included studies?) requires disciplinary expertise.
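
As a concrete illustration of the data-preparation step, a fixed-effect inverse-variance pooling can be computed directly from the verified per-study effects and standard errors; the numbers are illustrative, and in practice a vetted package (such as metafor in R) would be used:

import numpy as np

# Illustrative per-study effects (e.g., mean differences) and standard errors,
# extracted by the AI and verified by human reviewers in the previous phase.
effects = np.array([-4.2, -3.1, -0.8, -2.6])
se = np.array([0.97, 1.10, 0.60, 0.85])

weights = 1.0 / se**2                       # inverse-variance weights
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))
ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se

print(f"Fixed-effect pooled estimate: {pooled:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
# Whether this pooled effect is clinically meaningful remains a disciplinary judgment.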

In narrative synthesis, AI can generate drafts of sections such as “Characteristics of Included Studies” or “Key Findings,” combining extracted data with predefined structures. For example: “Twelve studies involving a total of 2,845 participants were included. The primary intervention was cognitive behavioral therapy (n=9), followed by mindfulness (n=3). The mean sample size was 237 participants (range: 45–680).” This synthesis is accurate but lacks interpretation. Why is it relevant that most studies used CBT? What does the wide range in sample size imply? AI cannot answer these questions. It merely describes what is present in the data. Its function should be to draft a structured first version; the human researcher transforms it into a critical narrative. This process is analogous to using tools like Grammarly or Hemingway: they serve as a starting point, not the final product.
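
Such descriptive drafts can even be generated deterministically from the verified extraction table, which keeps every AI-written sentence traceable to the data. A template sketch (the records are illustrative):

# Verified extraction records (illustrative).
studies = [
    {"intervention": "CBT", "n": 120},
    {"intervention": "CBT", "n": 45},
    {"intervention": "mindfulness", "n": 237},
]

total_n = sum(s["n"] for s in studies)
sizes = sorted(s["n"] for s in studies)
draft = (f"{len(studies)} studies involving a total of {total_n} participants were included. "
         f"Sample sizes ranged from {sizes[0]} to {sizes[-1]}.")
print(draft)  # a structured first version; the critical narrative is written by the researcher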

Moreover, AI can identify patterns of heterogeneity that are not immediately apparent. For instance, if larger studies report null effects while smaller ones report positive effects, this suggests a possible relationship between sample size and effect magnitude. This implication must be investigated by the research team: Is it a methodological artifact? Publication bias? A genuine difference?

In qualitative reviews, models can cluster thematic citations using semantic embeddings and generate conceptual maps. However, interpretive coding—determining which theme is central, which is secondary, and how they interrelate—requires hermeneutic understanding. AI can state: “There are 14 citations related to ‘loneliness’ and 9 related to ‘social support.’” The human researcher must conclude: “Loneliness is the central experience, and social support functions as a mediator, not a solution.” The final synthesis cannot be generated by AI; it must be written by the researcher, supported by tools that reduce cognitive load and enhance coherence. AI is the draft writer; the human is the author of interpretation.

Transparency, Auditing, and Reproducibility: The Ethical Framework of the Hybrid Copilot

The adoption of AI in systematic reviews cannot be opaque. Science demands traceability, and systematic reviews are a cornerstone of evidence-based practice precisely because of their transparent, traceable process. A hybrid workflow must adhere to higher standards of documentation, not lower ones.

We propose an integrated traceability protocol that includes:

  1. Query logging: every version of the search query, with the date and the user who modified it.
  2. Classification logging: a record of each article classified by the AI, including its confidence score and explanation.
  3. Human decision logging: every correction, exclusion, or inclusion made by a reviewer, linked to their identity and date.
  4. Extraction mapping: the linkage between each extracted datum and the original textual fragment from the article.
  5. Model versioning: clear identification of the AI model used (name, version, parameters), as well as the training or fine-tuning dataset applied.

This log must be stored in an open and standardized format (JSON-LD or RDF) and archived in open-access repositories (Zenodo, Figshare) alongside the final report. This enables replication of the process, auditing of decisions, or reuse of data for future reviews. Additionally, a protocol for external auditing must be established. Before publication, an independent third party must review the hybrid workflow: Was the protocol followed? Were all decisions documented? Did the AI operate within defined boundaries? This audit serves as an epistemic guarantee. The AI must not be a “black box” escaping control, but rather a transparent and audited component.

Ethics requires acknowledging the limits of technology. Not all studies are suitable for automation. In disciplines with high narrative, cultural, or linguistic complexity—such as anthropology, history, or philosophy—the AI may be inadequate or harmful if applied without critical scrutiny. The proposal is not universal; it is contextual.

On the other hand, the illusion of objectivity should be avoided. AI is not neutral. It is trained on data produced by humans, which reflect historical, linguistic, and geopolitical biases. A model trained on PubMed may systematically ignore scientific literature outside the English-speaking world. A screening system trained on U.S.-based studies may fail to recognize culturally specific interventions originating from other regions, journals, or scientific dissemination channels. Transparency alone is insufficient: constant critical evaluation of the data, the models, and the decisions made with them is required.

The Researcher’s Training: From User to AI Manager

Adopting a hybrid workflow entails not only the implementation of technological tools but also a transformation in the training of researchers. The information professional who uses AI as a copilot cannot remain a passive user; they must become a manager of AI systems. This requires new competencies:

  1. Understanding the limits of AI. Knowing when to trust and when to doubt. Recognizing that 90% accuracy is insufficient if the cost of a false negative is high.
  2. Ability to design validation protocols. It is not enough to use a tool; one must define how its output will be verified.
  3. Knowledge of metadata and standardization. Understanding how data are structured so that AI can process them correctly.
  4. Critical interpretation skills. Ability to read AI explanations, question its inferences, and correct its errors.

Universities and research centers must integrate these competencies into research methodology training programs. Courses on “AI for Systematic Reviews” should be strongly recommended in graduate programs. The goal is to train critical researchers who can collaborate with LLMs without deferring uncritically to algorithmic output. AI does not replace disciplinary competence or the researcher’s capacity for reflection and critical analysis. The proposal works best when humans define the purpose, direction, and pathway of the research, drawing on a solid foundation of disciplinary knowledge to know what questions to ask, how to validate results, and when to be skeptical. This makes it a challenging, yet not impossible, endeavor.

Practical Implementation: An Operational Protocol for Hybrid Workflows in Systematic Reviews

The transition from a theoretical model to operational practice requires a standardized, scalable, and auditable protocol. Below is a proposal for implementing this framework in academic or institutional settings, grounded in real tools, open standards, and documented best practices.

Steps for implementation:

  1. Define the AI protocol with the team’s signature. Before initiating the search, the research team could sign a document detailing: (a) which tasks will be automated; (b) which decisions are exclusively human; and (c) the confidence thresholds for human intervention (e.g., the AI’s label is accepted without review when its confidence score falls below 0.6 or above 0.85, while scores between 0.6 and 0.85 require human validation). This document is attached as an appendix to the SR protocol.
  2. Configure the work environment with open and auditable tools. A technology stack based on free and standardized software is recommended:
     - Semantic search using an embeddings API.
     - Screening and classification using software that supports active learning with models such as BERT, logistic regression, or SVM, and that exports processing logs in JSON-LD format to enable subsequent performance analysis.
     - Data extraction using local LLMs such as Llama 3 8B via Ollama.
     - Storage and traceability, recording all logs in JSON-LD format according to the W3C Web Annotation schema.
  3. Generate and archive the decision log in JSON-LD. Each human or automated action must be recorded with minimal metadata:
{
  "@context": "https://www.w3.org/ns/anno.jsonld",
  "id": "urn:rs:2025:decision:001",
  "type": "Annotation",
  "body": {
    "value": "Exclude by criterion 4.2: not a controlled trial",
    "purpose": "classification"
  },
  "target": "https://doi.org/10.1016/j.jpsychores.2023.110567",
  "creator": {
    "name": "Dr. Ana López",
    "orcid": "https://orcid.org/0000-0002-1825-0097"
  },
  "created": "2025-03-15T10:22:00Z",
  "agent": {
    "type": "Software",
    "name": "ASReview v1.8.2",
    "version": "1.8.2",
    "model": "scibert",
    "confidence": 0.73
  },
  "reason": "The model detected 'randomized' and 'control group' in the abstract, but random assignment is not mentioned in the methods."
}
  4. Automate the generation of the audit-trail report. Use a Python script to consolidate all logs into a single audit-trail.jsonld file and generate a summary in PDF format using weasyprint:
import json
from weasyprint import HTML

# Load the consolidated decision logs (here, an ASReview export in JSON).
with open('asreview_logs.json', 'r', encoding='utf-8') as f:
    logs = json.load(f)

# Build one list item per decision: target article, decision, and model confidence.
items = "".join(
    f"<li>{log['target']} — {log['body']['value']} "
    f"(Confidence: {log.get('agent', {}).get('confidence', 'N/A')})</li>"
    for log in logs
)
html_content = (f"<h1>SR Audit Report</h1>"
                f"<p>Total decisions: {len(logs)}</p>"
                f"<ul>{items}</ul>")

# Render the HTML summary to PDF.
HTML(string=html_content).write_pdf("audit_report.pdf")
  5. Validate linguistic and geographic coverage. Bibliographic endpoints that expose Dublin Core metadata can be queried with SPARQL to check how languages are distributed in the retrieved corpus; open collections such as the PMC Open Access Subset offer complementary reference material. A sketch of such a query (the dc: vocabulary is assumed):
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?language (COUNT(?article) AS ?n) WHERE {
  ?article dc:title ?title ;
           dc:language ?language .
  FILTER CONTAINS(LCASE(STR(?title)), "human genome")
}
GROUP BY ?language
     If relevant studies are found to be missing, the team must return to the major scientific databases, recalibrate the strategy, and complete the search documentation.
  6. Conduct a pre-publication external audit. Assign an independent researcher (not involved in the SR) the role of “AI auditor,” who must verify:
     - that all logs are archived and accessible on Zenodo;
     - that no human decision has been overlooked due to automation error;
     - that the model used has not been fine-tuned with undisclosed data.
     The auditor signs an integrity certificate that accompanies the article.

This protocol is not a magic recipe, but a minimum framework for operating responsibly. Its adoption requires a culture of documentation, not merely technology. AI does not improve SR by itself; it does so when integrated into rigorously documented, audited, and human-supervised processes.

References

  1. Anjum, K., Arshad, M. A., Hayawi, K., Polyzos, E., Tariq, A., Serhani, M. A., ... & Shahriar, S. (2025). Domain specific benchmarks for evaluating multimodal large language models. arXiv preprint arXiv:2506.12958. https://doi.org/10.48550/arXiv.2506.12958
  2. Brîncoveanu, C., Carl, K. V., Witzki, A., & Hinz, O. (2026). Augmenting systematic literature reviews: A human-AI collaborative framework. In German Conference on Artificial Intelligence (Künstliche Intelligenz) (pp. 3–17). Springer, Cham. https://doi.org/10.1007/978-3-032-02813-6_1
  3. Correia, A., Grover, A., Jameel, S., Schneider, D., Antunes, P., & Fonseca, B. (2023). A hybrid human–AI tool for scientometric analysis. Artificial Intelligence Review, 56(Suppl 1), 983–1010. https://doi.org/10.1007/s10462-023-10548-7
  4. Ni, S., Chen, G., Li, S., Chen, X., Li, S., Wang, B., ... & Yang, M. (2025). A survey on large language model benchmarks. arXiv preprint arXiv:2508.15361. https://doi.org/10.48550/arXiv.2508.15361
  5. Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., ... & Moher, D. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, 372, n71. https://doi.org/10.1136/bmj.n71
  6. Peters, D., Vold, K., Robinson, D., & Calvo, R. A. (2020). Responsible AI—two frameworks for ethical design practice. IEEE Transactions on Technology and Society, 1(1), 34–47. https://doi.org/10.1109/TTS.2020.2974991
  7. W3C. (2017). Web Annotation Vocabulary. https://www.w3.org/TR/annotation-vocab/
  8. Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., ... & Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), 1–9. https://doi.org/10.1038/sdata.2016.18
  9. Zuo, C., Yang, X., Errickson, J., Li, J., Hong, Y., & Wang, R. (2025). AI-assisted evidence screening method for systematic reviews in environmental research: integrating ChatGPT with domain knowledge. Environmental Evidence, 14(1), 5. https://doi.org/10.1186/s13750-025-00358-5