The implementation of Retrieval-Augmented Generation (RAG) systems in academic and library environments has increased significantly in recent years, driven by the integration of Large Language Models (LLMs) with structured knowledge bases. However, their efficacy depends on the quality of the retrieved context. When retrieved documents are irrelevant, biased, or noisy, the generated response—though linguistically coherent—can produce erroneous statements that compromise the system's epistemic credibility in scientific research contexts, where documentary veracity is indispensable.
LLMs produce responses with a syntactic fluency that simulates authority. When provided with inadequate context—for example, an article on fluid dynamics instead of quantum field theory—they can integrate technical terminology, logical structures, and seemingly valid references, generating an internally coherent narrative. This coherence acts as a credibility heuristic for users lacking technical knowledge of the system's architecture. In high-cognitive-demand library environments, where users possess prior domain knowledge, such coherence becomes a bias: the response appears correct but is false by omission or contextual distortion.
This dynamic reflects a fundamental limitation of LLMs: the insertion of external information does not imply the internalization of knowledge (Lu et al., 2026). The model executes instructions without understanding them; if the context is erroneous, the output will be incorrect, yet possess an appearance of legitimacy. For library users, this false legitimacy can lead to inaccurate citations, flawed theoretical frameworks, or the rejection of valid sources due to misalignment with the generated narrative. Trust ceases to be based on the user's critical evaluation and shifts toward the system's performative authority.
The Information Professional as a Critical Agent
In traditional library and documentary service models, trust was founded on the selection and organization performed by the professional. In RAG systems, that trust is externalized to the algorithm, without transparency regarding its justification process. The information professional, especially in advanced research, acts not as a passive consumer but as an agent seeking validation, not merely information. Their trust is built upon three dimensions: coherence with prior knowledge, traceability of sources, and intertextual consistency.
A recent study in physical science libraries showed that while 87% of users considered generated responses "useful," only 41% deemed them "reliable for citation in peer-reviewed publications" (Jat, Ghosh & Suresh, 2026). This gap between utility and credibility indicates that the perception of trust does not depend solely on factual accuracy, but on the system's ability to recognize its limits. When the context is irrelevant but the LLM responds with absolute certainty, cognitive dissonance is generated: the system acts as if it possesses knowledge, but its output contradicts the user's disciplinary expertise.
This dissonance is not resolved with a greater quantity of data, but with uncertainty metadata. A recent proposal suggests that upon detecting heterogeneous, contradictory contexts or those with low semantic similarity to the query, the system should generate responses qualified by uncertainty, rather than dogmatic affirmation (Monteiro et al., 2026). In library settings, this could be expressed as: "The retrieved documents present discrepancies in the definition of term X. It is recommended to consult the original sources [links]."
Contextual Trust Metrics: Beyond Factual Accuracy
Traditional RAG evaluation metrics—accuracy, coverage, ROUGE, BLEURT—do not capture user trust. A framework is required that integrates cognitive, epistemic, and pragmatic dimensions. Four contextually grounded metrics are proposed:
- EC (Epistemic Coherence)
- TC (Traceability Consistency)
- PI (Perception of Integrity)
- NR (Noise Resistance)
These metrics are complementary rather than mutually exclusive. EC and TC operate on the plane of objective veracity; PI captures the user's subjective experience; NR evaluates resilience under adverse conditions—a critical feature in libraries where databases may be incomplete, poorly indexed, or biased by the unequal availability of English-language publications.
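The text does not fix a formula for combining the four metrics, so the following is a minimal sketch under stated assumptions: each metric is normalized to [0, 1], and a weighted sum yields a composite trust score. The `TrustScores` class, the weights, and the `composite_trust` function are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

@dataclass
class TrustScores:
    """Container for the four contextual trust metrics, each in [0, 1].

    The aggregation below is an illustrative assumption; the text does not
    define an operationalization for EC, TC, PI, or NR.
    """
    ec: float  # Epistemic Coherence
    tc: float  # Traceability Consistency
    pi: float  # Perception of Integrity (e.g., survey score / 7)
    nr: float  # Noise Resistance

def composite_trust(s: TrustScores, weights=(0.3, 0.3, 0.2, 0.2)) -> float:
    """Weighted aggregate of the four metrics (the weights are a hypothetical choice)."""
    w_ec, w_tc, w_pi, w_nr = weights
    return w_ec * s.ec + w_tc * s.tc + w_pi * s.pi + w_nr * s.nr

scores = TrustScores(ec=0.72, tc=0.60, pi=4 / 7, nr=0.80)
print(round(composite_trust(scores), 3))
```

Keeping the objective metrics (EC, TC) at higher weight than the subjective ones (PI) reflects the text's distinction between veracity and perceived integrity, but any production deployment would need to calibrate the weights per discipline.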
Contextual Trust Architecture
Implementing these metrics requires a reconfiguration of the RAG pipeline. Enhancing the retriever with techniques such as token initialization or softmax optimization is insufficient. An Epistemic Trust Evaluation Module is required to act as an epistemic observer between retrieval and generation. This module must evaluate not only the relevance of the contexts but also their internal consistency and alignment with the user's cognitive profile.
The UniDriveVLA architecture (Li et al., 2026)—which unifies perception, comprehension, and action in autonomous systems—offers a useful analogy. In RAG, "perception" corresponds to document retrieval, "comprehension" to the semantic interpretation of context, and "action" to response generation. If perception is noisy and comprehension assumes its validity, the action will be erroneous. A contextual trust system must interrupt this flow when it detects incoherence between the three levels.
The integration of specialized agents—as seen in the Self-Driving Portfolio model (Ang, Azimbayev & Kim, 2026)—allows for an internal critical review: one agent verifies the traceability of citations, another compares the response with real citation metadata in Scopus or Web of Science, and a third evaluates whether the tone is excessively dogmatic. This multi-agent architecture not only improves the quality of the result but also generates procedural transparency, which the user perceives as reliability.
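The three-agent review described above can be sketched as a chain of independent checkers whose findings are pooled into a single issue list. Everything here is hypothetical scaffolding: the agent functions, the keyword heuristic for dogmatic tone, and the `review` pipeline are illustrative, not an API from the cited work.

```python
# Hypothetical sketch of the three-agent critical review. In production the
# metadata agent would query Scopus or Web of Science APIs; here a plain set
# of known identifiers stands in for that index.

def traceability_agent(response, docs):
    """Flags citations in the response that are absent from the retrieved documents."""
    cited = set(response.get("citations", []))
    known = {d["doi"] for d in docs}
    return [f"Unverified citation: {c}" for c in sorted(cited - known)]

def metadata_agent(response, citation_index):
    """Compares cited works against an external citation index (stand-in for Scopus/WoS)."""
    return [f"Not found in citation index: {c}"
            for c in response.get("citations", []) if c not in citation_index]

def tone_agent(response):
    """Flags excessively dogmatic phrasing (a naive keyword heuristic)."""
    dogmatic = ("it is certain", "undoubtedly", "proves that")
    text = response.get("text", "").lower()
    return [f"Dogmatic phrase: '{p}'" for p in dogmatic if p in text]

def review(response, docs, citation_index):
    """Runs all three agents and pools their findings into one issue list."""
    issues = []
    issues.extend(traceability_agent(response, docs))
    issues.extend(metadata_agent(response, citation_index))
    issues.extend(tone_agent(response))
    return issues

resp = {"text": "It is certain that X holds.", "citations": ["doi:1", "doi:2"]}
print(review(resp, docs=[{"doi": "doi:1"}], citation_index={"doi:1"}))
```

The pooled issue list is itself a form of procedural transparency: surfacing it to the user shows *how* the response was vetted, not just the verdict.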
The Ethics of Trust in Automated Systems
Trust in a RAG system is an epistemic contract. When a library user consults the system, they assume it acts with intellectual integrity: it does not hide limitations, it does not fabricate authority, and it does not substitute criticism with automation. The presence of irrelevant contexts is not a minor technical error; it is a violation of that contract.
A recent proposal can be reinterpreted as an epistemic attribution model (Aboeleneen et al., 2026): if multiple agents contribute to the response, how is credit assigned? In RAG, every retrieved document should carry a "credibility footprint": not just its source, but its history of accuracy in similar queries, its level of consensus within the academic community, and its alignment with the user profile.
Trust is not granted; it is built. In an environment where language models generate more convincing responses than humans, the responsibility lies with those who design the systems: it is not enough for them to be accurate; they must be honest. The absence of noise is not sufficient; transparency is required. The library user's trust is not won with more data, but with less false certainty.
Practical Implementation: The Epistemic Verification Module (EVM)
To translate the theory of contextual trust into operational action, the implementation of an Epistemic Verification Module (EVM) is proposed—an autonomous component inserted between the retriever and the generator in the RAG pipeline. The EVM does not replace retrieval; instead, it interrupts, qualifies, and recommends actions based on the epistemic quality of the context. Its design follows three principles: procedural transparency, cross-review, and disciplinary adaptability.
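A minimal sketch of how the EVM can sit between retriever and generator, assuming each retrieved document carries a precomputed similarity score. The function names, the threshold value, and the fallback message are illustrative assumptions, not a specification from the text.

```python
# Hypothetical EVM gate: it interrupts generation when the retrieved context
# is epistemically weak, and qualifies the answer instead of asserting it.

def evaluate_context(docs, similarity_threshold=0.55):
    """Return (ok, notes). ok is False when *every* retrieved document
    falls below the (assumed) similarity threshold."""
    if not docs:
        return False, ["No documents retrieved."]
    low = [d for d in docs if d["similarity"] < similarity_threshold]
    notes = [f"{len(low)} document(s) below similarity threshold."] if low else []
    return len(low) < len(docs), notes

def rag_with_evm(query, retriever, generator):
    """Retrieval -> EVM check -> generation (or a qualified fallback)."""
    docs = retriever(query)
    ok, notes = evaluate_context(docs)
    if not ok:
        return ("The retrieved documents are weakly related to the query. "
                "Please consult the original sources."), notes
    return generator(query, docs), notes

# Toy usage with stub retriever/generator:
weak_docs = [{"id": "doi:10.1000/1", "similarity": 0.30}]
answer, notes = rag_with_evm("q", lambda q: weak_docs, lambda q, d: "generated text")
print(answer)
```

The key design choice, per the text's principle of procedural transparency, is that the EVM returns its notes alongside the answer rather than silently discarding weak context.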
Implementation Process (5 Steps):
- Metadata Enrichment: Configure the retriever with rich metadata and disciplinary ontologies. Use libraries like Sentence Transformers (e.g., all-MiniLM-L6-v2) for embeddings, but anchor documents to domain-specific ontologies (e.g., OWL 2 based on Dublin Core and BIBFRAME). Each retrieved document must include consensus metadata: citation count in Scopus, publication year, presence in certified institutional repositories (such as arXiv or PubMed Central), and consensus level according to Cochrane evidence levels adapted for the social sciences and humanities.
- Traceability Validation: Implement a traceability validator using embeddings and SPARQL. For every claim generated by the LLM, extract key entities (people, concepts, theories) and validate their presence in the retrieved documents via semantic similarity and RDF knowledge base queries.
Python Example using rdflib and sentence-transformers:
Table 1. Example of implementation
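The traceability-validation step can be sketched as follows. To keep the example self-contained and runnable, toy two-dimensional vectors stand in for sentence-transformers embeddings, and an in-memory set of triples stands in for a SPARQL query against an rdflib graph; the vocabulary (`ex:`), the entity IRIs, and the 0.8 threshold are all hypothetical.

```python
import math

# Stand-in RDF facts: (subject, predicate, object). In production these
# would live in an rdflib Graph queried via SPARQL.
TRIPLES = {
    ("ex:Entanglement", "ex:definedIn", "doi:10.1000/doc1"),
    ("ex:Decoherence", "ex:definedIn", "doi:10.1000/doc2"),
}

# Stand-in for model.encode(...) from sentence-transformers.
EMBEDDINGS = {
    "entanglement": [0.9, 0.1],
    "Entanglement": [0.88, 0.12],
    "fluid dynamics": [0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def entity_in_kb(entity_iri):
    """SPARQL-like existence check: is the entity the subject of any triple?"""
    return any(s == entity_iri for s, _, _ in TRIPLES)

def validate_claim(claim_term, entity_iri, doc_term, threshold=0.8):
    """A claim is traceable if its key entity exists in the knowledge base
    AND the claim is semantically close to the retrieved document."""
    sim = cosine(EMBEDDINGS[claim_term], EMBEDDINGS[doc_term])
    return entity_in_kb(entity_iri) and sim >= threshold

print(validate_claim("entanglement", "ex:Entanglement", "Entanglement"))
print(validate_claim("entanglement", "ex:Entanglement", "fluid dynamics"))
```

In a real deployment, `EMBEDDINGS` would be replaced by a SentenceTransformer model and `entity_in_kb` by a SPARQL `ASK` query, but the validation logic (entity presence AND semantic proximity) is the same.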
- Uncertainty Agent: Integrate an uncertainty agent based on Small Language Models (SLMs). Use lightweight models like Phi-3-mini or Gemma-2-2B to evaluate the tone of the generated response. This agent classifies whether the response is dogmatic ("it is certain that..."), hedged ("it could be suggested that..."), or uncertain ("there is no consensus on..."). It is trained on a set of 500 responses labeled by expert librarians in the hard sciences and humanities.
- Warning Activation Rules: Define operational thresholds to trigger user alerts. These thresholds are calibrated by discipline through an expert panel and updated quarterly. For example:
- If EC < 0.65 → Show warning: "The response contradicts the consensus in [X] key sources."
- If TC < 0.55 → "Citations are not supported by the retrieved documents."
- If PI < 3/7 (per user survey) → Automatically include: "This result requires verification in primary sources."
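The activation rules above can be sketched directly. The thresholds (0.65, 0.55, 3/7) and the warning texts come from the rules themselves; the function name and parameters are illustrative.

```python
# Warning-activation rules: each metric below its threshold appends the
# corresponding user-facing warning.

def activate_warnings(ec: float, tc: float, pi: float, n_sources: int) -> list:
    """Return the list of warnings triggered by the current metric values."""
    warnings = []
    if ec < 0.65:
        warnings.append(f"The response contradicts the consensus in {n_sources} key sources.")
    if tc < 0.55:
        warnings.append("Citations are not supported by the retrieved documents.")
    if pi < 3 / 7:  # PI reported on a 7-point user survey
        warnings.append("This result requires verification in primary sources.")
    return warnings

print(activate_warnings(ec=0.60, tc=0.70, pi=0.50, n_sources=4))
```

Because the thresholds are calibrated per discipline, they would in practice be loaded from configuration rather than hard-coded as here.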
- Audit Reports: Generate epistemic audit reports in JSON-LD. Every RAG response should be accompanied by audit metadata accessible via API, detailing retrieved documents (DOIs, ISBNs), validated entities, average similarity, and perceived uncertainty levels.
Table 2. Simplified example of an audit report in JSON-LD format.
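A report of the kind described above might be built as follows. The JSON-LD vocabulary (`evm:` terms under an example.org namespace), the field names, and the sample values are hypothetical; the text defines no schema.

```python
import json

# Illustrative epistemic audit report serialized as JSON-LD. The "evm:"
# vocabulary is a hypothetical namespace invented for this sketch.
audit_report = {
    "@context": {"evm": "https://example.org/evm#"},
    "@type": "evm:AuditReport",
    "evm:query": "definition of epistemic trust",
    "evm:retrievedDocuments": [
        {"@id": "https://doi.org/10.1000/xyz123", "evm:similarity": 0.82},
        {"@id": "urn:isbn:9780000000000", "evm:similarity": 0.67},
    ],
    "evm:validatedEntities": ["epistemic trust", "RAG"],
    "evm:averageSimilarity": 0.745,
    "evm:uncertaintyLevel": "moderate",
}

# Serialize for delivery via the audit API endpoint.
payload = json.dumps(audit_report, indent=2)
print(payload.splitlines()[0])
```

Exposing this payload through a REST endpoint alongside each response is what turns the audit trail into procedural transparency for the end user.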
The EVM can be integrated as a microservice into existing library platforms (such as Koha, Libero, or DSpace) via REST APIs or Webhooks. Its implementation does not require replacing the LLM, but rather adding a layer of critical verification that transforms the response from "silent authority" into a "responsible assistant." In university libraries with access to Scopus or Web of Science, the update of "credibility footprints" can be automated via daily scripts that query Elsevier and Clarivate APIs. This dynamic turns the RAG system not into a replacement for the librarian, but into their cognitive extension: an assistant that amplifies their critical judgment rather than substituting it.
References
- Aboeleneen, A.; Abdallah, M.; Erbad, A.; Salem, A. (2026). CIVIC: Cooperative Immersion Via Intelligent Credit-sharing in DRL-Powered Metaverse. arXiv preprint arXiv:2604.02284. https://doi.org/10.48550/arXiv.2604.02284
- Ang, A.; Azimbayev, N.; Kim, A. (2026). The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management. arXiv preprint arXiv:2604.02279. https://doi.org/10.48550/arXiv.2604.02279
- Jat, T.; Ghosh, T.; Suresh, K. (2026). Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider. arXiv preprint arXiv:2604.02259. https://doi.org/10.48550/arXiv.2604.02259
- Li, Y.; Zhou, L.; Yan, S.; Liao, B.; Yan, T.; Xiong, K.; Wang, X. (2026). UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving. arXiv preprint arXiv:2604.02190. https://doi.org/10.48550/arXiv.2604.02190
- Lu, Z.; Yao, Z.; Wu, J.; Han, C.; Gu, Q.; Cai, X.; Shen, Y. (2026). SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization. arXiv preprint arXiv:2604.02268. https://doi.org/10.48550/arXiv.2604.02268
- Monteiro, J.; Gavenski, N.; Zuin, G.; Veloso, A. (2026). When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning. arXiv preprint arXiv:2604.02226. https://doi.org/10.48550/arXiv.2604.02226