The integration of generative artificial intelligence into the processes of production, management, and dissemination of knowledge has modified traditional documentary practices. Large language models (LLMs) generate texts, bibliographic summaries, and systematic reviews with high efficiency, but without explicit tracking of the provenance, context, or veracity of the produced content. This absence of epistemic traceability poses a challenge for Documentation Science: if AI generates documentary information, what mechanisms record its origin?

The adoption of artificial intelligence in knowledge management implies epistemological transformations beyond technical aspects. Zhang, Zuo, and Yang (2025) identify that the use of generative AI in organizational environments amplifies data biases, generates information overload, and fosters technological dependency—problems that are not resolved through improvements in interfaces or algorithmic performance, but rather through knowledge governance frameworks that guarantee authenticity. Al Halbusi et al. (2025) point out that the efficacy of AI-generated results in green innovation depends on the quality of training data and contextual validation mechanisms, which directly links the utility of knowledge to its documentary genealogy.

The Crisis of Traceability in Algorithmic Generation

AI models function as black boxes: they produce coherent outputs without revealing the sources, inference processes, or associated levels of uncertainty. This contrasts with the fundamental principles of scientific documentation, where citation, reference, and verification are necessary conditions for credibility. Traditional metadata—author, date, source, resource type—are insufficient to represent the complexity of content generated by algorithmic systems.

Alavi and Leidner's (2001) systematic review on knowledge management establishes that the conversion of implicit knowledge into explicit knowledge requires rigorous encoding mechanisms. When this process is automated through AI trained on non-auditable data—anonymous forums, texts with ambiguous licenses, or collections lacking metadata—encoding becomes opaque. Nonaka's SECI model, designed to explain knowledge creation in human organizations, loses applicability when one of the agents—AI—lacks intentionality and contextual awareness, yet influences the construction of shared knowledge (Nonaka & Takeuchi, 1995; Zhang, Zuo & Yang, 2025).

Proposal: Algorithmic Origin Metadata (AOM)

The implementation of a metadata type termed Algorithmic Origin Metadata (AOM) is proposed, intended to record three essential components in each AI-generated fragment: the model used, the training sources, and the level of confidence associated with the produced information. This metadata does not replace existing citation systems but rather complements them in contexts where documentary production is automated.

  1. Identified Model: Name and version of the model (for example, GPT-4o-2024-05-13, Llama-3-70b-Instruct, Claude-3.5-Sonnet), along with the execution environment (local API, cloud platform, custom instance).
  2. Training Source: Datasets used for training, identified via persistent URIs (DOI, repository URL) and cutoff dates. Example: Common Crawl 2023-47, arXiv 1990–2023, PubMed Central 2000–2024, Wikipedia (es) 2024-06-15.
  3. Confidence Level: A quantitative or qualitative value expressing the model's certainty regarding the veracity of the information, derived from internal metrics (prediction probability, entropy, consensus among multiple inferences) or external validation (expert evaluation, comparison with reliable sources).
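As a sketch, the three AOM components above could be captured in a simple record structure. The field names and the qualitative thresholds below are illustrative assumptions, not part of any published standard:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AOMRecord:
    """Illustrative Algorithmic Origin Metadata record covering the
    three proposed components: model, training sources, confidence."""
    model_name: str             # e.g. "GPT-4o"
    model_version: str          # e.g. "2024-05-13"
    execution_environment: str  # "local API", "cloud platform", "custom instance"
    training_sources: List[str] = field(default_factory=list)  # persistent URIs or labelled corpora
    confidence_level: float = 0.0  # 0.0 (no confidence) .. 1.0 (full confidence)

    def confidence_category(self) -> str:
        """Map the numeric score to the qualitative low/medium/high scale;
        the cutoffs are hypothetical."""
        if self.confidence_level >= 0.75:
            return "high"
        if self.confidence_level >= 0.5:
            return "medium"
        return "low"

record = AOMRecord(
    model_name="GPT-4o",
    model_version="2024-05-13",
    execution_environment="cloud platform",
    training_sources=["Common Crawl 2023-47", "PubMed Central 2000-2024"],
    confidence_level=0.88,
)
```

A record like this could accompany each AI-generated fragment and later be serialized into the metadata formats discussed below.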

The AOM can be integrated into standardized formats such as Dublin Core, Schema.org, or BIBFRAME, extending existing elements with attributes specific to algorithmic entities. In an AI-generated bibliographic record, the dc:creator field could reference a URI describing the model; dc:source would include the training corpora; and a proposed dc:confidence extension would express a numerical value between 0.0 and 1.0, or a qualitative category such as “low”, “medium”, or “high”. To illustrate its practical application, the following example shows how the AOM could be expressed in JSON-LD syntax (commonly used with Schema.org):

{
  "@context": [
    "https://schema.org/",
    { "aom": "https://example.org/aom#" }
  ],
  "@type": "ScholarlyArticle",
  "name": "AI-Generated Literature Summary",
  "creator": {
    "@type": "SoftwareApplication",
    "name": "GPT-4o",
    "softwareVersion": "2024-05-13"
  },
  "aom:algorithmicOrigin": {
    "trainingSource": [
      "https://doi.org/10.1016/j.techfore.2024.123897",
      "PubMed Central 2000–2024"
    ],
    "confidenceLevel": 0.88
  }
}
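A repository ingest pipeline could check such a record programmatically before accepting it. The following Python sketch validates the minimal structural requirements; the field names follow the illustrative example above, and the aom namespace is a placeholder rather than a registered vocabulary:

```python
# All AOM field names and the "aom" namespace below are illustrative
# placeholders, not a registered vocabulary.
record = {
    "@context": ["https://schema.org/", {"aom": "https://example.org/aom#"}],
    "@type": "ScholarlyArticle",
    "name": "AI-Generated Literature Summary",
    "creator": {
        "@type": "SoftwareApplication",
        "name": "GPT-4o",
        "softwareVersion": "2024-05-13",
    },
    "aom:algorithmicOrigin": {
        "trainingSource": [
            "https://doi.org/10.1016/j.techfore.2024.123897",
            "PubMed Central 2000-2024",
        ],
        "confidenceLevel": 0.88,
    },
}

def is_valid_aom(doc: dict) -> bool:
    """Minimal structural gate a repository might apply at ingest:
    at least one training source, and a confidence level in [0, 1]."""
    origin = doc.get("aom:algorithmicOrigin", {})
    conf = origin.get("confidenceLevel")
    return (bool(origin.get("trainingSource"))
            and isinstance(conf, (int, float))
            and 0.0 <= conf <= 1.0)

is_valid_aom(record)  # → True
```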

Implications for Knowledge Management and Documentary Ethics

The adoption of the AOM redefines roles within the documentary chain. The librarian must assess not only the relevance and quality of resources but also the algorithmic transparency of AI-generated content. Digital libraries, institutional repositories, and research centers must establish policies requiring the inclusion of the AOM in all materials generated by artificial intelligence, particularly those intended for teaching, research, or critical decision-making.

This proposal responds to the notion of mutable and contextual knowledge articulated by McInerney (2002): if knowledge is transformed by new data, its origin must be documented dynamically. An AI-generated text may shift in meaning following retraining; the AOM enables the tracing of such evolution. In domains such as public health or higher education—where errors carry tangible consequences—the capacity to audit the algorithmic provenance of a diagnosis, a recommendation, or a bibliographic summary constitutes both a technical and an ethical imperative.

The integration of the AOM with explainable artificial intelligence (XAI) systems and automated verification tools—such as AI-generated content detectors that analyze lexical entropy patterns or n-gram distributions—enables the construction of more resilient documentary ecosystems. These systems may issue alerts when a fragment originates from unverified sources or when the confidence level falls below the acceptable threshold for academic use.
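As an illustration of such an automated check, the following sketch derives a crude confidence score from per-token prediction probabilities (one of the internal metrics mentioned earlier) and flags fragments that fail the audit. The 0.7 cutoff and all function names are hypothetical:

```python
import math
from typing import List, Optional

ACADEMIC_THRESHOLD = 0.7  # illustrative cutoff, not a standardized value

def confidence_from_probs(token_probs: List[float]) -> float:
    """Geometric mean of per-token prediction probabilities (the
    inverse of perplexity), a common internal proxy for model certainty."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))

def audit_fragment(confidence: float, sources: List[str]) -> Optional[str]:
    """Return an alert message when a fragment fails the audit,
    or None when it passes."""
    if not sources:
        return "ALERT: no verifiable training sources recorded"
    if confidence < ACADEMIC_THRESHOLD:
        return (f"ALERT: confidence {confidence:.2f} is below the "
                f"academic threshold {ACADEMIC_THRESHOLD}")
    return None

# A fragment with strong per-token certainty and documented sources passes:
audit_fragment(confidence_from_probs([0.9, 0.85, 0.92]),
               ["PubMed Central 2000-2024"])  # → None
```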

The implementation of the AOM could foster the creation of public registries of AI models, analogous to clinical trial registries, wherein not only architectures are documented, but also detected biases, known limitations, and datasets excluded on ethical grounds. Such systemic transparency represents a step toward algorithmic accountability in knowledge production, and its scalable adoption will depend upon collaboration among librarians, AI engineers, and science policy makers.
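Such a registry entry might take the following shape. Every field name and value below is hypothetical, intended only to show what a clinical-trial-style registration could record:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ModelRegistryEntry:
    """Illustrative shape for a public AI-model registry entry,
    analogous to a clinical-trial registration; all fields hypothetical."""
    model_name: str
    version: str
    architecture: str
    detected_biases: List[str] = field(default_factory=list)
    known_limitations: List[str] = field(default_factory=list)
    ethically_excluded_datasets: List[str] = field(default_factory=list)

# A fictitious registration, for illustration only:
entry = ModelRegistryEntry(
    model_name="ExampleLM",
    version="1.0",
    architecture="decoder-only transformer",
    detected_biases=["overrepresentation of English-language sources"],
    known_limitations=["no per-answer source attribution"],
    ethically_excluded_datasets=["corpora with ambiguous licensing"],
)
```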

References

  1. Al Halbusi, H., Al-Sulaiti, K. I., Alalwan, A. A., & Al-Busaidi, A. S. (2025). AI capability and green innovation impact on sustainable performance: Moderating role of big data and knowledge management. Technological Forecasting and Social Change, 210, 123897. https://doi.org/10.1016/j.techfore.2024.123897
  2. Alavi, M., & Leidner, D. E. (2001). Review: Knowledge management and knowledge management systems: Conceptual foundations and research issues. MIS Quarterly, 25(1), 107–136. https://doi.org/10.2307/3250961
  3. McInerney, C. (2002). Knowledge management and the dynamic nature of knowledge. Journal of the American Society for Information Science and Technology, 53(12), 1009–1018. https://doi.org/10.1002/asi.10109
  4. Nonaka, I., & Takeuchi, H. (1995). The knowledge-creating company: How Japanese companies create the dynamics of innovation. Oxford University Press.
  5. Zhang, Q., Zuo, J., & Yang, S. (2025). Research on the impact of generative artificial intelligence (GenAI) on enterprise innovation performance: a knowledge management perspective. Journal of Knowledge Management. https://doi.org/10.1108/JKM-07-2025-0995