It is a pleasure to announce our upcoming ConocimIA event on Artificial Intelligence and Documentation. The session presents a significant achievement in the application of artificial intelligence to Documentation: the first AI built specifically for Library and Information Science, developed from its most fundamental foundations into a research-oriented query service, without relying on ChatGPT or any other proprietary technology. Previous initiatives have either failed to deliver a fully operational service or remain dependent on third-party services outside the control of the documentation professional. Here, for the first time, complete control is achieved over the entire process: the full processing chain and its administration, down to the end user.
- Date: April 26, 2024 / 5:00–7:00 PM
- Location: Conference Room, Faculty of Documentation Sciences, UCM
- Admission: Free until capacity is reached. Attendance certificate available (fill out form)
Context and Motivation
The development of this AI is the result of several months of work during which Llama and Mistral models were adapted to the purposes of Documentation Sciences. Considerable resources were allocated to the research and implementation of the system, including the development of a specialized training approach and a careful selection of content. What makes this AI truly exceptional is its ability to understand and recognize documents and to answer complex questions within our scientific discipline. The initiative arises from the identification of several structural problems in the use of commercial generative AIs:
- Privacy and data leakage: communications with models such as ChatGPT involve surrendering information to external servers, posing risks to sensitive data or ongoing research.
- Vendor dependency: access to these services is subject to corporate policies, changes in terms of use, and potential disruptions.
- Algorithmic bias and opacity: commercial models are trained on datasets that are not always transparent, which may introduce unwanted biases.
- Lack of source attribution: ChatGPT and similar models do not provide explicit references to the documents underlying their responses, limiting their utility for academic research tasks.
- Lack of specialization: general-purpose models are not optimized for the specific needs of Documentation Sciences, with their own terminology, methods, and documentary corpora.
Faced with these limitations, the project proposed a radical alternative: building a proprietary AI, hosted on local servers, trained on a knowledge base curated by specialists, and fully controlled by the research team.
Part One: The Problem of Sources in AI
Prof. Pedro Lázaro Rodríguez
The first part of the event addresses one of the most complex challenges in artificial intelligence systems: the accurate identification, citation, and representation of information sources. Professor Lázaro presents a comparative analysis based on his experience with various generative tools and their evolution in handling documentary references.
Previous Experience with ChatGPT 3.5
During the initial sessions of ConocimIA, Professor Lázaro had presented PyDataBibPub, a Python script for extracting data from Spanish public libraries, developed with the assistance of ChatGPT 3.5. At that time, the tool proved effective for programming but revealed a fundamental limitation: when asked about the sources of information used, ChatGPT provided evasive responses.
In response to the question "What sources have you used for this information?", the answer was invariably:
"My response is based on general knowledge about the topic, as well as an understanding of the importance of these concepts. I have not consulted specific sources to provide this information, as it derives from my knowledge and understanding of the subject as an AI trained by OpenAI."
This opacity, justified by the model's design, represents a significant obstacle for academic research, where traceability of information is an essential requirement.
Alternatives with explicit citations
In light of this limitation, the speaker explores alternatives that incorporate source citation as part of their functionality. Tools such as Perplexity, Phind, Komo AI, You.com, Microsoft Copilot, Elicit, Scispace, Scite, and Scopus AI offer different approaches to this problem:
- Perplexity and Phind: incorporate specific "Sources" or "References" sections that list the documents consulted to generate the response, with links to the original sources.
- Scite and Scispace: specialized in scientific literature, provide information on the context of citations (whether an article has been supported, challenged, or contrasted).
- Elicit: focused on literature reviews, enables extraction of structured information from scientific articles along with their corresponding references.
Comparative Study: Three Concrete Cases
To evaluate the differences between tools, the speaker presents three test cases:
- Case 1: Generation of a Python script for QR codes. ChatGPT 3.5 provided the code without citing sources. Perplexity and Phind offered similar solutions, accompanied by a list of sources—Python documentation pages, tutorials, GitHub repositories—that allow verification and further exploration of the information.
- Case 2: Indicator of "system power" in the evaluation of libraries. ChatGPT 3.5 provided a generic definition based on its "general knowledge." In contrast, Perplexity provided a list of academic articles, including the work by Lázaro-Rodríguez and López-Gijón (2020) on the adaptation of the system power indicator from the Secaba-Rank methodology. The tool was able to identify the speaker's own work as a relevant source.
- Case 3: AMPdoc Software. In response to a question about AMPdoc—a free software package developed at the Faculty that bundles Apache, MySQL, PHP, and tools such as PMB, Koha, Greenstone, Omeka, or Archivematica—ChatGPT 3.5 provided a generic response about "integrated library management systems," failing to correctly identify the specific tools. In contrast, Perplexity and Phind provided the exact list of components, along with their corresponding sources.
Conclusions of the First Part
The comparison reveals a significant evolution in the ability of AI tools to handle documentary references:
- More recent models incorporate source citation as part of their design, responding to growing demands for transparency and verifiability.
- Domain specialization improves the accuracy of references: tools such as Scite or Scispace, trained on scientific corpora, offer more reliable results for academic research.
- The problem of sources is not merely technical: it has epistemological, ethical, and legal implications, particularly in contexts of scientific research and publication.
The speaker’s final reflection connects with recent news about technology giants taking shortcuts to obtain data for training their models, altering their own guidelines and, in some cases, bypassing copyright law. Against this extractive logic, an alternative is proposed based on local control, transparency, and conscious selection of sources: "AI in your hands."
Part Two: The First Documentation AI
Prof. Manuel Blázquez Ochando
The second part of the event unveils, for the first time, a fully functional Documentation AI developed as a service for research. The speaker recounts the development experience—its challenges, problems, advantages, and future lines of work—accompanied by a practical demonstration of the system.
The idea: A dedicated AI for Documentation
The project draws inspiration from prior initiatives such as PrivateGPT, an open-source software created in May 2023 that enabled the local execution of language models without an internet connection, with the capability to load documents and maintain private conversations. The goal was to advance this technology further: not only to run a model locally, but to specialize it in Documentation Sciences, train it on a curated corpus of academic literature, and transform it into an accessible service for researchers.
The motivations were both practical and ethical:
- Privacy: ensuring that research data never leaves controlled servers.
- Autonomy: avoiding dependence on external providers' policies.
- Specialization: training the AI on discipline-specific literature.
- Transparency: ability to trace the sources of each response.
- Technological sovereignty: demonstrating that universities can develop their own AI systems.
The means: hardware and software
Development required a significant investment in computational resources. The final infrastructure consists of:
- Main server: capable of running large language models.
- Base software: PrivateGPT as the system core, with embedding and LLM models.
- Language models: Adaptations of Llama and Mistral, configured to respond in Spanish and optimized for the Documentation domain.
- Development environment: Visual Studio Community, Python with Anaconda, Chocolatey for package management, and CMake for compilation.
Installation: a complex process
Setting up the system required multiple steps:
- Installation of Visual Studio Community with environments for Python, C++, and C#.
- Installation of Chocolatey as a package manager for Windows.
- Installation of CMake as a requirement for PrivateGPT.
- Download and placement of the PrivateGPT source code from GitHub.
- Installation of Anaconda/Miniconda3 to manage the Python environment.
- Configuration of environment variables for Python and CMake.
- Creation and activation of a dedicated virtual environment for PrivateGPT.
- Installation of Poetry and Pipx for dependency management.
- Execution of the installation script that automatically downloads embedding, tokenization, language, vectorization, and processing models.
- Configuration of the user profile and launching of the server on port 8001.
Configuration and Customization
Once installed, the system allows extensive customization:
- Connection port
- Result ranking values
- Authorized access
- Number of sources and contents analyzed
- Document storage directory
- Coefficient for result calculation
- Instructions for AI behavior
- Embedding model
- LLM model
- Document ingestion mode
- Context window
- Connection to PostgreSQL databases
- Maximum number of tokens
- Configuration data for APIs
- Tokenization model
- AI creativity level (temperature)
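As a rough illustration, several of the options above could be grouped in a PrivateGPT-style settings file along the following lines. The key names here are assumptions chosen to mirror the list, not the literal configuration schema used by the project:

```yaml
# Hypothetical settings sketch -- key names are illustrative only
server:
  port: 8001                 # connection port
llm:
  mode: local
  max_new_tokens: 512        # maximum number of tokens
  temperature: 0.2           # AI creativity level
  context_window: 4096       # context window
embedding:
  mode: local                # embedding model served locally
rag:
  similarity_top_k: 4        # number of sources and contents analyzed
  similarity_value: 0.45     # coefficient for result calculation
data:
  local_data_folder: local_data/documents  # document storage directory
```

Centralizing these values in one file is what makes the extensive customization described above practical: changing the ranking coefficient or the context window requires no code changes, only a restart of the service.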
Two operating modes
- Document search mode. Functions as a search engine for all documents uploaded to the system. Particularly useful for identifying who cited or commented on a specific claim in scientific articles. Provides a ranked list of results based on similarity, applying natural language processing and information retrieval techniques.
- LLM Chat Mode. Enables a conversation similar to ChatGPT. By default, it does not consider the context of uploaded files, but it can be configured to utilize information from documents.
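Since PrivateGPT exposes its functionality over a JSON API on port 8001, the two operating modes can be reached with a simple HTTP client. The sketch below is a minimal example using only the Python standard library; the endpoint path and the field names `use_context` and `include_sources` are assumptions modeled on PrivateGPT's OpenAI-style API and may differ in a given version:

```python
import json
import urllib.request

# Port taken from the installation steps above; the path is an assumption.
API_URL = "http://localhost:8001/v1/completions"

def build_query(prompt: str, use_context: bool, include_sources: bool = True) -> dict:
    """Build a JSON payload for the two modes described above:
    use_context=False behaves like plain LLM chat, while True grounds
    the answer in the ingested documents and lists the source files."""
    return {
        "prompt": prompt,
        "use_context": use_context,
        "include_sources": include_sources,
    }

def ask(prompt: str, use_context: bool = True) -> dict:
    """Send the query to the local server and return the decoded JSON response."""
    payload = json.dumps(build_query(prompt, use_context)).encode("utf-8")
    request = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Example (requires the local server to be running):
# answer = ask("What is the Secaba-Rank system power indicator?")
```

Toggling a single boolean between the two modes keeps the client trivial, which is part of what made wrapping the service in an intermediate layer feasible.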
Knowledge Base Preparation
One of the critical aspects of the project was the selection of documents to feed the AI. The adopted strategy included:
- Identification of Reliable Sources: peer-reviewed journal articles with high impact factors, documentation from reputable publishers, and content relevant to the thematic areas.
- Coherent Structuring: organization of documentation into a two-level hierarchical thematic menu, with design of cross-cutting facets for cognitive interlinking.
- Compatible formats: preferably HTML, XML, TXT, PDF, DOCX, CSV, PPTX.
- Inclusion of examples and solved cases: examples of cataloging, documentary analysis, classification, thesauri, encoding, programming.
- Specialized glossaries, dictionaries, and encyclopedias: to enhance relational capacity and improve classification.
- Cross-references: leveraging bibliographic citations from scientific papers to improve ranking and embedding calculations.
- Secondary and tertiary documents: inclusion of bibliographic reference lists to reinforce the cross-referencing effect.
- Peer review: validation of document selection by academics in the field.
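The ranking and embedding calculations that this curation strategy is meant to improve can be sketched minimally as cosine similarity between a query vector and document vectors. The toy three-dimensional "embeddings" below are invented for illustration; real embedding models produce vectors with hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec: list[float], doc_vecs: dict) -> list:
    """Return (doc_id, score) pairs sorted from most to least similar,
    as in the ranked result list of document search mode."""
    scores = {doc: cosine_similarity(query_vec, vec) for doc, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy vectors standing in for real embeddings of ingested documents
docs = {
    "cataloging.pdf":    [0.9, 0.1, 0.0],
    "thesauri.pdf":      [0.2, 0.8, 0.1],
    "bibliometrics.pdf": [0.1, 0.2, 0.9],
}
ranking = rank_documents([0.85, 0.15, 0.05], docs)
# The document closest in direction to the query vector ranks first.
```

Cross-references and secondary documents help precisely because they pull related documents closer together in this vector space, improving the resulting ranking.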
The resulting knowledge base covers subjects such as Information Retrieval, Deep Learning, Natural Language Processing, Semantic Web, Programming (PHP, Python, Java, JavaScript), Archival Science, Document Analysis, Cataloging, Document Languages, Thesauri, Bibliometrics, Scientometrics, and other areas of Documentation Sciences.
Creation of an Open Service: The Mayordomo Program
One of the greatest challenges was transforming PrivateGPT—a local and private tool—into a service accessible to multiple users. The adopted solution was to develop an intermediate layer, named "Mayordomo," which acts as an interface between the user and the AI.
The resulting architecture follows a two-tier client-server model:
- PrivateGPT manages requests and returns responses via JSON through its API.
- Mayordomo (developed on a XAMPP stack) manages user interaction: it processes requests, establishes response order, anonymizes query logging, enables export of conversations, collects user-experience data for self-learning, and monitors errors.
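The anonymized query logging performed by the intermediate layer can be sketched as follows. The function names and the salted-hash approach are hypothetical illustrations of the behavior described above, not the project's actual implementation (which runs in PHP on the XAMPP side):

```python
import hashlib
import time

def anonymize_user(user_id: str, salt: str = "conocimia") -> str:
    """Replace the user identifier with a salted one-way hash so that
    stored logs cannot be traced back to individual researchers."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()[:16]

def log_query(user_id: str, query: str, log: list) -> dict:
    """Append an anonymized, timestamped record of a query, as the
    intermediate layer might do before forwarding it to PrivateGPT."""
    record = {
        "user": anonymize_user(user_id),
        "query": query,
        "timestamp": time.time(),
    }
    log.append(record)
    return record

query_log: list = []
rec = log_query("researcher42", "Who cited the Secaba-Rank indicator?", query_log)
```

Because the hash is deterministic, repeated queries from the same user can still be grouped for the self-learning statistics without ever storing the real identity.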
This development opens new possibilities for Documentation Sciences: the ability to deploy specialized AI services, albeit requiring adequate infrastructure to ensure speed and efficiency.
The Problem of Sources and References
One of the fundamental advantages of Mayordomo/PrivateGPT is the use of sources selected by the administrator themselves. The system indicates the files used to generate each response, and since it does not connect to the internet to gather new sources, their location is straightforward to determine.
However, when the system is asked to cite references within an explanation, it can make mistakes. To mitigate this, solutions have been implemented such as the use of regular expressions to recognize references and citations in documents, and training the AI to identify citation patterns (e.g., Harvard style) and to link reference text to topics, papers, documents, and contexts.
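A minimal sketch of such a regular expression for Harvard-style in-text citations is shown below; the pattern is illustrative only, and the expressions actually used in the project are presumably more elaborate:

```python
import re

# Matches Harvard-style in-text citations such as
# "(Lázaro-Rodríguez and López-Gijón, 2020)" or "(Smith, 2021)".
HARVARD_CITATION = re.compile(
    r"\(([A-ZÁÉÍÓÚÑ][\w\-]+(?:\s+(?:and|y|&)\s+[A-ZÁÉÍÓÚÑ][\w\-]+)*),\s*(\d{4})\)"
)

def extract_citations(text: str) -> list[tuple[str, str]]:
    """Return (authors, year) pairs for every Harvard-style citation found."""
    return HARVARD_CITATION.findall(text)

sample = ("The indicator was adapted from Secaba-Rank "
          "(Lázaro-Rodríguez and López-Gijón, 2020) and later refined (Smith, 2021).")
citations = extract_citations(sample)
```

Recognizing these patterns lets the system link a citation string back to the topic, paper, and context it appears in, which is exactly the mitigation described above.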
Future Developments
The project is not yet complete. Planned developments include:
- Self-learning: Mayordomo will be able to manage the AI’s self-learning process, with content auto-updating based on user interaction.
- Webcrawler: Integration of a crawler to auto-feed the AI with newly selected content.
- Expansion of Sources: Drastic expansion of the selection of sources and documentation resources.
- Interface improvements: real-time streaming response.
- AI submodules: for user interaction personalization.
- New training methods: for more effective teaching of PrivateGPT.
Conclusions of Part Two
- AI will become an extension of human capabilities, providing human originality and intention with the speed and precision of AI.
- Natural language communication and its proper articulation are key to optimal performance.
- The documentalist will specialize in disambiguation, context identification, process definition, and organization.
- Not everything will be immediately automated, but it is only a matter of time until it is achieved and refined.
- It is foreseeable that AI will eventually generate a large number of complex processes by chaining specialized GPTs.
- The limit of AI is not so much what it cannot do, but how well we know how to ask.
Final Reflection
The event concludes with a reflection on the significance of this achievement. In a context where major technological giants compete to dominate the artificial intelligence market, the initiative to develop an AI system from within the university represents an alternative grounded in local control, transparency, and disciplinary specialization.
The "first AI for Documentation" is not merely a technical milestone; it is a demonstration that it is possible to build artificial intelligence systems that respect privacy, cite their sources, and serve academic research. Ultimately, it is a step toward an AI that is literally "in your hands."
The lecture is part of the activities of the ConocimIA Seminar, a space dedicated to monitoring and analyzing artificial intelligence in the field of Documentation Sciences.
Conference Materials
The materials used in this session are available for download in DOCX, PPTX, and PDF formats. The presentation summarizes the ideas, references, and open questions raised during the lecture and can serve as a starting point for further exploration of the topics discussed or for use in educational contexts, provided proper attribution is given.
- Lázaro-Rodríguez, P. (2024). The Problem of AI and Sources. conocimIA_plazaro_2024-04-26_problema-ia-fuentes.pdf
- Blázquez-Ochando, M. (2024). AI in Your Hands: The First AI in Documentation. conocimIA_mblazquez_2024-04-26_primera-ia-documentacion.pptx