ConocimIA is pleased to announce its upcoming session dedicated to Artificial Intelligence. Due to the enthusiastic response and growing demand from attendees, Professor Manuel Blázquez has prepared a special lecture that will explain how ChatGPT works in an accessible manner for all audiences. The goal is to provide a detailed understanding that enables each individual to approach knowledge of this advanced technology and comprehend its scope and applications.

  1. Date: December 15, 2023 / 10:30 AM–1:00 PM
  2. Location: Lecture Hall, Faculty of Documentation Sciences, UCM
  3. Admission: Free, subject to room capacity

Part One: Understanding How ChatGPT Works

Prof. Manuel Blázquez Ochando

What is GPT?

The session begins with an explanation of the acronym that gives the model its name: Generative Pre-trained Transformer. It is a generative system, meaning it is capable of producing original texts in response to user queries. Its nature is probabilistic: it calculates the probability of each word in its response and selects the most likely one based on the context built up to that point. In essence, ChatGPT is a word predictor in the form of a chat.
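The idea of "a word predictor in the form of a chat" can be illustrated with a toy sketch. The context string, vocabulary, and probabilities below are invented for the example; the real model computes probabilities over tens of thousands of tokens with a neural network, not a lookup table.

```python
# Toy illustration of next-token prediction: given the context built so far,
# pick the most probable continuation. All values here are invented.
next_token_probs = {
    "the cat sat on the": {"mat": 0.62, "sofa": 0.21, "roof": 0.17},
}

def predict_next(context: str) -> str:
    """Greedy decoding: return the single most probable next token."""
    probs = next_token_probs[context]
    return max(probs, key=probs.get)

print(predict_next("the cat sat on the"))  # -> mat
```

Greedy selection always takes the top candidate; as discussed later in the session, real systems also sample from the distribution to introduce variety.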

Differences with Traditional Information Retrieval

Both in Information Retrieval (IR) and ChatGPT, a user input is required. However, while in IR the interface is typically a search engine that returns a list of relevant documents, in ChatGPT the interface is conversational and the output is a generated text intended to directly address the user’s information need. In IR, no new documents are created; in ChatGPT, new texts are generated that can be considered new documents based on their length and context.

A common question among users is whether ChatGPT is limited to copying texts from the internet. The answer is no. When the system generates a response, it is not literally reproducing any existing document, but rather constructing a new sequence of words based on patterns learned during training. This explains why it is possible to search the internet for fragments generated by ChatGPT and find no exact matches.

The Role of Natural Language

Natural language is ambiguous, polysemous, and complex. Processing it requires advanced techniques that enable the machine to interpret intentions, extract relevant entities, and generate coherent responses. ChatGPT relies on deep neural networks—programs designed to perform highly complex tasks requiring classification and learning. Unlike traditional programs—with predefined rules, strict syntax, and well-determined workflows—neural networks learn the patterns and characteristics of natural language from examples.

Vectorization and Embeddings

One of the fundamental concepts for understanding ChatGPT is vectorization. In Information Retrieval, documents are represented as vectors in an n-dimensional space, enabling the calculation of similarity coefficients between documents and queries. ChatGPT takes this idea a step further: each word, phrase, or text fragment is converted into a vector—an embedding—of between 300 and 4096 dimensions. These vectors capture not only the presence of terms but also their semantic relationships: words with similar meanings tend to occupy nearby positions in the vector space.
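The claim that semantically similar words occupy nearby positions can be made concrete with cosine similarity between vectors. The 4-dimensional embeddings below are invented toy values (real embeddings have 300 to 4096 dimensions, as noted above):

```python
import math

# Hypothetical 4-dimensional embeddings; real models learn these from data.
embeddings = {
    "library": [0.9, 0.1, 0.3, 0.0],
    "archive": [0.8, 0.2, 0.4, 0.1],
    "banana":  [0.0, 0.9, 0.0, 0.7],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Related words score high; unrelated words score low.
print(cosine(embeddings["library"], embeddings["archive"]))
print(cosine(embeddings["library"], embeddings["banana"]))
```

This is the same similarity coefficient used in classical Information Retrieval to compare documents and queries, applied here at the level of individual terms.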

The process of generating responses follows a sequence:

  1. Analysis of the user's query
  2. Vectorization of the query
  3. Search in the list of embeddings in ChatGPT’s knowledge base
  4. Application of a probabilistic model to determine which words are most likely to satisfy the context
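Steps 2 and 3 of this sequence can be sketched as a nearest-neighbour search over stored embeddings. The "knowledge base" and all vector values below are invented stand-ins; a real system embeds text with a trained model rather than hard-coded numbers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Invented mini knowledge base of embedded fragments (step 3).
kb = {
    "library opening hours": [0.9, 0.1, 0.2],
    "banana bread recipe":   [0.1, 0.8, 0.5],
}

# Step 2: the user's query, already vectorized (toy values).
query_vec = [0.85, 0.15, 0.25]

# Step 3: retrieve the stored fragment most similar to the query.
best = max(kb, key=lambda frag: cosine(kb[frag], query_vec))
print(best)  # -> library opening hours
```

The retrieved context then feeds the probabilistic model of step 4, which decides which words to generate.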

The attention mechanism (Attention is All You Need)

The Transformer architecture, introduced in the seminal paper "Attention is All You Need" (Vaswani et al., 2017), constitutes the technical foundation of ChatGPT. The attention mechanism enables the model to weigh the importance of each word in relation to others, capturing long-range dependencies that traditional recurrent networks could not handle efficiently. This capability, combined with massive training, explains the fluency and coherence of the generated responses.
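The core operation of the Vaswani et al. paper, scaled dot-product attention, is compact enough to sketch directly: softmax(QKᵀ/√d)·V. The matrices below are tiny invented examples; real models use many attention heads over hundreds of dimensions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(K[0])
    out = []
    for q in Q:
        # How strongly this query position "attends" to each key position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted mix of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two query positions attending over three key/value positions (toy numbers).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0], [2.0], [3.0]]
print(attention(Q, K, V))
```

Because every position attends to every other position in one step, the model captures long-range dependencies without the sequential bottleneck of recurrent networks.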

Step-by-step text generation

Text generation in ChatGPT is a sequential process. The model selects an initial token, then repeatedly predicts the most likely tokens to continue the sequence, and so forth. At each step, it applies syntactic and grammatical rules to ensure the output is linguistically coherent. Sampling introduces controlled randomness into token selection, which is why the same prompt does not always produce an identical response.

This sequential construction process explains why it is difficult to distinguish whether a text was written by an AI or a human: the loss of information and reconstruction from original embeddings generate texts that, although derived from learned patterns, exhibit an appearance of originality.
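Sampling can be sketched with a temperature parameter that rescales the token distribution before drawing. The vocabulary and probabilities are invented; the mechanism is the standard temperature-sampling technique, not ChatGPT's exact implementation.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Rescale token probabilities by temperature, then sample one token.
    Low temperature -> nearly greedy; high temperature -> more varied output."""
    rng = rng or random.Random(0)
    tokens = list(probs)
    # Divide log-probabilities by the temperature, then re-normalize.
    logits = [math.log(probs[t]) / temperature for t in tokens]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return rng.choices(tokens, weights=weights, k=1)[0]

vocab = {"mat": 0.62, "sofa": 0.21, "roof": 0.17}
print(sample_with_temperature(vocab, temperature=0.01))  # almost surely "mat"
print(sample_with_temperature(vocab, temperature=2.0))   # any of the three
```

At very low temperature the sampler collapses to the greedy choice; raising it flattens the distribution, producing the variety that makes generated text hard to match against any existing document.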

Data and Scale

ChatGPT was trained on 570 GB of data from sources such as Common Crawl and OpenWebText2, and its underlying model contains 175 billion parameters. This scale, combined with the Transformer architecture, is responsible for its ability to maintain coherent conversations, answer complex questions, and perform diverse tasks without the need for task-specific programming.

Part Two: Data-Mining Public Libraries with ChatGPT

Prof. Pedro Lázaro Rodríguez

Context and Need

The second part of the session presents a practical case study on the application of ChatGPT in the field of library research. Research on public libraries requires access to reliable and up-to-date data. In Spain, sources such as the website Bibliotecas públicas españolas en cifras (BPEC) and the Estadística de Bibliotecas from CulturaBASE provide valuable information, but with significant limitations:

  1. Data queries are organized by categories, not by complete networks
  2. Obtaining data for all municipalities on a single variable may require more than 800 manual interactions
  3. Data are not available in formats that facilitate aggregated analysis
  4. Municipality-level information requires tedious manual processes

From wget to Python with ChatGPT

The speaker recounts their prior experience with tools like wget in Linux to automate downloads, and how ChatGPT 3.5 enabled them to transition to Python. The interaction with the model was iterative:

  1. Initial request: assistance in creating a script to download data
  2. Conversion: from Bash script to Python, using BeautifulSoup to parse HTML
  3. Troubleshooting: resolving errors such as externally-managed-environment on Debian
  4. Successive improvements:
     - Formatting codes with two digits ({:02d})
     - Extraction of titles from the summary attribute of tables
     - Splitting columns to separate variable and year
     - Handling Spanish numeric formats (comma as decimal, period as thousands separator)
     - Generating structured CSV and ODS files
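Two of these improvements, the two-digit code formatting and the Spanish numeric format, are small enough to sketch directly. The helper below is an illustrative reconstruction, not the actual PyDataBibPub code:

```python
def spanish_to_float(s: str) -> float:
    """Parse a Spanish-formatted number: '.' as thousands separator,
    ',' as decimal separator (e.g. '1.234,56' -> 1234.56)."""
    return s and float(s.replace(".", "").replace(",", "."))

# Province codes padded to two digits with {:02d}, as mentioned above.
codes = ["{:02d}".format(n) for n in range(1, 4)]
print(codes)                          # -> ['01', '02', '03']
print(spanish_to_float("1.234,56"))   # -> 1234.56
```

Both issues are typical of the small, concrete fixes that the iterative conversation with ChatGPT resolved one at a time.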

PyDataBibPub: The resulting script

The script developed, named PyDataBibPub, provides the following functionality:

  1. Download of data for all 52 provincial codes (formatted as 01, 02, etc.)
  2. Selection of specific variables, years, and geographic scopes
  3. Consolidation into a single CSV file per variable, encompassing all selected years
  4. Addition of columns for provinces and autonomous communities
  5. Generation of a consolidated CSV file containing all variables
  6. Creation of an ODS spreadsheet with organized tabs
  7. Cleaning of rows with inconsistent data (e.g., annual reports with zero libraries)
  8. Checking column consistency across different downloads

The final version reached 271 lines of instructions, requiring eight Python modules. The script, like the data source, was adapted to changes in the website’s structure, evolving into a second version (PyDataBibPub V2) when the original URL was modified.

Limitations and Legal Considerations

The use of these tools is not without restrictions. The Ministry of Culture’s legal notice stipulates that downloading content is limited to private use and expressly prohibits reproduction, distribution, or public communication without explicit authorization. This legal framework conditions the dissemination of the obtained data.

Reflections on the Use of ChatGPT

The presenter shares several reflections derived from their experience:

  1. ChatGPT version 3.5, with an exploratory attitude, was sufficient to develop complex tools
  2. The pedagogical value lies both in the outcomes and in the process of interaction with the model
  3. ChatGPT makes errors, but it can correct them when provided with appropriate instructions and examples
  4. Some learning is observed during the conversation: when shown the correct solution, it tends to apply it to similar cases
  5. It is advisable to maintain context within the same conversation and open a new chat when the task changes
  6. The tool is neither inherently good nor bad; ethics and morality are human categories, and technology acquires value according to how we use it

Part Three: Data-Mining in PARES (Spanish Archives Portal)

Prof. Manuel Blázquez Ochando

The Challenge of Extracting Authorities

PARES (Spanish Archives Portal) collects the identification, description, and digitization of documentation from Spanish historical archives. For research purposes, it is essential to be able to extract this data systematically. Webcrawlers are programs that enable downloading the content of target HTML pages, but traditionally have required advanced programming knowledge.

Methodology for Interacting with ChatGPT

This session presents a practical methodology for creating webcrawlers with the assistance of ChatGPT:

  1. State the objective: communicate that a webcrawler is to be created
  2. Specify the programming language (PHP, Python, etc.)
  3. Define the functions to be used (cURL, DOM, XPath)
  4. Provide the HTML code of the page to be analyzed
  5. Indicate the specific contents to be extracted
  6. Ask for help politely
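Steps 4 and 5 of this methodology can be sketched with Python's standard library (the session also mentions PHP with cURL, DOM, and XPath as an alternative). The HTML snippet and the `authority` class name below are invented stand-ins for a PARES page; the real markup differs.

```python
from html.parser import HTMLParser

# Invented stand-in for a PARES authority page (step 4: provide the HTML).
html = '<html><body><h1 class="authority">James Monroe</h1></body></html>'

class AuthorityExtractor(HTMLParser):
    """Collect the text of <h1 class="authority"> elements (step 5:
    indicate the specific contents to be extracted)."""
    def __init__(self):
        super().__init__()
        self.in_target = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and ("class", "authority") in attrs:
            self.in_target = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.names.append(data.strip())

parser = AuthorityExtractor()
parser.feed(html)
print(parser.names)  # -> ['James Monroe']
```

In a real crawl the HTML would be downloaded page by page; the point of the methodology is that ChatGPT can write and iteratively repair exactly this kind of extraction logic once given the target markup.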

Error resolution and iterative learning

When the code presents errors, it is recommended to:

  1. Show the HTML code fragment where the extraction fails
  2. Provide the error message or warning returned by the server
  3. Once resolved, continue with the development

It is important to acknowledge when the model's response is correct, in order to reinforce its learning. If it is observed that ChatGPT begins to lose context (when the conversation becomes too long), it is advisable to remind it of the main task.

Practical Application: Extraction of Authorities

The practical demonstration focuses on extracting authority data from PARES, using examples such as James Monroe, Philip II, and Manuel Filiberto of Savoy, among others. The generated code enables the automatic retrieval of cataloging information for these authorities, which can subsequently be processed for analysis or integration into other systems.

Conclusions on the Use of ChatGPT for Scraping

  1. Prior knowledge of scraping is required to fully leverage the tool
  2. ChatGPT greatly facilitates the task by enabling faster programming
  3. It helps to debug bugs efficiently
  4. It is a very useful tool for creating other tools
  5. One cannot expect it to program everything without supervision (at least for now)
  6. Prolonged interaction shows certain learning when taught with examples
  7. Within the same conversation, it is advisable not to change the working context
  8. When the conversation becomes too long, it is advisable to remind the agent of the task
  9. A notable improvement is observed from the release of ChatGPT until the present date

General Conclusion

The joint session provides a comprehensive overview of ChatGPT’s potential in the field of Documentation Sciences. From understanding its internal functioning to its practical application in data extraction and analysis tasks, participants gain both conceptual foundations and applicable tools for their professional or academic work. The combination of theoretical explanations with practical case studies of data-mining in public libraries and authority extraction from PARES demonstrates the versatility of this technology and its capacity to enhance the capabilities of information professionals.

This session is part of the activities of the ConocimIA Seminar, a space dedicated to monitoring and analyzing artificial intelligence in the field of Documentation Sciences.

Conference Materials

The materials used in this session are available for download in PPTX and PDF formats. The presentation captures the ideas, references, and open questions raised throughout the conference and can serve as a starting point for further exploration of the topics discussed or for use in educational contexts, with proper attribution.

  1. Blázquez-Ochando, M. (2023). How ChatGPT works. conocimIA_mblazquez_2023-12-15_como-funciona-chatgpt.pptx
  2. Lázaro-Rodríguez, P. (2023). Data-mining of Public Libraries with ChatGPT. conocimIA_plazaro_2023-12-15_data-mining-bpe.pdf
  3. Blázquez-Ochando, M. (2023). Data-mining of PARES with ChatGPT. conocimIA_mblazquez_2023-12-15_data-mining-pares.pptx