Reference

  1. Blázquez-Ochando, M.; Prieto-Gutiérrez, J.J.; Ovalle-Perandones, M.A. (2025). Prompt engineering for bibliographic web-scraping. Scientometrics, 130(7), 3433-3453. https://doi.org/10.1007/s11192-025-05372-5

Comment

Bibliographic catalogs store millions of records that constitute an invaluable source for bibliometric and documentary research. However, access to them is not always possible via protocols such as OAI-PMH or SRU/SRW. In many cases, the only viable option is web-scraping: the automated extraction of data from the catalogs’ web pages. Traditionally, this task required advanced programming knowledge and an iterative development process that could extend over days or weeks.

The emergence of large language models (LLMs) such as ChatGPT has opened new possibilities. However, as we have seen in previous entries of this blog, the quality of AI responses depends largely on how we formulate our questions. This leads to the central question of our research: Is it possible to design a prompt that enables obtaining, in a single interaction with the AI, a fully functional web-scraping program tailored to the specificities of a bibliographic catalog?

The Methodology: Structured Prompt Engineering

The article proposes a method for designing advanced prompts based on five essential components:

  1. Definition of the role: specify to the AI the role it must adopt (in our case, a programmer specialized in web-scraping with PHP) and the required competencies.
  2. Context and purpose: clearly describe the problem to be solved, the development environment (Apache server, PHP, MySQL), and the expected level of detail.
  3. Inputs and constraints: specify the functions to be used (e.g., curl instead of file_get_contents), data selection methods (such as XPath), and the conditions it must adhere to.
  4. Examples of input and output: provide fragments of the target page's source code and demonstrate how the data should be extracted.
  5. Detailed steps: break down the workflow that the program must follow, acting as a step-by-step procedural guide for the AI.
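Assembled, the five components might take a markdown shape along these lines (an illustrative sketch, not the paper's verbatim prompt; the URL and HTML fragment are placeholders):

```markdown
# Role
You are a programmer specialized in web-scraping with PHP, experienced
with curl, XPath, and MySQL.

# Context and purpose
Develop a program that extracts bibliographic records from the catalog
at <catalog URL>. It will run on an Apache/PHP/MySQL server. Explain
each step of the code.

# Inputs and constraints
- Download pages with curl (not file_get_contents).
- Select data with XPath expressions.
- Handle missing fields gracefully.

# Example of input and output
Given this fragment of a record page's source code:
<HTML fragment>
extract: title, author, place of publication, publisher, date.

# Steps
1. Download the record page.
2. Parse the HTML and apply the XPath queries.
3. Return the extracted fields as structured data.
```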

This approach is supported by recognized prompt-engineering techniques such as Role Prompting (assigning a specialized role) and Few-shot Prompting (learning from examples), combined with a markdown structure that helps the model's attention mechanism retain the relevant information.

Validation: The Catalog of the National Library of Spain

To test the effectiveness of the method, we selected the catalog datos.bne.es of the National Library of Spain as a case study. We designed two types of prompts:

  1. A control prompt: a simple, unstructured request for the scraping program.
  2. An advanced prompt: built with the five components described above.

The results were conclusive. The control prompt generated code that did not function, based on incorrect assumptions about the HTML structure of the records. In contrast, the advanced prompt produced, in a single interaction, a fully functional PHP program capable of extracting bibliographic metadata (title, author, place of publication, publisher, date, physical description, etc.) with a very high level of accuracy.
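As an illustration of the kind of program the advanced prompt produced, the sketch below fetches a page with curl and extracts fields with XPath. It is a minimal, hypothetical example: the XPath expressions are placeholders, not the actual markup of datos.bne.es.

```php
<?php
// Hypothetical sketch of a curl + XPath scraper; the XPath queries below
// are placeholders and do NOT reflect the real datos.bne.es HTML.

function fetchHtml(string $url): string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? '' : $html;
}

function extractRecord(string $html): array {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);               // suppress warnings from imperfect HTML
    $xp = new DOMXPath($doc);
    // Return the first matching node's text, or null when the field is absent.
    $first = function (string $query) use ($xp): ?string {
        $nodes = $xp->query($query);
        return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
    };
    return [
        'title'     => $first('//h1'),                        // placeholder XPath
        'author'    => $first('//span[@class="author"]'),     // placeholder XPath
        'publisher' => $first('//span[@class="publisher"]'),  // placeholder XPath
    ];
}

// Demo on an inline fragment (no network access needed):
$sample = '<html><body><h1>Don Quijote</h1>'
        . '<span class="author">Cervantes</span></body></html>';
$record = extractRecord($sample);
// Missing fields (here, publisher) come back as null.
```

Returning null for absent fields, rather than failing, is what lets the same extractor run unattended over thousands of heterogeneous records.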

Scaling and Stress Testing

A crucial aspect of the research was verifying that the generated code worked not only for a single record but could be scaled to tens of thousands. To this end, we used a second prompt (this time directed at Claude 3.5 Sonnet, another AI model) that took the code generated by ChatGPT and added the functions needed to:

  1. Connect to a MySQL database.
  2. Iterate over the record identifiers (more than 60,000 links).
  3. Handle null or empty values.
  4. Insert the data into a structured table.
  5. Respect a 3-second pause between insertions to avoid overloading the server.
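A hedged sketch of what that scaling layer might look like in PHP. The table layout, URL pattern, and the fetchHtml()/extractRecord() helpers are assumptions standing in for the scraper's own routines, not the paper's actual code:

```php
<?php
// Illustrative sketch only: the records table, the URL pattern, and the
// fetchHtml()/extractRecord() helpers (the scraper's download and parsing
// routines) are assumed for the example.

// Map an extracted record onto the expected columns, turning missing or
// empty fields into NULL so the INSERT never fails on absent metadata.
function normalizeRecord(array $record, array $columns): array {
    $row = [];
    foreach ($columns as $col) {
        $value = $record[$col] ?? null;
        $row[$col] = ($value === '' ? null : $value);
    }
    return $row;
}

// The harvesting loop: iterate over the record identifiers, extract each
// record, insert it, and pause 3 seconds between insertions.
function harvest(PDO $pdo, array $identifiers, array $columns): void {
    $placeholders = implode(', ', array_fill(0, count($columns), '?'));
    $stmt = $pdo->prepare(
        'INSERT INTO records (' . implode(', ', $columns) . ") VALUES ($placeholders)");
    foreach ($identifiers as $id) {
        $html = fetchHtml("https://datos.bne.es/resource/$id"); // assumed URL pattern
        if ($html === '') { continue; }   // download failed (e.g. a 404): skip
        $row = normalizeRecord(extractRecord($html), $columns);
        $stmt->execute(array_values($row));
        sleep(3);                         // avoid overloading the server
    }
}
```

Centralizing null handling in normalizeRecord() keeps the loop itself simple: a 404 skips the record, while a partially filled record is still inserted with NULLs in the missing columns.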

The resulting program processed 62,786 links, of which 7,313 turned out to be 404 errors (non-existent pages), and 55,473 valid records were collected. The extraction was completed in 63 hours (including scheduled pauses), with an effective rate of 66.76 records per minute and an average completeness index of 85.43%—indicating that, overall, the records contain a substantial amount of information.

Discussion: Practical Implications

The results of this work have several relevant implications:

  1. For researchers. The developed methodology eliminates the need for advanced programming skills. A researcher who needs to extract data from a bibliographic catalog can, by following the proposed structure, obtain a functional program within hours, rather than spending weeks on development or resorting to costly commercial tools.
  2. For libraries and documentation centers. The ability to generate custom scrapers facilitates tasks such as data migration between systems, synchronization of catalogs with national or international repositories, or identification of inconsistencies in records. It is a viable alternative when access to standardized protocols is unavailable or technical resources are limited.
  3. For the teaching of information science. Prompt engineering is emerging as a relevant competency in the training of future information professionals. The ability to formulate precise instructions for language models is, in itself, a technical skill that can make the difference between obtaining useful results and not.

Limitations and Future Work

The article also highlights certain limitations. The method has been validated primarily using the catalog of the National Library of Spain, although the principles should be applicable to other catalogs. The generated code is limited to PHP and the Apache/MySQL environment, though it can be adapted to other languages. Additionally, the results are tied to specific versions of the models used (ChatGPT-4o and Claude 3.5 Sonnet), and the evolution of these models may affect the behavior of the prompts.

Among the lines of future work are extending the method to other programming languages, exploring use cases across different types of catalogs, and investigating techniques to further improve accuracy in identifying complex bibliographic fields.