Investigación en Documentación: Aplicaciones académicas, cie

The technological research of Prof. Dr. Manuel Blázquez-Ochando is grounded in a central conviction: Documentation Science must develop its own tools to advance, rather than limiting itself to adopting those from other fields. Over more than fifteen years, this principle has led to a sustained trajectory of software, algorithm, and documentary system development that has adapted to each new technological challenge, from content syndication to generative artificial intelligence.

The Origins: Syndication and Bibliographic Catalogs (2010)

The starting point was the doctoral thesis Applications of Syndication for the Management of Bibliographic Catalogs (UCM, 2010), awarded the Extraordinary Doctoral Prize. It demonstrated that RSS technology could be configured as an alternative to the Z39.50 protocol for the distribution and retrieval of bibliographic collections. The practical outcome was the Sync + Syncore platform, which included the first implementation of MARC-XML as a web service according to the specifications of the Library of Congress. This work laid the foundation for a research line that has accompanied all subsequent developments: the redissemination of content as a vector for documentary management.
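The idea of distributing bibliographic records through syndication can be illustrated with a minimal sketch. It assumes an RSS 2.0 channel whose items each embed a MARC-XML record under the Library of Congress namespace; the function names, the sample record, and the feed URL are hypothetical and do not reproduce the Sync + Syncore implementation.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Library of Congress for MARC-XML records.
MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)


def marc_record(title, author):
    """Build a very small MARC-XML record (fields 100 and 245 only)."""
    record = ET.Element(f"{{{MARC_NS}}}record")
    for tag, value in (("100", author), ("245", title)):
        field = ET.SubElement(
            record,
            f"{{{MARC_NS}}}datafield",
            {"tag": tag, "ind1": " ", "ind2": " "},
        )
        subfield = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": "a"})
        subfield.text = value
    return record


def rss_feed(channel_title, link, records):
    """Wrap (title, author) pairs as RSS 2.0 items carrying MARC-XML."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Bibliographic records"
    for title, author in records:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        item.append(marc_record(title, author))
    return ET.tostring(rss, encoding="unicode")


# Hypothetical catalog feed with a single record.
feed_xml = rss_feed(
    "Hypothetical catalog feed",
    "https://example.org/catalog",
    [("Don Quijote de la Mancha", "Cervantes Saavedra, Miguel de")],
)
print(feed_xml)
```

A client that already reads RSS can then harvest the embedded records with an ordinary XML parser, which is the sense in which syndication can stand in for a dedicated retrieval protocol.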

Web Crawling, Webometrics, and Media Analysis (2011–2014)

The second stage focused on the quantitative analysis of the web and digital media. The webcrawler Mbot enabled large-scale webometric studies, resource detection, data mining, and massive extraction of information sources. In parallel, the experimental platform ReSync initiated a line of automated content classification: starting from the multilingual Eurovoc thesaurus transformed into a functional ontology, five thematic classification algorithms were designed and applied to 400,000 news articles published by Spanish and Mexican media over a one-month period. The results, presented at the Hispano-Mexican Seminar on Library and Information Science, showed classification rates ranging from 1.8% to 99%, depending on the algorithm used, establishing a first-order quantitative benchmark in the field.
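How a thesaurus-driven classifier can yield very different classification rates depending on its strictness can be sketched as follows. This is only an illustration under stated assumptions: the two-category vocabulary is a hypothetical stand-in for Eurovoc, the matching rule is a simple term overlap rather than any of the five algorithms mentioned above, and the classification rate is assumed to mean the share of articles that receive at least one category.

```python
import re

# Hypothetical mini-thesaurus: category -> descriptor terms (not real Eurovoc data).
THESAURUS = {
    "economics": {"inflation", "budget", "tax"},
    "environment": {"climate", "emissions", "pollution"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-záéíóúüñ]+", text.lower()))


def classify(text, min_hits=1):
    """Return the categories whose terms appear at least min_hits times."""
    tokens = tokenize(text)
    return sorted(
        category
        for category, terms in THESAURUS.items()
        if len(tokens & terms) >= min_hits
    )


def classification_rate(articles, min_hits=1):
    """Share of articles that receive at least one category."""
    if not articles:
        return 0.0
    classified = sum(1 for article in articles if classify(article, min_hits))
    return classified / len(articles)


articles = [
    "The budget raises tax on fuel to curb emissions",
    "Inflation slows for the third month",
    "Local team wins the cup final",
]

for threshold in (1, 2):
    rate = classification_rate(articles, min_hits=threshold)
    print(f"min_hits={threshold}: {rate:.0%} of articles classified")
```

Raising the threshold from one matching term to two lowers the rate on the same sample, which hints at why rates can diverge so widely between algorithms.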

During the same period, two further tools were developed: APLIR, a tool environment for teaching classical information retrieval models, and Ocelote, an encyclopedic manager for creating dictionaries and controlled vocabularies in documentary environments.

Specialized Search Engines and Big Data Aggregation (2013–2017)

The third stage explored the limits of general-purpose search engines and the potential of massive information aggregation. WauSearch was an experimental search engine designed to surpass major engines in comprehensiveness and specialization: it incorporated a proprietary ranking system based on similarity coefficients, advanced query assistance, 180 pre-configured searches for public administrations in 180 countries, and export of results in multiple formats. Although currently discontinued, it served as a fundamental testbed for studying user-system interaction in information retrieval.
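Ranking by similarity coefficients can be sketched in a few lines. The source does not state which coefficients WauSearch used, so the Jaccard and Dice coefficients below are assumptions chosen for illustration, and the document identifiers and texts are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient: shared terms over all distinct terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """Dice coefficient: twice the shared terms over the summed set sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def rank(query, documents, coefficient=jaccard):
    """Return document ids ordered by decreasing similarity to the query.

    Documents with zero similarity are dropped; ties are broken by id.
    """
    query_terms = set(query.lower().split())
    scored = [
        (coefficient(query_terms, set(text.lower().split())), doc_id)
        for doc_id, text in documents.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


# Hypothetical index of three short documents.
documents = {
    "d1": "open data portal",
    "d2": "public administration open data portal spain",
    "d3": "weather forecast",
}

print(rank("open data portal", documents))
print(rank("open data portal", documents, coefficient=dice))
```

Passing the coefficient as a parameter keeps the ranking loop independent of the measure, so alternative coefficients can be compared on the same queries.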

Portudois extended this approach to a specific domain: Portuguese cultural heritage. It was the first integrated search engine for libraries, archives, and museums in Portugal, featuring semantic search to obtain enriched results. Its development was linked to a research stay at Universidade Nova de Lisboa (2017).

The aggregator AXYZ represented the most ambitious step in the treatment of informational big data: designed to process thousands of simultaneous RSS channels through five collaborative parsers, it incorporated automatic classification with boolean filters, calculation of news impact factors, analysis of correlation between sources, and interactive relational maps. Its architecture was described in the article Design of an Aggregator for the Management of Informational Big-data, published in El Profesional de la Información (2016).
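Automatic classification with boolean filters can be approximated by the following sketch. The filter structure (required, optional, and excluded terms), the class and function names, and the sample categories are all hypothetical; the published architecture of AXYZ may define its filters differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BooleanFilter:
    """A category assigned when an item satisfies AND / OR / NOT conditions."""

    category: str
    all_of: frozenset = frozenset()   # every term must be present (AND)
    any_of: frozenset = frozenset()   # at least one term must be present (OR)
    none_of: frozenset = frozenset()  # no term may be present (NOT)

    def matches(self, text):
        words = set(text.lower().split())
        if not self.all_of <= words:
            return False
        if self.any_of and not (self.any_of & words):
            return False
        return not (self.none_of & words)


def classify_item(text, filters):
    """Return the categories of every filter the item satisfies, in filter order."""
    return [f.category for f in filters if f.matches(text)]


# Hypothetical filters applied to feed item titles.
filters = [
    BooleanFilter(
        "elections",
        all_of=frozenset({"election"}),
        none_of=frozenset({"football"}),
    ),
    BooleanFilter("economy", any_of=frozenset({"inflation", "budget"})),
]

print(classify_item("election budget debate tonight", filters))
print(classify_item("football federation election", filters))
```

Because each filter is evaluated independently, an item can fall into several categories at once or into none.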

Digital Humanities and Epigraphic Heritage (2016–2021)

Within the framework of the funded research projects HAR2015-63637-P and AVIPES-CM, the research activity expanded into Digital Humanities. Epibase was the resulting system for the management and cataloging of epigraphic documents, based on the EpiDoc standard and compatible with major classical epigraphy databases. In collaboration with Prof. Dr. Manuel Ramírez-Sánchez (ULPGC), EPIHUM was also developed: a database for the online cataloging of Renaissance epigraphy from Spain and Portugal, presented in Epigraphy in the Digital Age (Archaeopress, Oxford, 2021). Both projects demonstrated the transferability of documentary methodologies to highly specialized humanities domains.

Artificial Intelligence Applied to Documentation (2023–present)

The most recent phase marks a shift toward artificial intelligence as both an object and a tool of documentary research. LaIAbot is a conversational agent based on a RAG (Retrieval-Augmented Generation) architecture, specialized in bibliographic recommendation and personalized reader assistance. It combines large language models with a deep understanding of documentary collections and represents the first application of the RAG paradigm developed specifically for the field of Library and Information Science in the Spanish language.
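The retrieval half of a RAG pipeline can be sketched without any language model: the most relevant catalog records are selected first, and only those are placed in the prompt. The sketch below assumes a bag-of-words cosine similarity and a hypothetical three-record catalog; it does not reflect the actual retrieval method, data, or prompts of LaIAbot, and the generation step is deliberately left out.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as a bag-of-words term frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, catalog, k=2):
    """Return the k catalog records most similar to the question."""
    query = vectorize(question)
    ranked = sorted(
        catalog,
        key=lambda record: cosine(query, vectorize(record["summary"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, records):
    """Assemble the augmented prompt that would be sent to a language model."""
    context = "\n".join(f"- {r['title']}: {r['summary']}" for r in records)
    return (
        "Answer using only these catalog records:\n"
        f"{context}\n\nReader question: {question}"
    )


# Hypothetical catalog records.
catalog = [
    {"title": "Book A", "summary": "a detective novel set in madrid"},
    {"title": "Book B", "summary": "introduction to organic chemistry"},
    {"title": "Book C", "summary": "a historical novel about medieval spain"},
]

question = "detective novel in madrid"
selected = retrieve(question, catalog)
prompt = build_prompt(question, selected)
print(prompt)
```

Grounding the prompt in retrieved records is what lets a conversational agent recommend titles that actually exist in the collection.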

The Document Singularity Indicator (IS_d) addresses a novel and critical problem in the context of AI: the selection of training corpora. The index measures the degree of singularity of a document relative to a collection, providing a quantitative criterion to determine which documents contribute genuinely new knowledge to a specialized artificial intelligence system. This work, combined with prompt engineering techniques for bibliographic scraping, opens an unprecedented research line at the intersection of scientometrics and AI systems.
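The text above does not give the formula for IS_d, so the following is not the published indicator. It is only a sketch of one way a singularity score relative to a collection could be computed, under the assumption that a document is singular to the extent that it differs from its nearest neighbour in the collection; the function names and sample texts are hypothetical.

```python
def term_set(text):
    """Lowercased set of whitespace-separated terms."""
    return set(text.lower().split())


def singularity(document, collection):
    """Illustrative score in [0, 1]: 1 minus the highest Jaccard similarity
    between the document and any document in the collection.

    This is an assumed stand-in, NOT the IS_d formula.
    """
    terms = term_set(document)
    best = 0.0
    for other in collection:
        other_terms = term_set(other)
        union = terms | other_terms
        if union:
            best = max(best, len(terms & other_terms) / len(union))
    return 1.0 - best


# Hypothetical collection already selected for training.
collection = [
    "rss syndication for library catalogs",
    "rss syndication for library catalogs and archives",
]

print(singularity("rss syndication for library catalogs", collection))
print(singularity("renaissance epigraphy in portugal", collection))
```

Under this assumption, a near-duplicate scores close to 0 and would add little to a training corpus, while a document with no overlapping vocabulary scores 1.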

In parallel, tools such as ScholarDown (a system for massive extraction of publications from Google Scholar using advanced anti-detection techniques) and phpScrapingPARES (designed for the analysis of authorities within the Spanish Archives Portal) consolidate the massive information retrieval approach as a foundation for bibliometric research and knowledge graph construction.

The formative dimension of this line is complemented by two initiatives. The promptAI repository makes available to the research community the prompts designed and documented within the framework of the author’s scientific publications. The author also participates in the seminar ConocimIA, a forum for discussing the impact of AI on Documentation Sciences, whose activities were published in MÉI: Métodos de Información (2024).

Ongoing Projects

Current research continues along several simultaneous directions: refining the IS_d index for application to larger documentary collections; developing new capabilities for LaIAbot focused on information retrieval in archives and museums; and extending sentiment analysis and automatic classification techniques to the context of Spanish-language scientific information. The author’s GitHub repository is continuously updated and reflects the current status of these developments.

GitHub Repository: https://github.com/manublaz
Research Portal: https://mblazquez.es
ORCID: https://orcid.org/0000-0002-4108-7531