Reference

  1. Blázquez-Ochando, M.; Ovalle-Perandones, M.A. (2024). Semantic authority web project in PARES: extraction and initial analysis. Revista panamericana de comunicación, 6(1). https://doi.org/10.21555/rpc.v6i1.3121

Comment

The so-called Web 3.0 or semantic web represents a fundamental shift from the document-based web: data acquires meaning, and knowledge is built upon it. The layer model defined by Tim Berners-Lee (known as the Semantic Web Layer Cake) establishes an architecture that ranges from URIs to the SPARQL query language, passing through XML, RDF, RDFS, and OWL. In this architecture, data expressed using these technologies are called linked data, and when linked to other datasets, they form knowledge graphs, such as the well-known LOD Cloud.

Within the context of documentary institutions (LAM: Libraries, Archives and Museums), advances in semantic web technologies in Spain have been primarily focused on libraries. Key references include the linked data from the National Library of Spain (datos.bne.es), the Miguel de Cervantes Virtual Library (data.cervantesvirtual.com), and the Digital School Library of CITA. In the archival domain, however, initiatives have been more limited, although notable projects exist in municipal archives (Arganda del Rey, Burgos) and at the regional level (Documents and Archives of Aragon: DARA). At the European level, ontologies such as OAD, ArDO, or RiC-O have enabled the development of projects in archives in Italy and Germany.

The path toward semantic interoperability in archives must be situated within the Records in Contexts (RiC-CM) model, similarly to how semantic interoperability in libraries was grounded in FRBR. PARES, the Spanish Archives Portal, constitutes a fundamental platform for the retrieval of archival data. Managed by the Ministry of Culture, it aggregates eleven state-owned archives and, according to its own statistics, exceeds 77,000 authority records for families, institutions, persons, activities, places, concepts, standards, and single-person offices.

Methodology: Extraction and Analysis of Authorities

The research pursued a dual objective: to describe the types of authorities present in PARES up to the end of 2023, and to identify the network of relationships established among them, with the aim of mapping the portal’s knowledge graph. To this end, web-scraping techniques were employed. The program, developed in PHP using the cURL, DOM, and XPath libraries, systematically extracted all authority records based on their numeric identifiers. Each authority is assigned a URI of the form:

https://pares.mcu.es/ParesBusquedas20/catalogo/autoridad/[identifier]

The program recorded, for each authority, a comprehensive set of fields: type, URI link, authorized form, preferred and non-preferred terms, dates of existence, place of birth, death, residence, generic and related places, latitude, longitude, history, concepts and objects, legal attributions, occupations, related functions, specific terms, information sources, familial and associative relationships, external links, and related documents.
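The identifier-driven extraction described above can be sketched as follows. This is a minimal illustration in Python, not the authors' PHP program: only the base URI comes from the text, while the function names, the User-Agent string, and the error handling are assumptions made for the example. It stops at downloading the page, because the portal's HTML structure (and therefore the XPath expressions needed to pull out each field) is not given here.

```python
from typing import Iterator, Optional
from urllib.error import URLError
from urllib.request import Request, urlopen

# Base URI taken from the text; everything else below is illustrative.
BASE_URI = "https://pares.mcu.es/ParesBusquedas20/catalogo/autoridad/"


def authority_uri(identifier: int) -> str:
    """Build the URI of one authority record from its numeric identifier."""
    return f"{BASE_URI}{identifier}"


def iter_authority_uris(first: int, last: int) -> Iterator[str]:
    """Yield the URIs for every identifier in the inclusive range."""
    for identifier in range(first, last + 1):
        yield authority_uri(identifier)


def fetch_authority(identifier: int, timeout: float = 10.0) -> Optional[str]:
    """Download one authority page; return None if the request fails.

    Hypothetical helper: the User-Agent value is an assumption.
    """
    request = Request(
        authority_uri(identifier),
        headers={"User-Agent": "pares-sketch/0.1"},
    )
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (URLError, TimeoutError):
        # Covers HTTP errors (HTTPError subclasses URLError) and timeouts.
        return None
```

Iterating over `iter_authority_uris(1, 100)` and calling `fetch_authority` on each identifier would reproduce the sequential walk in outline; a real crawl would also need rate limiting and a way to skip identifiers that have no record.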

To facilitate subsequent retrieval, two indexing fields were created: one containing normalized text without stop words (for relevance-based searching) and another containing literal text (for exact matching).
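A minimal Python sketch of the two indexing fields might look like this. The stop-word list, the accent stripping, and the function names are assumptions chosen for illustration; the text only states that one field holds normalized text without stop words and the other holds literal text.

```python
import re
import unicodedata

# Hypothetical, deliberately short Spanish stop-word list.
STOP_WORDS = {"a", "de", "del", "el", "en", "la", "las", "los", "y"}


def literal_index(text: str) -> str:
    """Keep the text as written, collapsing only runs of whitespace."""
    return " ".join(text.split())


def normalized_index(text: str) -> str:
    """Lowercase, strip accents and punctuation, and drop stop words."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    unaccented = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    tokens = re.findall(r"[a-z0-9]+", unaccented)
    return " ".join(token for token in tokens if token not in STOP_WORDS)


print(literal_index("  Felipe II,  Rey de España "))
print(normalized_index("  Felipe II,  Rey de España "))
```

The literal field supports exact matching on the authorized form, while the normalized field lets a query such as "rey espana" match regardless of case, accents, or connecting words. Note that stripping accents also turns "ñ" into "n", which a production index for Spanish might prefer to avoid.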

Results: Quantification and Relationships

The extraction process, carried out on October 31, 2023, yielded 75,443 records distributed by type:


| Type of authority | Number of records |
|---|---|
| Individuals | 27,447 |
| Places | 27,004 |
| Concepts | 10,041 |
| Institutions | 9,397 |
| Families | 702 |
| Standards / Laws | 439 |
| Single-position roles | 358 |
| Functions | 54 |
| Undefined | 1 |

Individuals and places together account for approximately 72% of the total entries, with over 27,000 records each. Concepts and institutions represent roughly 13% and 12%, respectively. The remaining types have a marginal presence, about 2% combined.
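The percentages can be checked directly from the reported counts. The short Python sketch below uses only the figures in the table; the variable and function names are illustrative.

```python
# Record counts per authority type, as reported for the extraction.
COUNTS = {
    "Individuals": 27447,
    "Places": 27004,
    "Concepts": 10041,
    "Institutions": 9397,
    "Families": 702,
    "Standards / Laws": 439,
    "Single-position roles": 358,
    "Functions": 54,
    "Undefined": 1,
}

TOTAL = sum(COUNTS.values())


def share(*types: str) -> float:
    """Return the combined percentage of the total for the given types."""
    return 100 * sum(COUNTS[t] for t in types) / TOTAL


print(TOTAL)
print(round(share("Individuals", "Places"), 1))
print(round(share("Concepts"), 1), round(share("Institutions"), 1))
```

The counts sum to the 75,443 records reported, and the combined share of individuals and places rounds to 72%.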

The analysis of relationships among authorities revealed that individuals form the center of the graph, acting as a pivot for most types of relations: familial relations (9,614), associative relations (33,378), occupations (35,295), concepts and objects (56,643), related places (48,371), and information sources (1,390). Places constitute the other fundamental axis, with a significant presence in relationships involving individuals, institutions, and standards.
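One way to see how such a pivot emerges is to treat each relationship as a typed edge and count how often each authority type appears at either end. The Python sketch below does this for a handful of made-up edges; the identifiers, the `type:id` naming scheme, and the function names are assumptions for illustration and do not reproduce the PARES data or the authors' SQL analysis.

```python
from collections import Counter

# Hypothetical edges: (source authority, relation kind, target authority).
EDGES = [
    ("person:1", "familial", "person:2"),
    ("person:1", "associative", "institution:7"),
    ("person:1", "related place", "place:3"),
    ("person:2", "related place", "place:3"),
    ("institution:7", "related place", "place:3"),
]


def relation_counts(edges):
    """Count edges per relation kind."""
    return Counter(relation for _, relation, _ in edges)


def degree_by_type(edges):
    """Count edge endpoints per authority type (the part before ':')."""
    degrees = Counter()
    for source, _, target in edges:
        degrees[source.split(":")[0]] += 1
        degrees[target.split(":")[0]] += 1
    return degrees


print(relation_counts(EDGES).most_common())
print(degree_by_type(EDGES).most_common())
```

In this toy graph the type with the most endpoints is the person, followed by the place, which mirrors in miniature the two axes described above.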

Among the most significant findings:

  1. Predominance of individuals and families. The high frequency of familial relationships (9,614) enables the analysis of family ties, social life, and genealogy.
  2. Strong interconnection between places and people. The 48,371 relationships between places and people underscore the importance of geolocation and geographical context for understanding historical heritage.
  3. Associative relationships between individuals and institutions. The 33,378 associative relationships suggest a complex network of social and organizational interactions.
  4. Low proportion of relationships with information sources. Only 1,390 relationships with sources were recorded, which indicates an underutilization of documentary references in descriptive records.
  5. Limited interrelation of specific terms. Specific terms are almost exclusively tied to concepts and are not systematically employed in the description of families, places, individuals, or regulations.

Discussion: Toward a Knowledge Graph

In the context of LAMs, access points and authorities have historically been a central concern for professionals. Gracy (2015) argues that, in archival descriptions, frequency analysis of controlled and uncontrolled access points can leverage semantic technologies to develop enriched analytical methods for persons, families, organizations, geographic names, or other entities.

Niu (2016) suggests that projects implementing linked data for archival materials improve both description and information retrieval. This approach holds significant potential for enriching archival data and enhancing its interoperability.

However, as Marciano et al. (2018) note, the production and consumption of documentary corpora have been influenced by data-centric social and industrial trends that bear little relation to more traditional archival methods. The transition toward linked data models therefore requires a methodological adaptation effort.

Conclusions

The main conclusion of the study is that the dominant conceptual relationships in PARES occur between persons and places, and between concepts and institutions. This demonstrates that geolocation and the understanding of the geographical context of authorities hold significant importance in the portal’s semantic graph. Relationships between persons and families enable the analysis of social life and genealogy, while associative relationships between persons and institutions open the door to data mining for discovering new and unexpected patterns and connections.

The research also reveals significant shortcomings: the least connected authorities are functions and single-person roles, whose connections are intrinsically tied to institutions and individuals. Specific terms exhibit little interrelation with authorities, being almost exclusively linked to concepts. According to the structure of the descriptive records, concepts and specific terms form a controlled vocabulary structured as a hierarchical thesaurus; however, this system does not appear to have been systematically employed in describing families, places, functions, persons, or regulations.

Through the description of authorities and the network of relationships, the knowledge graph of PARES has been mapped out, laying the groundwork for future developments in semantic archival web technologies in Spain. This work constitutes a first step toward publishing these data as linked data, thereby contributing to the integration of Spanish documentary heritage into the open data cloud.

Research Materials

  1. Blázquez-Ochando, M.; Ovalle-Perandones, M.A. (2024). PHP Scraping PARES. Data extraction function for PARES Authorities; Semantic construction prompts; SQL results. https://github.com/manublaz/phpSrapingPARES