SEMTEST (Semantic Enrichment Test) is an interactive demonstrator designed to illustrate, step by step and in a fully transparent manner, the process of semantic enrichment of document queries. Starting from a single term or free phrase entered by the user, the system automatically deploys a nine-phase pipeline that traverses the main open knowledge infrastructures of the Semantic Web: Wikipedia, Wikimedia Commons, DBpedia, Wikidata, and Open Library.

▶ SEMTEST Semantic Query Enrichment Demo

https://mblazquez.es/lab/semTest/

The goal is not merely to retrieve information about a concept, but to demonstrate how that concept can be progressively enriched through the aggregation of descriptors, categories, semantic relationships, graphical resources, associated bibliographies, and formal representations in RDF standards—all with visible real URLs, raw JSON responses, and source code for each call.

The tool originated as an experiment linked to the Portudois search engine project and as a teaching demonstrator for the Advanced Information Retrieval Techniques course. In 2026, the use of Artificial Intelligence made its rapid renewal possible, significantly expanding its technical and pedagogical scope.

The Concept of Semantic Enrichment in Documentation

Semantic enrichment of queries is a central process in modern information retrieval systems. It is based on a simple premise: the term entered by a user into a search engine is generally ambiguous, incomplete, and decontextualized. “Artificial intelligence,” “Natural psychology,” or “Renaissance” are textual labels that only acquire their full meaning when related to a structured set of concepts, categories, properties, and associated entities.

In the field of Library and Information Science, this process directly connects with established traditions such as subject analysis, authority control, thesauri, and ontologies. The novelty introduced by the Semantic Web environment is that these knowledge structures are no longer confined within proprietary tools but are published as open linked data (Linked Open Data) and accessible via standard APIs. SEMTEST transforms this access into a pedagogically visible process.

The Nine Phases of the Semantic Pipeline

The SEMTEST pipeline is structured into nine sequential phases, each of which adds a layer of enrichment to the original concept. The student can follow each step in real time thanks to a progress bar that reflects the status of each phase, and can inspect the URLs used, the full JSON responses, and the source HTML code of each queried source.

Phase 1 — Search via the MediaWiki API on Wikipedia

The first phase converts the term entered by the user into a structured query against the official Spanish Wikipedia JSON API (action=query&list=search&srwhat=text). The system retrieves up to ten articles related to the concept, ranked by relevance, with information on their size, word count, and a text snippet that helps assess the relevance of each result.

The first article in the list becomes the reference article (top-1) for the remainder of the pipeline. This mechanism replicates the behavior of automatic entity disambiguation systems (entity linking), which select the most probable meaning of a term based on context.
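
As an illustration, the Phase 1 call can be reproduced with a few lines of PHP. This is a minimal sketch, not the actual SEMTEST code: the example term and the User-Agent string are illustrative assumptions.

  // Minimal sketch of the Phase 1 search call against the Spanish Wikipedia API.
  $term = 'inteligencia artificial';                        // example query term (assumption)
  $url  = 'https://es.wikipedia.org/w/api.php?' . http_build_query([
      'action'   => 'query',
      'list'     => 'search',
      'srsearch' => $term,
      'srwhat'   => 'text',
      'srlimit'  => 10,                                     // up to ten articles, as described above
      'format'   => 'json',
  ]);
  $ctx  = stream_context_create(['http' => ['user_agent' => 'SEMTEST-demo/1.0 (example)']]);
  $data = json_decode(file_get_contents($url, false, $ctx), true);
  foreach ($data['query']['search'] as $hit) {
      // Each result includes title, pageid, size, wordcount and a snippet.
      echo $hit['title'], ' (', $hit['wordcount'], " words)\n";
  }
  $top1 = $data['query']['search'][0]['title'] ?? null;     // reference article for later phases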

Figure 1. In phase 1, all articles related to the search topic are retrieved

Phase 2 — Extraction of contents from the top-1 article

Once the reference article is identified, two additional calls are made to the Wikipedia API:

  1. A first call with prop=extracts|links to obtain the plain text of the article (without HTML markup) and the internally linked terms (up to 500 namespace 0 entities). These terms constitute an initial network of related concepts.
  2. A second call with action=parse&prop=sections to extract the article sections (headings), which serve as implicit thematic descriptors of the concept.

The output of this phase is an emergent documentary vocabulary: the set of entities that Wikipedia considers sufficiently relevant to link from the article, plus the thematic structure of the article itself.
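
The two calls described above can be sketched as follows; the title, limits, and User-Agent are illustrative assumptions rather than the actual SEMTEST parameters.

  // Sketch of the two Phase 2 calls on the top-1 article.
  $title = 'Inteligencia artificial';                       // top-1 title from Phase 1 (example)
  $api   = 'https://es.wikipedia.org/w/api.php';
  $ctx   = stream_context_create(['http' => ['user_agent' => 'SEMTEST-demo/1.0 (example)']]);

  // Call 1: plain-text extract plus up to 500 namespace-0 links.
  $q1 = $api . '?' . http_build_query([
      'action' => 'query', 'titles' => $title, 'prop' => 'extracts|links',
      'explaintext' => 1, 'plnamespace' => 0, 'pllimit' => 500, 'format' => 'json',
  ]);
  $page = json_decode(file_get_contents($q1, false, $ctx), true);

  // Call 2: section headings, used as implicit thematic descriptors.
  $q2 = $api . '?' . http_build_query([
      'action' => 'parse', 'page' => $title, 'prop' => 'sections', 'format' => 'json',
  ]);
  $sections = json_decode(file_get_contents($q2, false, $ctx), true)['parse']['sections'] ?? [];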

Figure 2. The first article from the search is retrieved, obtaining headings, paragraphs, sections, and linked terms

Phase 3 — Extraction of backlinks

Backlinks are Wikipedia articles that link to the reference article. Their extraction is performed using prop=linkshere with the numeric page ID (pageID), which is Wikipedia’s stable identifier, independent of changes to the article title.
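
A minimal sketch of this call, assuming a hypothetical pageID and an illustrative result limit:

  // Sketch of the Phase 3 backlinks call, addressed by the stable numeric pageID.
  $pageId = 12345;                                          // hypothetical pageID of the top-1 article
  $url = 'https://es.wikipedia.org/w/api.php?' . http_build_query([
      'action' => 'query', 'pageids' => $pageId, 'prop' => 'linkshere',
      'lhnamespace' => 0, 'lhlimit' => 100, 'format' => 'json',
  ]);
  $ctx  = stream_context_create(['http' => ['user_agent' => 'SEMTEST-demo/1.0 (example)']]);
  $data = json_decode(file_get_contents($url, false, $ctx), true);
  $backlinks = $data['query']['pages'][$pageId]['linkshere'] ?? [];  // articles that link to the concept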

In documentary terms, backlinks are equivalent to reverse reference citations in a thesaurus: articles that mention the concept as part of their primary content. The set of backlinks indirectly defines the scope of application and domains of use of the concept.

Figure 3. The backlinks of the consulted article are shown

Phase 4 — Retrieval of multimedia resources from Wikimedia Commons

Wikimedia Commons is the free multimedia repository that supports all Wikimedia projects. This phase queries its API using action=query&list=search&srnamespace=6 to locate image files related to the term. Subsequently, a second call with prop=imageinfo&iiprop=url&iiurlwidth=180 retrieves the actual URLs of the thumbnails and descriptive pages for each file.
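
Both Commons calls can be sketched as follows; the search term and result limit are illustrative assumptions.

  // Sketch of the two Wikimedia Commons calls from Phase 4.
  $api = 'https://commons.wikimedia.org/w/api.php';
  $ctx = stream_context_create(['http' => ['user_agent' => 'SEMTEST-demo/1.0 (example)']]);

  // Step 1: search the File: namespace (6) for files related to the term.
  $q1 = $api . '?' . http_build_query([
      'action' => 'query', 'list' => 'search', 'srsearch' => 'Inteligencia artificial',
      'srnamespace' => 6, 'srlimit' => 12, 'format' => 'json',
  ]);
  $files  = json_decode(file_get_contents($q1, false, $ctx), true)['query']['search'] ?? [];
  $titles = implode('|', array_column($files, 'title'));

  // Step 2: resolve thumbnail and description-page URLs for those files.
  $q2 = $api . '?' . http_build_query([
      'action' => 'query', 'titles' => $titles, 'prop' => 'imageinfo',
      'iiprop' => 'url', 'iiurlwidth' => 180, 'format' => 'json',
  ]);
  $images = json_decode(file_get_contents($q2, false, $ctx), true);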

The results are presented as an interactive visual gallery. Each image links to its page on Commons, where the student can access metadata regarding authorship, license, and categories. From a documentary perspective, the retrieved images constitute iconic representations of the concept, complementary to the textual representations from previous phases.

Figure 4. The multimedia content related to the query is shown

Phase 5 — Retrieval of the semantic resource from DBpedia

DBpedia is the structured and linked version of Wikipedia: a knowledge graph that extracts properties from information boxes (infoboxes) and categories of Wikipedia articles and publishes them as RDF triples.

This phase operates in two steps. The first queries the DBpedia Lookup Service (https://lookup.dbpedia.org/api/search) to obtain the canonical URI of the resource in the DBpedia graph, thereby resolving issues of ambiguity, redirections, and orthographic differences between the query term and the exact designation used in DBpedia. The second step accesses the HTML page of the resource to extract its title, introductory note, summary (dbo:abstract in Spanish), thematic categories (dct:subject), ontological types (rdf:type), and external links.
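
A minimal sketch of the first step, the Lookup call; the example term, the result limit, and the response field names noted in the comments are assumptions about the Lookup service's JSON output.

  // Sketch of the Phase 5 DBpedia Lookup call (step 1: resolving the canonical URI).
  $url = 'https://lookup.dbpedia.org/api/search?' . http_build_query([
      'query' => 'Artificial intelligence', 'maxResults' => 5,   // illustrative term and limit
  ]);
  $ctx = stream_context_create(['http' => [
      'user_agent' => 'SEMTEST-demo/1.0 (example)',
      'header'     => "Accept: application/json\r\n",            // request the JSON representation
  ]]);
  $hits = json_decode(file_get_contents($url, false, $ctx), true);
  // Field names below ('docs', 'resource') are assumptions about the Lookup JSON output.
  $canonicalUri = $hits['docs'][0]['resource'][0] ?? null;
  // Step 2 then fetches the resource's HTML page at https://dbpedia.org/page/... and extracts
  // dbo:abstract, dct:subject, rdf:type and the external links.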

The distinction between the canonical RDF URI (http://dbpedia.org/resource/X) and the web navigation URL (https://dbpedia.org/page/X) is one of the fundamental concepts of Linked Data: the URI is the identifier of the resource in the RDF graph's namespace; the URL with /page/ is the HTML representation of that resource for human consumption.

Figure 5. Semantic information linked to the query in DBpedia

Phase 6 — Retrieval of the Semantic Resource in Wikidata

Wikidata is the structured knowledge base of the Wikimedia projects, distinct from DBpedia in that it is actively maintained by a community rather than automatically derived from Wikipedia. This phase utilizes the Wikidata API in two steps:

  1. wbsearchentities to locate the Q identifier of the concept (e.g., Q11660 for Artificial Intelligence).
  2. wbgetentities with props=claims to extract all claims (statements) pointing to other Q entities. A subsequent batch resolution of labels in Spanish and English then converts the Q identifiers into human-readable names.

The result is a dense network of entities related to the concept: people, places, disciplines, technologies, works, institutions. This network is qualitatively different from Wikipedia’s: while Wikipedia’s internal links are editorial and contextual, Wikidata’s relationships are explicitly typed and form part of a formal ontological model.
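
A minimal sketch of the two calls described above, with an illustrative search term and limits:

  // Sketch of the two Wikidata API calls from Phase 6.
  $api = 'https://www.wikidata.org/w/api.php';
  $ctx = stream_context_create(['http' => ['user_agent' => 'SEMTEST-demo/1.0 (example)']]);

  // Step 1: locate the Q identifier of the concept.
  $q1 = $api . '?' . http_build_query([
      'action' => 'wbsearchentities', 'search' => 'inteligencia artificial',
      'language' => 'es', 'limit' => 1, 'format' => 'json',
  ]);
  $qid = json_decode(file_get_contents($q1, false, $ctx), true)['search'][0]['id'] ?? null;  // e.g. Q11660

  // Step 2: retrieve all claims (statements) of that entity.
  $q2 = $api . '?' . http_build_query([
      'action' => 'wbgetentities', 'ids' => $qid, 'props' => 'claims', 'format' => 'json',
  ]);
  $claims = json_decode(file_get_contents($q2, false, $ctx), true)['entities'][$qid]['claims'] ?? [];
  // The Q identifiers referenced in these claims are then resolved in batches with
  // wbgetentities&props=labels&languages=es|en to obtain human-readable names.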

Phase 6 also includes an embedded Wikidata Query Service viewer (iframe), which executes a visual SPARQL query over the entity’s relationships.

Figure 6. SPARQL query for the search topic, identifier, and semantically related entities

Phase 7 — Direct SPARQL Query to the DBpedia Endpoint

This phase executes an actual SPARQL query against the public DBpedia endpoint (https://dbpedia.org/sparql), retrieving up to 60 triples of the resource filtered by language (Spanish, English, and unlabeled literals) and by value type (literals and URIs). The results are presented in a structured table with three columns: property (abbreviated name with tooltip showing the full URI), value (linked when it is a URI), and data type.

A relevant technical aspect with explicit pedagogical value: the DBpedia SPARQL endpoint requires the HTTP header Accept: application/sparql-results+json, which differs from the generic MIME type application/json. This distinction is visible in the system interface, where the exact header used in the call is displayed.
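
A minimal sketch of such a call; the resource URI is illustrative and the query below is a simplified version of the kind of filtered query described above.

  // Sketch of the Phase 7 SPARQL call against the public DBpedia endpoint.
  $resource = 'http://dbpedia.org/resource/Artificial_intelligence';   // illustrative canonical URI
  $sparql = 'SELECT ?p ?o WHERE { <' . $resource . '> ?p ?o .
      FILTER ( !isLiteral(?o) || lang(?o) = "" || lang(?o) = "es" || lang(?o) = "en" ) } LIMIT 60';
  $url = 'https://dbpedia.org/sparql?' . http_build_query(['query' => $sparql]);
  $ctx = stream_context_create(['http' => [
      'user_agent' => 'SEMTEST-demo/1.0 (example)',
      'header'     => "Accept: application/sparql-results+json\r\n",   // the header discussed above
  ]]);
  $rows = json_decode(file_get_contents($url, false, $ctx), true)['results']['bindings'] ?? [];
  foreach ($rows as $row) {
      // Each binding carries the property URI, the value, and its type (uri or literal).
      echo $row['p']['value'], ' => ', $row['o']['value'], ' (', $row['o']['type'], ")\n";
  }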

The contrast between Phases 5 and 7 illustrates the difference between HTML access (web of documents) and RDF/SPARQL access (web of data) to the same information.

Figure 7. The canonical URL of the searched entity and its public web page are shown, along with the SPARQL query used to retrieve its RDF triples.

Phase 8 — Bibliographic enrichment via Open Library

Open Library is the open bibliographic catalog of the Internet Archive, containing over 20 million records of works. This phase queries its search API (https://openlibrary.org/search.json) to retrieve up to twelve works related to the queried concept.

For each work, the title, authors, year of first publication, and cover image (from https://covers.openlibrary.org) are retrieved. Results are presented as a grid of visual, clickable bibliographic cards that link to the full record page on Open Library.
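
A minimal sketch of the Open Library call, with an illustrative query term; the field names used here (title, author_name, first_publish_year, cover_i) are those of the public search.json response.

  // Sketch of the Phase 8 bibliographic query against Open Library.
  $url = 'https://openlibrary.org/search.json?' . http_build_query([
      'q' => 'artificial intelligence', 'limit' => 12,      // up to twelve works, as described above
  ]);
  $ctx  = stream_context_create(['http' => ['user_agent' => 'SEMTEST-demo/1.0 (example)']]);
  $docs = json_decode(file_get_contents($url, false, $ctx), true)['docs'] ?? [];
  foreach ($docs as $doc) {
      $title   = $doc['title'] ?? '';
      $authors = implode(', ', $doc['author_name'] ?? []);
      $year    = $doc['first_publish_year'] ?? '';
      // Cover thumbnails are served by numeric cover ID from covers.openlibrary.org.
      $cover   = isset($doc['cover_i'])
          ? 'https://covers.openlibrary.org/b/id/' . $doc['cover_i'] . '-M.jpg'
          : null;
      echo "$title ($year) - $authors\n";
  }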

This phase establishes a bridge between the semantic enrichment of the concept and its bibliographic materialization: the textual resources that have articulated, developed, or debated this concept over time. From a documentary perspective, it is the phase that connects the conceptual analysis process with classical bibliographic management tools.

Figure 8. Additionally, a bibliographic enrichment query is performed in Open Library, related to the original search and entity

Phase 9 — Visualization of the semantic graph and RDF export

The final phase synthesizes all previous results into two outputs. The first is an interactive D3.js graph: a force-directed graph, built with the D3.js v7 library, that visually represents the relationships between the central concept and the nodes retrieved in prior phases. Nodes are color-coded by source (Wikipedia, DBpedia, Wikidata, Open Library) and sized according to their relevance. The graph is fully interactive: it supports zooming, panning, and dragging of individual nodes, and clicking on any node opens the corresponding resource URL.
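
As an illustration of what the underlying graph data looks like, the sketch below builds a minimal node-and-edge structure of the kind a D3.js force-directed layout consumes; the node names, sources, and weights are invented for the example, and the structure actually produced by SEMTEST may differ.

  // Illustrative node/edge structure for a D3.js force-directed graph (example values only).
  $graph = [
      'nodes' => [
          ['id' => 'Inteligencia artificial', 'source' => 'wikipedia', 'weight' => 10],
          ['id' => 'Q11660',                  'source' => 'wikidata',  'weight' => 8],
          ['id' => 'http://dbpedia.org/resource/Artificial_intelligence', 'source' => 'dbpedia', 'weight' => 8],
      ],
      'links' => [
          ['source' => 'Inteligencia artificial', 'target' => 'Q11660'],
          ['source' => 'Inteligencia artificial', 'target' => 'http://dbpedia.org/resource/Artificial_intelligence'],
      ],
  ];
  // This is, in essence, the kind of structure offered by the "JSON Graph (D3)" export listed below.
  echo json_encode($graph, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);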

The second output is export in four standard Semantic Web formats:

  1. JSON-LD — a JSON-based linked data format with Schema.org, DBpedia, and Wikidata context. Ready to be embedded in web pages as structured metadata.
  2. Turtle (.ttl) — RDF serialization with prefixes schema:, dct:, and owl:, directly importable into a triplestore.
  3. JSON Graph (D3) — the node and edge structure in JSON format, reusable for other visualization projects.
  4. SVG — the vector image of the graph as displayed on screen, downloadable for publication or presentation.

Figure 9. Semantic graph showing relationships and linked data concerning the queried entity

Main changes compared to the first version

The original version of SEMTEST was a functional demonstrator designed to illustrate the concept of semantic enrichment, but it suffered from significant technical limitations that eventually rendered it inoperable. The renewal in version 2.3 has been comprehensive: this is not a collection of patches, but a complete rewrite focused on robustness, extensibility, and pedagogical quality.

From HTML scraping to official JSON APIs

This is the most profound change. The original version constructed Wikipedia search URLs directly as web interface URLs, using parameters such as profile=advanceddefault, which became invalid with MediaWiki updates. Backlinks were obtained by accessing the Special:WhatLinksHere page, whose HTML structure also changed over time. Version 2.3 replaces all these fragile calls with the official MediaWiki and Wikidata JSON APIs:


Function           | Original Version                            | Version 2.3
-------------------|---------------------------------------------|----------------------------------------------------------
Wikipedia Search   | HTML scraping with profile=advanceddefault  | action=query&list=search&srwhat=text
Article Content    | HTML XPath scraping                         | prop=extracts|links + action=parse&prop=sections
Backlinks          | Scraping of Special:WhatLinksHere           | prop=linkshere&lhnamespace=0
Wikimedia Images   | HTML scraping (broken relative URLs)        | action=query&list=search + prop=imageinfo&iiprop=url
DBpedia            | Manual URL /page/Term                       | Lookup API → canonical URI → data page
Wikidata           | HTML scraping of the entity page            | wbsearchentities + wbgetentities&props=claims|labels
SPARQL             | Did not exist                               | Direct call with Accept: application/sparql-results+json
Open Library       | Did not exist                               | API search.json

Fixing cURL timeouts and User-Agent

A technical issue affecting all phases was the configuration of func.curl.php. The original getcurl1() function had a CURLOPT_CONNECTTIMEOUT of 1 second and a CURLOPT_TIMEOUT of 4 seconds, values too low for remote Wikipedia and Wikidata APIs (which can take 3–8 seconds to respond). Additionally, no User-Agent was sent, and Wikipedia blocks or returns empty responses to requests lacking identification.

Version 2.3 introduces a new function, getcurl_api(), with timeouts of 10 and 30 seconds respectively, CURLOPT_FOLLOWLOCATION=true (to follow 301/302 redirects), automatic decompression via CURLOPT_ENCODING='', and an identified User-Agent. For the SPARQL endpoint, the correct header Accept: application/sparql-results+json is additionally used.
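
By way of illustration, a helper along these lines could look as follows. This is a sketch based on the settings described above, not the actual code of func.curl.php, and the User-Agent string is an invented example.

  // Illustrative sketch of an API-oriented cURL helper with the settings described above.
  function getcurl_api_sketch(string $url, array $extraHeaders = []): ?string
  {
      $ch = curl_init($url);
      curl_setopt_array($ch, [
          CURLOPT_RETURNTRANSFER => true,
          CURLOPT_CONNECTTIMEOUT => 10,              // was 1 second in the original getcurl1()
          CURLOPT_TIMEOUT        => 30,              // was 4 seconds in the original getcurl1()
          CURLOPT_FOLLOWLOCATION => true,            // follow 301/302 redirects
          CURLOPT_ENCODING       => '',              // accept and transparently decompress gzip/deflate
          CURLOPT_USERAGENT      => 'SEMTEST/2.3 (educational demo; example identification)',
          CURLOPT_HTTPHEADER     => $extraHeaders,   // e.g. ['Accept: application/sparql-results+json']
      ]);
      $body = curl_exec($ch);
      curl_close($ch);
      return $body === false ? null : $body;
  }

  // Example usage for the SPARQL endpoint:
  // $json = getcurl_api_sketch($sparqlUrl, ['Accept: application/sparql-results+json']);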

Three new knowledge phases

Phases 7, 8, and 9 did not exist in the original version:

  1. Phase 7 (SPARQL): Direct access to the DBpedia RDF graph via SPARQL, illustrating the fundamental difference between conventional web access and semantic access to the same resource.
  2. Phase 8 (Open Library): Bibliographic enrichment that closes the cycle by connecting conceptual analysis with documentary records.
  3. Phase 9 (Graph + Export): Visual synthesis and integration of results into standard Semantic Web formats, transforming the analysis process into an exportable knowledge graph.

Applications and Uses in Library and Information Science

SEMTEST is primarily designed as a teaching tool, but its applications extend beyond the classroom.

In information retrieval instruction, it enables practical demonstration of concepts that would otherwise remain abstract: what a URI is, how Linked Open Data works, the difference between a free-text search and a SPARQL query, and what it means for a concept to be semantically “enriched.” The student does not merely read about these concepts: they see them functioning in real time through their own queries.

In subject analysis, it can serve as an exploratory tool to identify candidate descriptors, related terms, and bibliographic categories for any concept. Results from DBpedia (dct:subject) and Wikidata are directly comparable with the subject headings lists of the BNE or specialized thesauri.

In semantic cataloging projects, export in JSON-LD and Turtle makes SEMTEST a prototype tool for the automated generation of structured metadata. Exported records can be ingested directly into a triplestore or used as a foundation for enriching existing bibliographic records.

In digital humanities projects, the semantic graph and the Open Library bibliography provide a starting point for mapping conceptual relationships within a research domain, identifying key authors, and connecting primary sources with formal representations of knowledge.

As a technological demonstrator, it illustrates the current state of Semantic Web services: what is available, how it is accessed, what each API returns, and what its limitations are. This is especially valuable in a context where large language models tend to obscure the underlying infrastructure of structured knowledge.

Final Reflection

Semantic enrichment is not a new process in Documentation. Information professionals have been doing exactly this for decades: taking a term, contextualizing it within a knowledge structure, assigning descriptors, relating it to other concepts, placing it within categories, and representing it in standardized formats. What has changed is that this knowledge structure is now available on the web, accessible via standard APIs, and can be queried automatically.

SEMTEST aims to demonstrate this bridge: that what we do in Documentation when analyzing subjects and constructing representations of knowledge is, conceptually, the same as what a machine does when querying DBpedia, executing a SPARQL query, or downloading the Wikidata graph. The intellectual rigor is the same; what changes is the scale and speed.

Version 2.3 is a more robust and comprehensive tool than the original. But above all, it is a living tool: each query executes real processes on real infrastructures, with results that evolve over time. That, in itself, is part of the lesson.