The APLIR project (Applications for Teaching in Information Retrieval) involves the development of tools, simulators, and small applications designed to promote awareness and learning of various aspects related to information retrieval. These tools are used in teaching courses such as Advanced Information Retrieval Techniques, Information Systems Evaluation, and Web Retrieval Systems, and are freely accessible to any information specialist or professional. Below, each tool is described.
Source Code Removal Exercise
An application that demonstrates the functioning of source code removal mechanisms available in many web crawler programs. The user simply needs to input the HTML source code of the webpage they wish to test in order to obtain clean text without tags. This step is crucial for information retrieval on the web, as it precedes normalization, storage, and indexing of web content.
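The tag-stripping step described above can be sketched in a few lines. This is a minimal illustration using Python's standard html.parser module, not the tool's actual implementation, which is not detailed in the source:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text content of an HTML document, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def strip_tags(html):
    """Return the visible text of an HTML fragment, with tags removed."""
    stripper = TagStripper()
    stripper.feed(html)
    return " ".join(stripper.chunks)

print(strip_tags("<p>Information <b>retrieval</b> basics</p>"))
# → Information retrieval basics
```

A production crawler would also discard the contents of script and style elements; this sketch keeps all character data.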
Tokenization Exercise
Demonstrates how a web crawler processes text word by word after source code removal, converting each word into its equivalent hexadecimal string representation.
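A word-by-word pass of this kind can be sketched as follows; the hexadecimal encoding here is the hex string of each token's UTF-8 bytes, which is one plausible reading of the exercise's "hexadecimal string representation":

```python
def tokenize(text):
    """Split cleaned text into lowercase word tokens."""
    return [tok for tok in text.lower().split() if tok]

def to_hex(token):
    """Encode a token as the hexadecimal string of its UTF-8 bytes."""
    return token.encode("utf-8").hex()

for word in tokenize("Web retrieval"):
    print(word, to_hex(word))
# web 776562
# retrieval 72657472696576616c
```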
Character Normalization Exercise
Character normalization occurs when characters, special symbols, and punctuation marks are removed or replaced using pre-stored and programmed replacement strings within the web crawler, yielding a text suitable for indexing and retrieval processing.
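A replacement-table approach of this kind can be sketched as below; the table entries are illustrative stand-ins for the crawler's pre-stored replacement strings:

```python
# Hypothetical replacement table, analogous to the crawler's pre-stored strings.
REPLACEMENTS = str.maketrans({"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u", "ñ": "n"})
PUNCTUATION = ".,;:!?¡¿()\"'"

def normalize(text):
    """Lowercase the text, replace accented characters, and drop punctuation."""
    text = text.lower().translate(REPLACEMENTS)
    return "".join(ch for ch in text if ch not in PUNCTUATION)

print(normalize("¡Recuperación de información!"))
# → recuperacion de informacion
```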
Stop Words Removal Exercise
Another essential process for optimizing information retrieval capabilities is stop words removal. The exercise illustrates how, from a given text, the system eliminates all words identified as stop words for each language. As Luhn's law predicts, approximately 50% of all terms in the input text are typically removed.
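Stop-word filtering reduces to a set-membership test per token. The list below is a small illustrative English sample; real systems keep a full list per language:

```python
# Illustrative stop-word list; production systems keep one list per language.
STOP_WORDS = {"the", "of", "a", "an", "in", "is", "and", "to"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = "the evaluation of a retrieval system".split()
print(remove_stop_words(tokens))
# → ['evaluation', 'retrieval', 'system']
```

Note that three of the six input tokens are removed here, in line with the roughly 50% reduction mentioned above.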
TF-IDF Weight Calculator
The weight calculator enables the computation of term weights by inputting N (the total number of documents in the collection), DF (the number of documents in which the term appears), and TF (the term frequency in document d). Results are obtained using the default formulation as well as various alternative formulations.
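The source does not specify which formulation the calculator uses by default; a common one is tf·log(N/df), sketched here with illustrative input values:

```python
import math

def tf_idf(tf, df, n_docs):
    """Classic tf·idf weight: term frequency times the log of the
    inverse document frequency, log10(N / DF)."""
    return tf * math.log10(n_docs / df)

# A term occurring 3 times in a document and present in 100 of 10,000 documents:
print(tf_idf(3, 100, 10_000))
# → 6.0
```

Alternative formulations typically vary the TF component (raw, logarithmic, augmented) or smooth the IDF term.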
Boolean Model Simulator
The Boolean simulator, like the other simulators, runs its retrieval algorithm against a static, real, and reliable collection large enough (20,000 news articles produced by Spanish media sources, obtained from the ReSync platform) to demonstrate the various effects and outcomes of each retrieval process. The query operators AND, OR, NOT, and XOR can all be tested.
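With an inverted index, the four operators map directly onto set operations over posting lists. A minimal sketch with a toy index (the terms and document identifiers are invented for illustration):

```python
# Toy inverted index: term → set of document identifiers (postings).
index = {
    "news": {1, 2, 3},
    "spain": {2, 3, 4},
    "media": {3, 5},
}

all_docs = {1, 2, 3, 4, 5}

def AND(a, b):
    return a & b          # intersection of postings

def OR(a, b):
    return a | b          # union of postings

def NOT(a):
    return all_docs - a   # complement against the whole collection

def XOR(a, b):
    return a ^ b          # symmetric difference: one term but not both

print(sorted(AND(index["news"], index["spain"])))  # [2, 3]
print(sorted(XOR(index["news"], index["spain"])))  # [1, 4]
```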
Vector Model Simulator
The vector simulator allows users to define the weights of terms used in queries and verify the mathematical computation for each provided result. The formula employed is the cosine similarity, which yields the ranking factor of results with a precision of 10 decimal places.
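The cosine computation behind that ranking factor can be sketched as follows; the vector values are illustrative term weights, not data from the simulator:

```python
import math

def cosine_similarity(query, doc):
    """Cosine of the angle between two term-weight vectors:
    dot product divided by the product of the vector norms."""
    dot = sum(q * d for q, d in zip(query, doc))
    norm = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(d * d for d in doc))
    return dot / norm if norm else 0.0

# Illustrative query and document weight vectors over a shared vocabulary:
print(f"{cosine_similarity([1.0, 0.5, 0.0], [0.8, 0.4, 0.2]):.10f}")
```

Printing with ten decimal places mirrors the precision the simulator reports for its ranking factor.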
Probabilistic Model Simulator
The probabilistic model employs a default algorithm that initially calculates query term weights using the value of maximum uncertainty. It also includes a relevance feedback mechanism that enables students to identify which results are relevant, thereby influencing query reformulation and the subsequent calculation of term weights.
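The source does not name the exact weighting scheme; a standard reading is the Robertson–Sparck Jones weight, where "maximum uncertainty" means setting the probability of term occurrence in relevant documents to 0.5 before any feedback. A sketch under that assumption:

```python
import math

def rsj_weight(p_rel, p_nonrel):
    """Robertson–Sparck Jones term weight from the probabilities of the term
    occurring in relevant and non-relevant documents."""
    return math.log((p_rel * (1 - p_nonrel)) / (p_nonrel * (1 - p_rel)))

def initial_weight(df, n_docs):
    """With no relevance information, p is fixed at 0.5 (maximum uncertainty)
    and the non-relevant probability is estimated as DF / N."""
    return rsj_weight(0.5, df / n_docs)

# A term appearing in 100 of 10,000 documents:
print(round(initial_weight(100, 10_000), 4))
# → 4.5951
```

After feedback, p_rel and p_nonrel are re-estimated from the documents the student marked relevant, which is what drives the query reformulation described above.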
Method for Evaluating a Retrieval System
The evaluation method for retrieval systems grew out of research presented at the 9th Hispano-Mexican Seminar in 2012 on the development of automated content classification systems, where it proved essential to verify the accuracy of the classification/retrieval algorithm. To that end, an automated evaluation template was designed that enables students to judge whether the results obtained for a given thematic category are relevant, marking each one with a button that determines the percentage of relevance of the content. Each click is transmitted to a database system that processes the information and produces a comprehensive report on the accuracy of the retrieval algorithm.
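The aggregate the database computes from those clicks amounts to a precision figure per category. A minimal sketch of that calculation, with invented judgements:

```python
def precision(judgements):
    """Fraction of retrieved results judged relevant (True = relevant click)."""
    return sum(judgements) / len(judgements) if judgements else 0.0

# Ten results returned for one thematic category, each marked by a student:
clicks = [True, True, False, True, True, True, False, True, True, True]
print(precision(clicks))
# → 0.8
```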
Usability Test for Web Navigation
Although not considered a pure aspect of information retrieval, web usability is closely related to information searching, as it significantly affects the ease of user navigation. In this tool, user navigation is tracked through a series of questions, each representing a search task that must be completed by clicking on hyperlinks. The number of clicks required by the user to achieve the goal indicates which content is most visible and, globally, how usable the website is.
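The click-count metric described above can be aggregated very simply. The task names and counts below are invented for illustration:

```python
# Hypothetical log: search task → clicks each participant needed to complete it.
click_log = {
    "find contact page": [2, 3, 2, 5],
    "locate 2012 report": [6, 7, 5, 8],
}

def mean_clicks(log):
    """Average clicks per task; higher averages signal less visible content."""
    return {task: sum(counts) / len(counts) for task, counts in log.items()}

print(mean_clicks(click_log))
# → {'find contact page': 3.0, 'locate 2012 report': 6.5}
```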
Metadata Parser Analysis Exercise
One of the primary concerns for any student or information professional is the relative importance of metadata in information retrieval. The metadata parser analysis exercise demonstrates that any information encoded in XML as Dublin Core metadata can be harvested, filtered, indexed, and subsequently retrieved.
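Extracting Dublin Core elements from an XML record can be sketched with the standard library; the record below is an invented minimal example using the standard DCMI element namespace:

```python
import xml.etree.ElementTree as ET

# A minimal invented record using Dublin Core elements.
record = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Information Retrieval Notes</dc:title>
  <dc:creator>J. Doe</dc:creator>
  <dc:subject>information retrieval</dc:subject>
</record>"""

DC = "{http://purl.org/dc/elements/1.1/}"

def parse_dublin_core(xml_text):
    """Extract Dublin Core element names and values from an XML record."""
    root = ET.fromstring(xml_text)
    return {el.tag[len(DC):]: el.text for el in root if el.tag.startswith(DC)}

print(parse_dublin_core(record))
# → {'title': 'Information Retrieval Notes', 'creator': 'J. Doe', 'subject': 'information retrieval'}
```

Once parsed into field–value pairs like these, each element can be indexed and filtered independently, which is precisely what makes the metadata retrievable.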