The APLIR project (Applications for Teaching in Information Retrieval) develops tools, simulators, and small applications that promote awareness and learning of various aspects of information retrieval. These tools are used in courses such as Advanced Information Retrieval Techniques, Evaluation of Information Systems, and Web Retrieval Systems, and are freely accessible to any information specialist or professional. Each tool is described below.
- Source Code Removal Exercise
- A tool to demonstrate the functioning of source code removal mechanisms available in many web crawler programs. The user simply needs to input the HTML source code of the webpage they wish to test in order to obtain clean text without tags. This step is crucial for information retrieval on the web, as it precedes normalization, storage, and indexing of web content.
- http://mblazquez.es/blog-ccdoc-recuperacion/programas/depuracion01.php
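The tool's internals are not published, but the tag-stripping step it demonstrates can be sketched with Python's standard-library `html.parser`. The `TagStripper` class and `strip_tags` helper below are illustrative names, not part of the actual exercise, and the sketch is simplified (for instance, it does not discard the contents of `<script>` or `<style>` elements).

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text content of an HTML document, discarding tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.chunks.append(data)

def strip_tags(html):
    """Return the clean text of an HTML fragment, with whitespace collapsed."""
    parser = TagStripper()
    parser.feed(html)
    return " ".join("".join(parser.chunks).split())
```

For example, `strip_tags("<p>Hello <b>world</b></p>")` yields `"Hello world"`, which is the kind of clean text the exercise produces before normalization and indexing.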
- Tokenization Exercise
- Teaches how a web crawler processes the text obtained from the source code removal exercise, word by word, converting each word into its equivalent hexadecimal string representation.
- http://mblazquez.es/blog-ccdoc-recuperacion/programas/depuracion02.php
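A minimal sketch of this word-by-word hexadecimal conversion, assuming whitespace tokenization and UTF-8 encoding (the exercise's exact tokenization rules are not specified). The function name `tokenize_hex` is hypothetical.

```python
def tokenize_hex(text):
    """Split text on whitespace and pair each token with the
    hexadecimal representation of its UTF-8 bytes."""
    return [(tok, tok.encode("utf-8").hex()) for tok in text.split()]
```

For instance, the token `"web"` (bytes 0x77, 0x65, 0x62) maps to the hex string `"776562"`.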
- Character Normalization Exercise
- Character normalization removes or replaces characters, special symbols, and punctuation marks using replacement strings pre-stored and programmed in the web crawler, producing text suitable for indexing and for processing at retrieval time.
- http://mblazquez.es/blog-ccdoc-recuperacion/programas/depuracion03.php
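One common way to implement such replacement rules, sketched here with the standard library rather than the exercise's own replacement tables: lowercase the text, strip diacritics via Unicode decomposition, and drop ASCII punctuation. The function `normalize` is an illustrative stand-in.

```python
import string
import unicodedata

def normalize(text):
    """Lowercase, strip diacritics, and remove ASCII punctuation."""
    text = text.lower()
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Delete ASCII punctuation characters outright.
    return no_accents.translate(str.maketrans("", "", string.punctuation))
```

For example, `normalize("Recuperación, Web!")` yields `"recuperacion web"`, a form ready for indexing.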
- Stop Word Removal Exercise
- Another essential process for optimizing information retrieval is the removal of stop words. The exercise demonstrates how, from a given text, the system eliminates all words identified as stop words for each language. Consistent with Luhn's observations on term frequency distribution, approximately 50% of the terms in the input text are typically removed.
- http://mblazquez.es/blog-ccdoc-recuperacion/programas/depuracion04.php
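The filtering step amounts to a set-membership test per token. The sketch below uses a tiny illustrative Spanish stop list; the exercise's actual per-language lists are far larger.

```python
# A small illustrative sample, not the exercise's real stop list.
STOPWORDS_ES = {"el", "la", "los", "las", "de", "del", "y", "en",
                "que", "un", "una", "a", "por", "con"}

def remove_stopwords(tokens, stopwords=STOPWORDS_ES):
    """Drop every token found in the stop word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]
```

For example, `remove_stopwords(["la", "recuperacion", "de", "informacion"])` keeps only `["recuperacion", "informacion"]`.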
- TF-IDF Weight Calculator
- The weight calculator computes the weight of a given term from three inputs: N (the total number of documents in the collection), DF (the number of documents in which the term appears), and TF (the term frequency in document d). Results are obtained using the default formulation as well as various alternative formulations.
- http://mblazquez.es/blog-ccdoc-recuperacion/programas/ponderacion01.php
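Assuming the default formulation is the classic tf-idf weight (the page does not state which variant it uses), the computation from the three inputs above can be sketched as:

```python
import math

def tfidf(tf, df, n_docs):
    """Classic tf-idf weight: w = TF x IDF, with IDF = log10(N / DF)."""
    return tf * math.log10(n_docs / df)
```

For example, with N = 1000, DF = 10, and TF = 3, the weight is 3 x log10(100) = 6.0.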
- Boolean Model Simulator
- The Boolean simulator, like the other simulators, is designed to test a retrieval algorithm against a fixed, realistic, and reliable document collection that is large enough (20,000 news articles produced by Spanish media sources, obtained from the Resync platform) to demonstrate the effects and outcomes of each retrieval process. The AND, OR, NOT, and XOR query operators can all be tested.
- http://mblazquez.es/blog-ccdoc-recuperacion/programas/modelobooleano.php
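In the Boolean model, each query operator maps directly to a set operation over the posting lists of an inverted index. A minimal sketch with a toy three-document collection (the simulator's real collection is the 20,000-article corpus described above):

```python
def build_index(docs):
    """Build an inverted index: term -> set of doc IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {1: "madrid news sport", 2: "madrid economy", 3: "sport results"}
idx = build_index(docs)

res_and = idx["madrid"] & idx["sport"]   # both terms present
res_or  = idx["madrid"] | idx["sport"]   # either term present
res_not = set(docs) - idx["sport"]       # documents without the term
res_xor = idx["madrid"] ^ idx["sport"]   # exactly one of the terms
```

Here `madrid AND sport` matches only document 1, while `madrid XOR sport` matches documents 2 and 3.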
- Vector Model Simulator
- The vector simulator allows you to define the weights of the terms used in queries and verify the mathematical calculations for each provided result. The formula employed is the cosine similarity, which yields the ranking factor of results with a precision of 10 decimal places.
- http://mblazquez.es/blog-ccdoc-recuperacion/programas/modelovectorial.php
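The cosine similarity the simulator reports can be reproduced as follows; term-weight dictionaries stand in for the query and document vectors, and the result is rounded to the 10 decimal places the tool displays.

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse term-weight vectors,
    rounded to 10 decimal places as in the simulator's output."""
    terms = set(q) | set(d)
    dot = sum(q.get(t, 0.0) * d.get(t, 0.0) for t in terms)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if not norm_q or not norm_d:
        return 0.0
    return round(dot / (norm_q * norm_d), 10)
```

Identical vectors score 1.0; vectors with no shared terms score 0.0, and everything in between provides the ranking factor.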
- Probabilistic Model Simulator
- The probabilistic model employs a default algorithm that initially calculates the weights of query terms using the value of maximum uncertainty. It also includes a relevance feedback mechanism, allowing students to identify which results are relevant for inclusion in query reformulation and, consequently, in the calculation of term weights.
- http://mblazquez.es/blog-ccdoc-recuperacion/programas/modeloprobabilistico.php
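Assuming the default algorithm follows the widely used Robertson-Sparck Jones weighting (the page does not name the exact formula), the sketch below shows how the initial weight under maximum uncertainty and the feedback-updated weight share one smoothed expression. Here N is the collection size, n the number of documents containing the term, R the number of documents judged relevant, and r the number of relevant documents containing the term.

```python
import math

def rsj_weight(N, n, R=0, r=0):
    """Robertson-Sparck Jones term weight with 0.5 smoothing.
    With no relevance feedback (R = r = 0) it reduces to the initial
    maximum-uncertainty estimate; after judging results, R and r
    reweight the term for the reformulated query."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))
```

For example, in a 100-document collection where a term occurs in 10 documents, the initial weight is log(90.5 / 10.5); marking relevant documents that contain the term then raises its weight in the reformulated query.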
- Method for Evaluating a Retrieval System
- The evaluation method for retrieval systems grew out of research presented at the 9th Hispano-Mexican Seminar in 2012 on the development of automated content classification systems, where it proved essential to verify the accuracy of the classification/retrieval algorithm. To that end, an automated evaluation template was designed that lets students judge whether each result obtained for a given thematic category is relevant, marking it with a button that records the relevance of the content. Each click is transmitted to a database system that aggregates the judgments into a comprehensive report on the accuracy of the retrieval algorithm.
- http://mblazquez.es/testbench/evaluacion/prueba1-es/
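The aggregation behind such a report can be sketched as a simple precision computation over the recorded judgments. The function `precision_report` is an illustrative stand-in for whatever the template's database actually computes.

```python
def precision_report(judgments):
    """Summarize a list of relevance judgments (True = relevant)
    into counts and a precision figure for the judged results."""
    relevant = sum(judgments)
    total = len(judgments)
    return {
        "judged": total,
        "relevant": relevant,
        "precision": relevant / total if total else 0.0,
    }
```

For example, four judged results of which three are relevant yield a precision of 0.75 for that thematic category.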
- Web Navigation Usability Test
- Although not a core aspect of information retrieval, web usability is closely related to information searching, as it significantly affects the ease of user navigation. In this tool, user navigation is tracked through a series of questions, each representing a search task that must be completed by clicking on hyperlinks. The number of clicks a user needs to reach the target shows which content is most visible and, overall, how usable the website is.
- http://www.mblazquez.es/blog-ccdoc-arquitectura-informacion/test-usabilidad1/test1.php
- Metadata Parser Analysis Exercise
- One of the primary concerns for any student or information professional is the relative importance of metadata in information retrieval. The metadata parser analysis exercise demonstrates that any information encoded in XML as Dublin Core metadata can be extracted, filtered, indexed, and subsequently retrieved.
- http://www.mblazquez.es/blog_ccdoc-busqueda-internet/programas/parser-metadata.php
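The extraction the exercise performs can be sketched with the standard-library XML parser: walk the document and collect every element in the Dublin Core namespace. The function `parse_dublin_core` is an illustrative name, not the tool's actual parser.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

def parse_dublin_core(xml_text):
    """Collect Dublin Core fields from an XML document into
    a dict mapping field name -> list of values."""
    root = ET.fromstring(xml_text)
    fields = {}
    for el in root.iter():
        # ElementTree exposes namespaced tags as "{namespace}localname".
        if el.tag.startswith("{" + DC_NS + "}"):
            name = el.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((el.text or "").strip())
    return fields
```

For a record containing `<dc:title>` and `<dc:creator>` elements, the parser returns those fields keyed by name, ready for filtering and indexing.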