On Friday, June 5, 2015, the lecture series on Documentation Technologies concluded at the Faculty of Social and Human Sciences of Universidade Nova de Lisboa. The closing session presented the latest developments in document applications and their distribution, webcrawler and search engine development, and a first look at AXYZnews, a system for aggregating syndication channels and monitoring information.


Development of Webcrawlers and Search Engines: Mbot and WauSearch

The webcrawler Mbot is a web crawling and analysis program designed to provide useful data for webometric studies of a specific web domain. Although Mbot's objective is similar to that of programs such as Nutch and Heritrix, its design specifications differ. Mbot was developed to run in an Apache, PHP, and MySQL environment without requiring specialized libraries or development frameworks that would complicate its installation and configuration. It crawls the web across multiple levels of depth (up to a maximum of 10), determined by link analysis cycles. It also distinguishes and organizes the types of content it finds on each web page (documents, images, information, and text) into a database structure optimized for efficient storage and retrieval. Another important feature is that it normalizes, cleans, and indexes the text of the analyzed pages during the crawl itself, so no post-processing of content is needed. Finally, it includes an automatic report generation module whose output can be used for web studies, for exporting data blocks (Big Data), and as a source of information for monitoring and informational surveillance through content aggregation systems such as AXYZnews.
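The crawling behavior described above can be illustrated with a short sketch. This is not Mbot's code (Mbot is written in PHP and its source is not reproduced in this post); it is a minimal Python illustration of a level-by-level crawl capped at 10 levels that normalizes and indexes text while crawling. The `fetch` callable, the function names, and the choice to count the seed as level 0 are all assumptions made for the example.

```python
import re
import unicodedata
from collections import defaultdict

MAX_DEPTH = 10  # the post states a maximum of 10 levels; seed = level 0 is an assumption


def normalize(text):
    """Lowercase, strip accents and punctuation, and split into tokens."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"[a-z0-9]+", no_accents)


def crawl(seed, fetch, max_depth=MAX_DEPTH):
    """Crawl level by level starting from `seed`.

    `fetch(url)` is a hypothetical callable that returns (text, links),
    or None when the page cannot be retrieved.
    Returns (depth_by_url, inverted_index).
    """
    depth_by_url = {seed: 0}
    index = defaultdict(set)
    frontier = [seed]
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            page = fetch(url)
            if page is None:
                continue
            text, links = page
            # Index during the crawl, so no separate post-processing pass is needed.
            for token in normalize(text):
                index[token].add(url)
            # Only queue links that would still fall within the depth limit.
            if depth < max_depth:
                for link in links:
                    if link not in depth_by_url:
                        depth_by_url[link] = depth + 1
                        next_frontier.append(link)
        frontier = next_frontier
        if not frontier:
            break
    return depth_by_url, dict(index)
```

In a test, `fetch` can simply be the `get` method of a dictionary that maps each URL to its text and outgoing links, which keeps the sketch runnable without network access.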


Presentation 1. Development of webcrawler and search engine programs


All these characteristics were already known and can be consulted in detail on the official Mbot website. What is new is the explanation of how the search engine WauSearch works. WauSearch transforms the user's query into a final query model that is sent to the search engines Google, Bing, and Yahoo! to obtain a list of results, which is then processed. In other words, WauSearch automatically downloads the result pages of the major search engines and generates a "seed" that the webcrawler then crawls, analyzing those results in depth and obtaining new ones. All of this amounts to a hack or backdoor that WauSearch uses to direct the webcrawler's analysis, avoiding costs and infrastructure that would otherwise be impossible to sustain.


Figure 1. Operation schema of the WauSearch engine


This process adds information that the search engines had not incorporated, complementing their original results with those generated by the Mbot webcrawler. The WauSearch user thus obtains results from Google, Bing, Yahoo!, and Mbot, without repetitions and ordered by a proprietary ranking method. For these reasons, WauSearch serves as a testing platform for learning from users' search experience: the result ranking algorithms, the webcrawler's crawling methods, and the interface and representation of information can all be modified under the researcher's control.
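The idea of merging several result lists "without repetitions" can be sketched as follows. The post does not describe WauSearch's proprietary ranking, so this example swaps in reciprocal rank fusion, a generic and well-known way of combining ranked lists, purely as a stand-in. The function names, the URL canonicalization rules (ignoring the scheme, a leading `www.`, trailing slashes, and fragments), and the constant `k=60` are assumptions for illustration.

```python
from urllib.parse import urlsplit


def canonical(url):
    """Build a comparison key so that trivially different URLs count as one page.

    Assumed rules: ignore scheme, a leading "www.", trailing slashes, and fragments.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return (host, parts.path.rstrip("/"), parts.query)


def merge_results(result_lists, k=60):
    """Fuse ranked URL lists from several sources into one list without repeats.

    Uses reciprocal rank fusion as a stand-in ranking: each source contributes
    1 / (k + rank) to the score of every page it returns.
    """
    scores = {}
    first_seen = {}
    for urls in result_lists.values():
        seen_in_source = set()
        for rank, url in enumerate(urls, start=1):
            key = canonical(url)
            if key in seen_in_source:
                continue  # count each page once per source
            seen_in_source.add(key)
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank)
            first_seen.setdefault(key, url)
    # sorted() is stable, so ties keep the order in which pages were first seen.
    ordered = sorted(scores, key=lambda key: -scores[key])
    return [first_seen[key] for key in ordered]
```

Pages returned by more than one source accumulate score and rise in the merged list, while pages found only by the crawler still appear, which matches the complementary role the post assigns to Mbot.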


AMPdoc 2.0 Document Application Ecosystem

The portable distribution of AMPdoc 2.0 document applications can be considered a true ecosystem of tools for information professionals who want to automate their information and documentation unit. AMPdoc 2.0 addresses some of the difficulties of selecting the most appropriate applications for the activities carried out by libraries, archives, museums, documentation centers, and similar institutions. It also provides a reliable solution to the problems of web server support and configuration, and bundles the plugins needed to run the included applications, so that their functionality can be tested directly. This spares information professionals from complex installations and the time they would otherwise lose on them.


Presentation 2. AMPdoc 2.0: Automation of Information Units


Finally, all applications integrated into AMPdoc version 2.0 were presented, including the semantic content manager Bedita, the thesaurus and ontology editor TemaTres, the feed aggregator Selfoss, and the SERP and SEO tracking tool Serposcope.



Figure 2. Applications available since version 1


Figure 3. Applications added in version 2


Several important changes were also explained: interface changes, improvements in application accessibility, the ability to disable and uninstall applications to reduce the size of the distribution, and a new direct contact method for reporting issues, requesting assistance, and collaborating with the developer.


AXYZnews. Informational Surveillance System

The AXYZnews program has been the subject of research since December 2014. The talk explained the background of the project, which goes back to an earlier initiative called SYNC2news, published in 2012. SYNC2news aimed to create a news portal through content syndication, similar to Google News, with the distinction of including all news feeds from media outlets in the United States, the United Kingdom, France, Germany, Mexico, and Spain. Although the project ran for several months, the lack of resources and funding to maintain the infrastructure forced its cancellation, or at least the suspension of its development. Then came Law 21/2014, of November 4, amending the consolidated text of the Spanish Intellectual Property Law, which led Google Spain to shut down its news portal in December 2014. This triggered intense debate and controversy on the Web and social media about the restriction of content syndication technologies for redistributing information. More importantly, it meant the closure of applications that allow citizens to cross-check information across different media outlets and so foster critical and constructive thinking. It can be summarized in one phrase: «The limitation or annulment of the right to information»


Presentation 3. AXYZnews News Clipping Program


These reasons, among many others explained in the Prezi presentation, led me to undertake research with a social, democratic, scientific, and technological mission: to develop a content aggregation system that can be used in Spain despite the new Intellectual Property Law. AXYZnews is software designed for media research: it covers publications and their content, and supports informational surveillance through several monitoring and filtering methods. Its design incorporates modules for configuration, status/maintenance, statistics, syndication channel importation, syndication channel editing, processing monitoring, filter editing, content homepage, real-time information, filtered content, search engine, interactive content map, content notebook, and saved news. The operation of the continuous data processing cores was also explained. These cores allow AXYZnews to retrieve all news and content from the syndication channels continuously, efficiently, and without duplication.



Figure 4. Schematic of the processing cores of AXYZnews
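The post does not detail how the processing cores avoid duplicates, so the following is only a minimal sketch of one common approach: fingerprint each feed item and skip fingerprints already recorded. It handles plain RSS 2.0 (no XML namespaces, so Atom feeds would need extra handling). The function names, the in-memory `seen` set, and the choice of `guid`, then `link`, then `title` as the identifier are assumptions, not AXYZnews internals.

```python
import hashlib
import xml.etree.ElementTree as ET


def item_key(item):
    """Fingerprint an RSS <item>: prefer <guid>, then <link>, then <title>."""
    for tag in ("guid", "link", "title"):
        value = (item.findtext(tag) or "").strip()
        if value:
            return hashlib.sha1(value.encode("utf-8")).hexdigest()
    return None  # nothing usable to identify the item


def new_items(rss_xml, seen):
    """Return the titles of items not seen before, recording them in `seen`.

    `seen` is a set of fingerprints shared across feeds and polling cycles;
    a real system would presumably persist it in a database.
    """
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        key = item_key(item)
        if key is None or key in seen:
            continue
        seen.add(key)
        fresh.append((item.findtext("title") or "").strip())
    return fresh
```

Calling `new_items` repeatedly with the same `seen` set, once per channel and per polling cycle, yields each story only the first time it appears, even when two channels syndicate the same link.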


Although AXYZnews has been delayed, a fully functional first version is now complete. The next step is to make it available to the academic and scientific communities and to society at large, so that they can benefit from its capabilities. The different versions of AXYZnews will be presented soon. There will most likely be a blank version (without content), a version dedicated to media outlets in Portugal, and further versions tailored to media outlets in Spain, the United States, Germany, the United Kingdom, France, Mexico, and Brazil. It is also highly probable that specialized versions will be added, focusing on Library and Information Science and on various sectors of Medicine. The official launch is planned to coincide with the final presentation of AXYZnews in Spain, so the release as open-source software is likely to take place in September. Updates on developments over the coming weeks will continue to be posted on mblazquez.es.