Mbot was created in 2010 to address the specific needs of researchers in Documentation (Library and Information Science) who were studying the web from a cybermetric perspective. The tools available at the time, such as Nutch or Heritrix, although professional, proved too complex to configure, and their data were difficult to tabulate. This, combined with the lack of similar initiatives in Spain, led to the development of this first multipurpose web crawler.


A web crawler is a computer program specifically designed to traverse the web, either selectively or comprehensively, by following the hyperlinks found on the analyzed web pages. Mbot follows this classic link-traversal pattern, supplemented by derived functions that facilitate academic, scientific, and professional tasks. For instance, it can analyze the file formats and web pages encountered, identify the most linked pages, examine metadata, meta-tags, and co-links, establish rankings of linked resources and pages, and generate web maps, among many other applications.
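The traversal pattern just described can be sketched in a few lines. The following is a minimal, hypothetical illustration in Python (Mbot itself runs on PHP and MySQL): a breadth-first walk over a simulated web, with `fetch` standing in for a real HTTP request.

```python
from html.parser import HTMLParser
from collections import deque

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, max_depth=2):
    """Breadth-first traversal: visit each URL once, follow links up to max_depth."""
    visited = set()
    queue = deque((url, 0) for url in seeds)
    link_graph = {}                      # page -> outgoing links found on it
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))          # fetch() returns the page's HTML
        link_graph[url] = parser.links
        for target in parser.links:
            if target not in visited:
                queue.append((target, depth + 1))
    return link_graph

# Tiny in-memory "web" standing in for real HTTP fetches.
PAGES = {
    "a": '<a href="b">B</a> <a href="c">C</a>',
    "b": '<a href="c">C</a>',
    "c": '<a href="a">A</a>',
}
graph = crawl(["a"], fetch=lambda url: PAGES.get(url, ""))
```

Swapping the lambda for a real downloader (plus politeness delays and robots.txt checks) turns the sketch into a usable crawler skeleton.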

Mbot is particularly suited for thematic web analysis. This means it is possible to conduct micro-webmetric studies that help understand the composition of the web within a specific field of knowledge, its level of interrelation, which contents are most relevant, and their relative importance compared to other analyzed elements.

During web analysis processes, Mbot automatically organizes and tabulates information, generating a knowledge base in which web page types are distributed and classified according to their formats and references to image, audio, and video files. Furthermore, it stores standard meta-tags and Dublin Core metadata to enable information retrieval and search engine positioning studies. Mbot is also designed to detect syndication channels, semantic networks, and ontologies, allowing it to perform specialized data mining tasks by gathering primary data through in-depth web analysis. It can likewise be configured for massive extraction of email addresses and source code, as well as full-text indexing of the analyzed web pages, a capability that lets it act not only as a bot but as a search engine in its own right. Although Mbot does not aim to compete with major search crawlers and bots, it is an effective alternative for institutions and companies seeking to reduce dependence on third-party search service providers and to keep their queries anonymous, provided these are focused on a specific topic.
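As an illustration of the meta-tag and Dublin Core harvesting described above, the hypothetical sketch below separates standard meta-tags from Dublin Core ones (conventionally prefixed `DC.`) and derives a simple term-frequency count from the keywords tag; Mbot's actual extractor is not documented here, and the sample HTML is invented.

```python
from html.parser import HTMLParser
from collections import Counter

class MetaExtractor(HTMLParser):
    """Separate standard meta-tags from Dublin Core ones (names prefixed 'DC.')."""
    def __init__(self):
        super().__init__()
        self.standard = {}
        self.dublin_core = {}
    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name, content = a.get("name"), a.get("content")
        if not name or content is None:
            return
        if name.lower().startswith("dc."):
            self.dublin_core[name] = content
        else:
            self.standard[name] = content

sample = """
<head>
  <meta name="keywords" content="webcrawler, webometrics, webcrawler">
  <meta name="DC.title" content="Mbot">
  <meta name="DC.creator" content="Example Author">
</head>
"""
parser = MetaExtractor()
parser.feed(sample)
# Simple TF count over the keywords meta-tag, the basis for frequency reports.
tf = Counter(t.strip() for t in parser.standard.get("keywords", "").split(","))
```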

Presentations at scientific forums

Mbot has a long scientific track record that supports the continuous development and improvement of its crawling and analysis techniques. The first public tests of the webcrawler were published in 2010. In 2011, the first significant test was conducted: a comparative analysis of the NASA website against that of the ESA, presented at the X Congress of the Spanish chapter of ISKO. Shortly thereafter, at the X Conference on Systems, Cybernetics and Informatics (CISCI), held in Orlando, Florida, Mbot's ability to integrate with information source management programs such as Cumulus was demonstrated. In 2012, during the I Hispano-Brazilian Seminar on Documentation, a webmetric analysis of Brazilian media outlets (print, radio, and television) was presented, resulting in the first map of Brazil's media web. In 2013, the webcrawler was presented at the XIII Spanish Documentation Days (FESABID), with an analysis of the Spanish university web that produced the first map of the Spanish academic web. Between May and August 2013, Mbot underwent a major modernization that refined its crawling algorithm, substantially enhancing the program's original capacity and precision and achieving an effectiveness of approximately 93.45%.

Versions

4.1 – 2014-01-01 – [In development]

4.0 – 2013-01-01 – Stable version. New crawling system.

3.0 – 2012-01-01 – Improvement of the original crawling system.

2.0 – 2011-01-01 – Incorporates Graphviz graphical representation.

1.0 – 2010-01-01 – Base version of the program. Development initiation.

System Requirements

  • Server stack with Apache 2+, PHP 5+, and MySQL 5+ (a distribution such as AppServ or XAMPP with the basic libraries provides optimal functionality)
  • Cross-platform: Windows, Linux, macOS

Specifications

  • Integrated configuration. All Mbot configuration options are available on a single page, facilitating preparation for multiple tasks and precise tuning. In the program’s configuration section, installation data, the mass email system, execution mode, program execution display interface, analysis levels, content types to be extracted, filters to be applied, and webcrawler control properties can all be modified.
  • Maintenance module. Mbot includes a dedicated module for maintaining database tables, including defragmentation and repair, allowing monitoring of their status.
  • Intelligent seed editing. Like any web crawler, Mbot requires a list of URLs, called seeds, that serve as the starting point of the web analysis. Mbot's seed management module checks for duplicates and organizes the reading and execution order to ensure proper program functionality.
  • Execution control module. Mbot has been designed to operate with multiple execution control modes, allowing users to monitor the progress of the analysis as well as the documents and content formats extracted during the process. It also includes the option to pause the analysis task and resume it later when needed, enabling researchers or specialists to operate on non-dedicated, poorly equipped, or non-professional computing systems.
  • Reporting module. Mbot provides detailed reports on any quantifiable aspect of the analysis, including:
  • General level-by-level data analysis.
  • First- and second-level domain analysis.
  • General file format analysis.
  • General internal and external link analysis.
  • Domain-by-domain analysis of internal and external links.
  • Ranking of most linked web pages.
  • Ranking of most linked websites.
  • Meta-tags by domain.
  • Meta-tag text by domain and page.
  • TF frequency analysis of meta-tags.
  • Metadata by domain.
  • Metadata text by domain and page.
  • TF frequency analysis of metadata.
  • Ranking of websites with the most web pages.
  • Ranking of websites with the most content and documents.
  • Ranking of websites with the most syndication channels.
  • Syndication channel export.
  • Analysis of the web's macrostructure using the components IN, MAIN, OUT, T-IN, T-OUT, TUNNEL.
  • Analysis of co-links.
  • Export of a DOT-type graphic file to generate a graphical diagram of the analyzed web.
  • Search module. Mbot features a fully integrated retrieval system designed to perform queries on the analyzed content, thanks to its patented indexing method and the text purification process developed specifically for the program.
  • Mass email module. Given the webcrawler’s capability to collect email addresses, a complete tool has been developed for sending mass emails, allowing full editing of messages in plain text and HTML, inclusion of attached documents, as well as anonymous sending capability.
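The macrostructure report in the list above (IN, MAIN, OUT, T-IN, T-OUT, TUNNEL) follows the classic bow-tie model of the web. A compact way to compute the three principal components, assuming the crawl results are available as a page-to-links mapping (an assumption, since Mbot's internal representation is not documented), is:

```python
from collections import deque

def reachable(graph, start, reverse=False):
    """Set of nodes reachable from start, following edges (or reversed edges)."""
    if reverse:
        rev = {n: [] for n in graph}
        for src, targets in graph.items():
            for t in targets:
                rev.setdefault(t, []).append(src)
        graph = rev
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(graph):
    """Split nodes into the MAIN strongly connected core, IN, OUT, and the rest."""
    nodes = set(graph) | {t for ts in graph.values() for t in ts}
    # The strongly connected component of n is the set of nodes n can reach
    # that can also reach n back; MAIN is the largest such component.
    sccs = [reachable(graph, n) & reachable(graph, n, reverse=True) for n in nodes]
    main = max(sccs, key=len)
    core = next(iter(main))
    out = reachable(graph, core) - main                  # reachable from MAIN
    into = reachable(graph, core, reverse=True) - main   # pages that reach MAIN
    other = nodes - main - out - into                    # tendrils, tunnels, islands
    return {"MAIN": main, "IN": into, "OUT": out, "OTHER": other}

# A tiny link graph: in1 -> {a <-> b} -> out1
components = bow_tie({"in1": ["a"], "a": ["b"], "b": ["a", "out1"], "out1": []})
```

Distinguishing T-IN, T-OUT, and TUNNEL within the remaining nodes requires a further reachability pass over the IN and OUT sets, omitted here for brevity.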
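The DOT export mentioned in the list above can be as simple as serializing the collected links into Graphviz's plain-text digraph format, assuming the links are available as a page-to-targets mapping (a hypothetical sketch, not Mbot's own exporter):

```python
def to_dot(link_graph):
    """Serialize a page -> links mapping as a Graphviz DOT digraph."""
    lines = ["digraph web {"]
    for page, targets in sorted(link_graph.items()):
        for target in sorted(targets):
            lines.append('  "%s" -> "%s";' % (page, target))
    lines.append("}")
    return "\n".join(lines)

dot = to_dot({"a": ["b", "c"], "b": ["c"]})
```

Feeding the resulting file to Graphviz (e.g. `dot -Tpng web.dot -o web.png`) produces the kind of graphical diagram of the analyzed web described above.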

Screenshots

Mbot webcrawler homepage

Entry page

Unified configuration module

Execution in text mode

Execution in plus mode

Execution in text links mode

Execution in hypertext links mode

Execution in icon mode

Periodic table of elements and file formats analyzed by Mbot

Videos

Demonstration of Mbot 3.0 at Fesabid 2013

This video shows version 3.0 of the Mbot webcrawler performing an analysis of Spanish university websites. The monitoring method can be observed in real time, displaying all collected content along with its type: PDF files are identifiable by their icon, Dublin Core metadata by its logo, MS Office documents by theirs, and so on. The speed of the analysis and its impact on the MySQL database where the information is stored, configured for cybermetric analysis, are also evident.



Web Map of the Spanish University System

The following video shows the web map of the Spanish university system, derived from a webmetric analysis of 147 university websites at three levels of depth. This map is the result of the fourth test conducted with the Mbot webcrawling tool, presented at Fesabid 2013.



A glimpse of Mbot 4.0

Version 4 of the Mbot webcrawler represents a substantial advancement over previous web analysis processes, making it a more precise tool. Moreover, Mbot 4 includes new functionalities for bulk email sending and additional reports for webmetric analyses.



References

  • BLÁZQUEZ OCHANDO, M. 2010. [eprint]. First tests of the Mbot webcrawler. Available at: http://www.mblazquez.es/documents/articulo-pruebas1-mbot.html
  • BLÁZQUEZ OCHANDO, M.; SERRANO MASCARAQUE, E. 2011. [Paper]. Web analysis and usability: functional testing of the Mbot webcrawler. In: X Congress of the Spanish Chapter of ISKO (La Coruña, June 30 – July 1). Available at: http://eprints.rclis.org/19104/
  • BLÁZQUEZ OCHANDO, M.; SERRANO MASCARAQUE, E. 2011. [Paper]. Integration of webcrawler technology into information source management systems: development of the Cumulus2 application. In: Tenth Iberoamerican Conference on Systems, Cybernetics and Informatics CISCI (Orlando, July 19–22). Vol. 3, pp. 39–44. Available at: http://eprints.rclis.org/19105/
  • BLÁZQUEZ OCHANDO, M. 2012. [Paper]. Webmetric analysis of Brazilian media: press, radio, and television. In: I Hispano-Brazilian Seminar on Library and Information Science (Madrid, November 28–30). Available at: http://eprints.rclis.org/19033/
  • BLÁZQUEZ OCHANDO, M. 2013. [Paper]. Technological and documentary development of the Mbot webcrawler: web analysis testing of Spanish universities. In: XIII Spanish Documentation Days, Fesabid (Toledo, May 21–24).

Using Mbot

Mbot enables the development of complete or customized information collection processes, starting from a collection of links or URLs referred to as "seeds". Multiple web analysis processes can thus be carried out, tailored to each user, taking the following aspects into account:

  • The intended use of the collected information: scientific research, commercial data mining, or advertising.
  • The reports selected by the user and their processing.
  • The type of data or information that must be extracted to produce those reports.
  • The number of links included in the seed.
  • The number of pages analyzed.
  • The volume of data and information collected.
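Tying these aspects back to the seed handling described in the specifications (duplicate checking and ordered execution), a minimal normalization pass over a raw seed list might look like the following; the exact rules Mbot applies are assumptions here, for illustration only.

```python
def prepare_seeds(raw_urls):
    """Normalize and deduplicate a raw seed list, preserving input order."""
    seen, seeds = set(), []
    for url in raw_urls:
        norm = url.strip().lower().rstrip("/")   # assumed normalization rules
        if norm and norm not in seen:
            seen.add(norm)
            seeds.append(norm)
    return seeds

seeds = prepare_seeds(["http://Example.org/", " http://example.org", "http://other.net"])
```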