Mbot was created in 2010 to address the specific needs of researchers in Documentation (Library and Information Science) who were studying the web from a cybermetric perspective. The tools available at the time, such as Nutch or Heritrix, although professional, proved too complex to configure, and their data were difficult to tabulate. This, combined with the lack of similar initiatives in Spain, led to the development of this first multipurpose web crawler.


A web crawler is a computer program specifically designed to traverse the web, either selectively or comprehensively, by following the hyperlinks found on the analyzed web pages. Mbot follows this classic link-traversal pattern, supplemented by derived functions that facilitate academic, scientific, and professional tasks. For instance, it can analyze the file formats and web pages encountered, identify the most linked pages, examine metadata, meta-tags, and co-links, establish rankings of linked resources and pages, and generate web maps, among many other applications.
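The traversal pattern just described can be sketched in a few lines. The following is a minimal, hypothetical illustration in Python (Mbot itself runs on PHP and MySQL): a breadth-first walk over a simulated web, with `fetch` standing in for a real HTTP request.

```python
from html.parser import HTMLParser
from collections import deque

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, max_depth=2):
    """Breadth-first traversal: visit each URL once, follow links up to max_depth."""
    visited = set()
    queue = deque((url, 0) for url in seeds)
    link_graph = {}                      # page -> outgoing links found on it
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))          # fetch() returns the page's HTML
        link_graph[url] = parser.links
        for target in parser.links:
            if target not in visited:
                queue.append((target, depth + 1))
    return link_graph

# Tiny in-memory "web" standing in for real HTTP fetches.
PAGES = {
    "a": '<a href="b">B</a> <a href="c">C</a>',
    "b": '<a href="c">C</a>',
    "c": '<a href="a">A</a>',
}
graph = crawl(["a"], fetch=lambda url: PAGES.get(url, ""))
```

Swapping the lambda for a real downloader (plus politeness delays and robots.txt checks) turns the sketch into a usable crawler skeleton.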

Mbot is particularly suited for thematic web analysis. This means it is possible to conduct micro-webmetric studies that help understand the composition of the web within a specific field of knowledge, its level of interrelation, which contents are most relevant, and their relative importance compared to other analyzed elements.

During web analysis processes, Mbot automatically organizes and tabulates information, generating a knowledge base in which web page types are distributed and classified according to their formats and references to image, audio, and video files. Furthermore, it stores standard meta-tags and Dublin Core metadata to enable information retrieval and search engine positioning studies. Mbot is also designed to detect syndication channels, semantic networks, and ontologies, allowing it to perform specialized data mining tasks by gathering primary data through in-depth web analysis. It can likewise be configured for massive extraction of email addresses and source code, as well as full-text indexing of the analyzed web pages, a capability that lets it act not only as a bot but as a search engine in its own right. Although Mbot does not aim to compete with major search crawlers and bots, it is an effective alternative for institutions and companies seeking to reduce dependence on third-party search service providers and to keep their queries anonymous, provided these are focused on a specific topic.
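As an illustration of the meta-tag and Dublin Core harvesting described above, the hypothetical sketch below separates standard meta-tags from Dublin Core ones (conventionally prefixed `DC.`) and derives a simple term-frequency count from the keywords tag; Mbot's actual extractor is not documented here, and the sample HTML is invented.

```python
from html.parser import HTMLParser
from collections import Counter

class MetaExtractor(HTMLParser):
    """Separate standard meta-tags from Dublin Core ones (names prefixed 'DC.')."""
    def __init__(self):
        super().__init__()
        self.standard = {}
        self.dublin_core = {}
    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name, content = a.get("name"), a.get("content")
        if not name or content is None:
            return
        if name.lower().startswith("dc."):
            self.dublin_core[name] = content
        else:
            self.standard[name] = content

sample = """
<head>
  <meta name="keywords" content="webcrawler, webometrics, webcrawler">
  <meta name="DC.title" content="Mbot">
  <meta name="DC.creator" content="Example Author">
</head>
"""
parser = MetaExtractor()
parser.feed(sample)
# Simple TF count over the keywords meta-tag, the basis for frequency reports.
tf = Counter(t.strip() for t in parser.standard.get("keywords", "").split(","))
```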

Presentations at scientific forums

Mbot has a long scientific track record that supports the continuous development and improvement of its crawling and analysis techniques. The first public tests of the webcrawler were published in 2010. In 2011, the first significant test was conducted: a comparative analysis of the NASA website against that of the ESA, presented at the X Congress of the Spanish chapter of ISKO. Shortly thereafter, at the X Conference on Systems, Cybernetics and Informatics (CISCI), held in Orlando, Florida, Mbot's ability to integrate with information source management programs such as Cumulus was demonstrated. In 2012, during the I Hispano-Brazilian Seminar on Documentation, a webmetric analysis of Brazilian media outlets (print, radio, and television) was presented, resulting in the first map of Brazil's media web. In 2013, the webcrawler was presented at the XIII Spanish Documentation Days (FESABID), with an analysis of the Spanish university web that produced the first map of the Spanish academic web. Between May and August 2013, Mbot underwent a major modernization that refined its crawling algorithm, substantially enhancing the program's original capacity and precision and achieving an effectiveness of approximately 93.45%.

Versions

4.1 – 2014-01-01 – [In development]

4.0 – 2013-01-01 – Stable version. New crawling system.

3.0 – 2012-01-01 – Improvement of the original crawling system.

2.0 – 2011-01-01 – Incorporates Graphviz graphical representation.

1.0 – 2010-01-01 – Base version of the program. Development initiation.

System Requirements

  • Server stack with Apache 2+, PHP 5+, and MySQL 5+ (a distribution such as AppServ or XAMPP with the basic libraries provides optimal functionality)
  • Cross-platform: Windows, Linux, macOS

Specifications

  • Integrated configuration. All Mbot configuration options are available on a single page, facilitating preparation for multiple tasks and precise tuning. In the program’s configuration section, installation data, the mass email system, execution mode, program execution display interface, analysis levels, content types to be extracted, filters to be applied, and webcrawler control properties can all be modified.
  • Maintenance module. Mbot includes a dedicated module for maintaining database tables, including defragmentation and repair, allowing monitoring of their status.
  • Intelligent seed editing. Like any web crawler, Mbot requires a list of URLs, called seeds, that serve as the starting point of the web analysis. Mbot's seed management module checks for duplicates and organizes the reading and execution order to ensure proper program functionality.
  • Execution control module. Mbot has been designed to operate with multiple execution control modes, allowing users to monitor the progress of the analysis as well as the documents and content formats extracted during the process. It also includes the option to pause the analysis task and resume it later when needed, enabling researchers or specialists to operate on non-dedicated, poorly equipped, or non-professional computing systems.
  • Reporting module. Mbot provides detailed reports on any quantifiable aspect of the analysis, including:
  • General level-by-level data analysis.
  • First- and second-level domain analysis.
  • General file format analysis.
  • General internal and external link analysis.
  • Domain-by-domain analysis of internal and external links.
  • Ranking of most linked web pages.
  • Ranking of most linked websites.
  • Meta-tags by domain.
  • Meta-tag text by domain and page.
  • TF frequency analysis of meta-tags.
  • Metadata by domain.
  • Metadata text by domain and page.
  • TF frequency analysis of metadata.
  • Ranking of websites with the most web pages.
  • Ranking of websites with the most content and documents.
  • Ranking of websites with the most syndication channels.
  • Syndication channel export.
  • Analysis of the web's macrostructure using the components IN, MAIN, OUT, T-IN, T-OUT, TUNNEL.
  • Analysis of co-links.
  • Export of a DOT-type graphic file to generate a graphical diagram of the analyzed web.
  • Search module. Mbot features a fully integrated retrieval system designed to perform queries on the analyzed content, thanks to its patented indexing method and the text purification process developed specifically for the program.
  • Mass email module. Given the webcrawler’s capability to collect email addresses, a complete tool has been developed for sending mass emails, allowing full editing of messages in plain text and HTML, inclusion of attached documents, as well as anonymous sending capability.
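The macrostructure report in the list above (IN, MAIN, OUT, T-IN, T-OUT, TUNNEL) follows the classic bow-tie model of the web. A compact way to compute the three principal components, assuming the crawl results are available as a page-to-links mapping (an assumption, since Mbot's internal representation is not documented), is:

```python
from collections import deque

def reachable(graph, start, reverse=False):
    """Set of nodes reachable from start, following edges (or reversed edges)."""
    if reverse:
        rev = {n: [] for n in graph}
        for src, targets in graph.items():
            for t in targets:
                rev.setdefault(t, []).append(src)
        graph = rev
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(graph):
    """Split nodes into the MAIN strongly connected core, IN, OUT, and the rest."""
    nodes = set(graph) | {t for ts in graph.values() for t in ts}
    # The strongly connected component of n is the set of nodes n can reach
    # that can also reach n back; MAIN is the largest such component.
    sccs = [reachable(graph, n) & reachable(graph, n, reverse=True) for n in nodes]
    main = max(sccs, key=len)
    core = next(iter(main))
    out = reachable(graph, core) - main                  # reachable from MAIN
    into = reachable(graph, core, reverse=True) - main   # pages that reach MAIN
    other = nodes - main - out - into                    # tendrils, tunnels, islands
    return {"MAIN": main, "IN": into, "OUT": out, "OTHER": other}

# A tiny link graph: in1 -> {a <-> b} -> out1
components = bow_tie({"in1": ["a"], "a": ["b"], "b": ["a", "out1"], "out1": []})
```

Distinguishing T-IN, T-OUT, and TUNNEL within the remaining nodes requires a further reachability pass over the IN and OUT sets, omitted here for brevity.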
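The DOT export mentioned in the list above can be as simple as serializing the collected links into Graphviz's plain-text digraph format, assuming the links are available as a page-to-targets mapping (a hypothetical sketch, not Mbot's own exporter):

```python
def to_dot(link_graph):
    """Serialize a page -> links mapping as a Graphviz DOT digraph."""
    lines = ["digraph web {"]
    for page, targets in sorted(link_graph.items()):
        for target in sorted(targets):
            lines.append('  "%s" -> "%s";' % (page, target))
    lines.append("}")
    return "\n".join(lines)

dot = to_dot({"a": ["b", "c"], "b": ["c"]})
```

Feeding the resulting file to Graphviz (e.g. `dot -Tpng web.dot -o web.png`) produces the kind of graphical diagram of the analyzed web described above.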

Screenshots

Mbot webcrawler homepage

Entry page

Unified configuration module

Execution in text mode

Execution in plus mode

Execution in text links mode

Execution in hypertext links mode

Execution in icon mode

Periodic table of elements and file formats analyzed by Mbot

Videos

Demonstration of Mbot 3.0 at Fesabid 2013

This video shows version 3.0 of the Mbot webcrawler performing an analysis of Spanish university websites. The monitoring method can be observed in real time, displaying all collected content along with its type: PDF files are identifiable by their icon, Dublin Core metadata by its logo, MS Office documents by theirs, and so on. The speed of the analysis and its impact on the MySQL database where the information is stored, configured for cybermetric analysis, are also evident.



Web Map of the Spanish University System

The following video shows the web map of the Spanish university system, derived from a webmetric analysis of 147 university websites at three levels of depth. This map is the result of the fourth test conducted with the Mbot webcrawling tool, presented at Fesabid 2013.



A glimpse of Mbot 4.0

Version 4 of the Mbot webcrawler represents a substantial advancement over previous web analysis processes, making it a more precise tool. Moreover, Mbot 4 includes new functionalities for bulk email sending and additional reports for webmetric analyses.



References

  • BLÁZQUEZ OCHANDO, M. 2010. [eprint]. First tests of the Mbot webcrawler. Available at: http://www.mblazquez.es/documents/articulo-pruebas1-mbot.html
  • BLÁZQUEZ OCHANDO, M.; SERRANO MASCARAQUE, E. 2011. [Paper]. Web analysis and usability: functional testing of the Mbot webcrawler. In: X Congress of the Spanish Chapter of ISKO (La Coruña, June 30 – July 1). Available at: http://eprints.rclis.org/19104/
  • BLÁZQUEZ OCHANDO, M.; SERRANO MASCARAQUE, E. 2011. [Paper]. Integration of webcrawler technology into information source management systems: development of the Cumulus2 application. In: Tenth Iberoamerican Conference on Systems, Cybernetics and Informatics CISCI (Orlando, July 19–22). Vol. 3, pp. 39–44. Available at: http://eprints.rclis.org/19105/
  • BLÁZQUEZ OCHANDO, M. 2012. [Paper]. Webmetric analysis of Brazilian media: press, radio, and television. In: I Hispano-Brazilian Seminar on Library and Information Science (Madrid, November 28–30). Available at: http://eprints.rclis.org/19033/
  • BLÁZQUEZ OCHANDO, M. 2013. [Paper]. Technological and documentary development of the Mbot webcrawler: web analysis testing of Spanish universities. In: XIII Spanish Documentation Days, Fesabid (Toledo, May 21–24).

Using Mbot

Mbot enables the development of complete or customized information collection processes, starting from a collection of links or URLs referred to as "seeds". Multiple web analysis processes can thus be carried out, tailored to each user, taking the following aspects into account:

  • The intended use of the collected information: scientific research, commercial data mining, or advertising.
  • The reports selected by the user and their processing.
  • The type of data or information that must be extracted to produce those reports.
  • The number of links included in the seed.
  • The number of pages analyzed.
  • The volume of data and information collected.
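Tying these aspects back to the seed handling described in the specifications (duplicate checking and ordered execution), a minimal normalization pass over a raw seed list might look like the following; the exact rules Mbot applies are assumptions here, for illustration only.

```python
def prepare_seeds(raw_urls):
    """Normalize and deduplicate a raw seed list, preserving input order."""
    seen, seeds = set(), []
    for url in raw_urls:
        norm = url.strip().lower().rstrip("/")   # assumed normalization rules
        if norm and norm not in seen:
            seen.add(norm)
            seeds.append(norm)
    return seeds

seeds = prepare_seeds(["http://Example.org/", " http://example.org", "http://other.net"])
```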