The creation of expert knowledge bases depends largely on knowledge of the available information resources. However, the documentalist may overlook key information sources, given the vast documentary spectrum of the web. The tools and techniques available for discovering new content (such as web crawlers and data mining) do not always provide a comprehensive overview. For this reason, the scientific community is increasingly focusing on the major search engines. The case under consideration concerns Google and Google Scholar, given their relevance to webometric and scientometric research and to the generation of datasets and document collections from which specialized Big Data can be built.
If it were possible to track and index Google’s search results, researchers could build knowledge bases automatically, downloading only the strategic resources and contents that meet their specific information needs and leveraging the querying power of the search engine. It would also be feasible to compile document collections of patents, databases, office documents, texts, and specialized multimedia resources. To a large extent, the classification of the retrieved information would be determined by the queries submitted to the search engine, providing an excellent starting point for organizing knowledge. Moreover, researchers could incorporate into their studies sectors of the web entirely unfamiliar to them. In the productive domain, this would have significant implications for the development of new specialized search engines, whose development costs would be substantially lower because they would rely on the leading search engine rather than on proprietary server infrastructure. Furthermore, companies offering product and service comparisons (e.g., insurance, flights, hotels) could expand their coverage to compare the search engine’s own content rather than a selected set of websites. Even so, many applications of web scraping on search engines have probably yet to be invented.
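By way of illustration only, the following Python sketch shows how such query-driven organization might look: result URLs are grouped first by the query that retrieved them and then by a document type inferred from the file extension. The queries, URLs, and type mapping are invented for the example and do not come from any real system.

```python
from collections import defaultdict
from urllib.parse import urlparse
import os

# Hypothetical SERP output: each query maps to the URLs it retrieved.
results_by_query = {
    "magnetic levitation patents filetype:pdf": [
        "https://example.org/patents/maglev-2001.pdf",
        "https://example.org/reports/maglev-overview.html",
    ],
    "scientometrics datasets filetype:xls": [
        "https://example.org/data/citations-2015.xls",
    ],
}

def classify(url: str) -> str:
    """Infer a coarse document type from the URL's file extension."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return {".pdf": "patents and reports", ".xls": "datasets",
            ".doc": "office documents"}.get(ext, "web pages")

# The query itself becomes the first level of the knowledge organization.
collection = defaultdict(lambda: defaultdict(list))
for query, urls in results_by_query.items():
    for url in urls:
        collection[query][classify(url)].append(url)

for query, groups in collection.items():
    print(query)
    for doc_type, urls in groups.items():
        print(f"  {doc_type}: {urls}")
```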
For all these reasons, it is evident that the technique of “web scraping” holds great relevance for the future of Documentation, both because it enables information professionals to manage web content directly and because of its socio-economic dimension, which contributes to the development and creation of new enterprises.
To demonstrate that it is possible to track the pages and contents of Google search results and leverage their information, a web scraping experiment was developed with the objective of retrieving the contents of one or more result pages. In addition, the web scraping program was connected to a custom webcrawler system based on Mbot, which enables re-crawling and indexing of the contents selected by the user. In this way, the web scraping program applied to the Google search engine itself becomes another search engine that expands the information it provides, further enriching the original content of each webpage and website. The approach can be likened to selective web crawling driven by the results the user considers relevant.
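The sketch below is not the code of the experiment itself; it only illustrates, in Python, the two stages just described: retrieving a results page for a query and re-crawling the URLs the user selects. The use of the `requests` and `beautifulsoup4` libraries, the SERP URL pattern, and the link-filtering heuristic are assumptions made for the sketch; Google’s markup changes frequently and its terms of service restrict automated querying, issues the sketch deliberately ignores.

```python
# Illustrative sketch only, not the original Google2Down/Mbot code.
# Assumes the third-party libraries `requests` and `beautifulsoup4`;
# the SERP URL pattern and the link filter are assumptions and may break.
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (research prototype)"}

def scrape_serp(query: str, num: int = 10) -> list[str]:
    """Stage 1: retrieve one results page and extract candidate result URLs."""
    resp = requests.get("https://www.google.com/search",
                        params={"q": query, "num": num},
                        headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    urls = []
    for anchor in soup.select("a"):             # deliberately generic selector
        href = anchor.get("href", "")
        if href.startswith("http") and "google." not in href:
            urls.append(href)
    return urls

def recrawl(urls: list[str]) -> dict[str, str]:
    """Stage 2: re-crawl the user-selected URLs and index their titles."""
    index = {}
    for url in urls:
        page = requests.get(url, headers=HEADERS, timeout=10)
        title = BeautifulSoup(page.text, "html.parser").title
        index[url] = title.get_text(strip=True) if title else ""
    return index

if __name__ == "__main__":
    candidates = scrape_serp("webometrics scientometrics")
    selected = candidates[:3]                   # the user's "relevant results"
    print(recrawl(selected))
```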
The experiment was presented at the XIII Hispano-Mexican Seminar on Library and Information Science, held at the Institute of Bibliological Research of UNAM in Mexico City, and has also been covered by the specialized blog BIBLIORed 3.0.
▶ Google Scraping Experiment
https://mblazquez.es/lab/google2down/
Figure 1. The SERP (Search Engine Results Page) content is retrieved by a web scraping program specifically designed for Google and Google Scholar
Figure 2. The results can be analyzed using a webcrawler derived from Mbot that recognizes headings, paragraphs, links, text, and other elements on each webpage selected by the user
Figure 3. The program has been designed to work with Google Scholar, given its potential for conducting scientometric studies
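As a rough indication of the kind of element recognition Figure 2 refers to, the following Python sketch separates a page’s headings, paragraphs, links, and text with a standard HTML parser (`requests` plus `beautifulsoup4`). This tooling is an assumption made for the example and does not describe how the Mbot-based crawler is actually implemented.

```python
# Sketch of element-level extraction, assuming `requests` and `beautifulsoup4`;
# the original Mbot-based crawler is not shown here.
import requests
from bs4 import BeautifulSoup

def extract_elements(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        # Headings at every level, in document order.
        "headings": [h.get_text(strip=True)
                     for h in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])],
        # Paragraph text.
        "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
        # Outgoing links (href attributes).
        "links": [a["href"] for a in soup.find_all("a", href=True)],
        # Full visible text of the page, whitespace-normalized.
        "text": soup.get_text(" ", strip=True),
    }

if __name__ == "__main__":
    # Example page; any user-selected result URL would work the same way.
    print(extract_elements("https://example.org/")["headings"])
```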