Is it possible to impersonate Google’s website? Is the information held by search engines compromised? Is data mining feasible on Google? What are the limits of these practices? These are some of the questions that the title of this article might suggest.
Web spoofing is the technique of impersonating a web page in order to obtain information from its users or from the impersonated website itself. It generally involves downloading the website’s source code, modifying it, and hosting the modified copy as a phantom webpage that mimics the original. Despite the dangers posed by malicious use, spoofing can also be employed for scientific purposes, as is the case here.
Imagine being able to query Google en masse to generate our own map of the web, or information professionals building their own databases from information retrieved from specialized sources and resources. Without appropriate tools, carrying out such plans would likely require many years of effort. The techniques used in web spoofing, however, could enable information professionals to work with big data in earnest. Would it be possible to spoof Google in order to query its content and obtain the information we need in a massive, filtered manner? The experiment conducted on Google demonstrates that it is.
Web Scraping and Web Crawlers
Before explaining the Google web spoofing experiment, it is necessary to understand the role played by “web scraping” and “web crawlers.” Web scraping is the technique used to download information from a website automatically. The information retrieved varies in nature and type: it may include links, text, headings, or even the complete source code. The process is equivalent to what a knowledgeable user could do by hand, except that it is programmed in advance and executed automatically. Web crawling programs, or web crawlers, use web scraping to obtain links, build a map of the Web, index its content, and later retrieve it. For this reason, knowledge of these techniques and systems is not only strategic for impersonating a website (by modifying the downloaded content) but also essential for building knowledge bases in the hands of documentation professionals.
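The relationship between scraping and crawling described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code used in the experiment: the names `LinkExtractor`, `scrape_links`, and `crawl` are hypothetical, and the `PAGES` dictionary is an invented in-memory site that stands in for real HTTP responses so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def scrape_links(html, base_url):
    """Scraping step: return the absolute URL of every link found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(start_url, fetch, max_pages=10):
    """Crawling step: breadth-first walk that builds a map of page -> links.

    fetch is any callable that returns the HTML for a URL, or None.
    """
    link_map = {}
    queue = deque([start_url])
    while queue and len(link_map) < max_pages:
        url = queue.popleft()
        if url in link_map:
            continue  # already visited
        html = fetch(url)
        if html is None:
            continue  # page not available
        link_map[url] = scrape_links(html, url)
        queue.extend(link_map[url])
    return link_map


# Hypothetical in-memory "site" standing in for real HTTP responses.
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">Home</a>',
    "http://example.org/b.html": "<p>No links here</p>",
}

site_map = crawl("http://example.org/", PAGES.get)
for page, links in site_map.items():
    print(page, "->", links)
```

Passing `fetch` as a parameter keeps the scraping and crawling logic separate from the download step; a real crawler would replace `PAGES.get` with a function that performs HTTP requests and respects the target site's terms of use and robots.txt.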
The Google Web Spoofing Experiment
To demonstrate that it is possible to manipulate Google’s information, an impersonation experiment was designed. It consists of reproducing the search engine’s homepage and at least the first page of results returned when a user submits any query. The outcome, in terms of appearance, can be observed and compared in Figure 1.
Fig. 1. Google vs. Google: who is who?
Despite minimal differences, the two Google homepages are very similar, and it is difficult to determine which is the original website. In fact, both designs are real and genuine. The screen on the right displays the “Google Toolbar,” a black bar providing shortcuts to major applications and services. It also shows the Google Chrome browser download message, advanced search options, and language tools. The design on the left lacks some of these elements but maintains a similar appearance. It corresponds to the current view of the search engine in most web browsers. The question therefore arises: which is the original version of Google?
Original Version of Google
The latest update to Google’s design removes the shortcut bar and simplifies access to applications and services, as shown in Figure 2. The microphone icon for voice queries is also visible in the search box. These features make it possible to distinguish the original version of Google from any other. This can be verified by accessing the website http://www.google.es [Accessed on 2016-01-10].
Fig. 2. Original Google website
Impostor Version of Google
Figure 3 shows Google’s alternative design. This appearance is observed when the search engine is opened for the first time in a web browser, or when the URL parameter “noj=1” is set, which can be verified at the URL “https://www.google.es/?noj=1”. Apart from these cases, the alternative design is also triggered when Google’s source code is downloaded and served from a different domain or hosting platform. This occurs because not all styles and functions are correctly linked under a different URL, so Google falls back to its secondary design. The same scenario is reproduced in the Internet Archive through its “Wayback Machine” initiative, which collects copies of the most important web pages. One of the most recent copies of Google held there has the appearance shown in the following screenshot.
Fig. 3. Impersonated Google website
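The linking problem mentioned above for copies served from another domain can be illustrated with standard URL resolution. The sketch below is only an illustration of the general mechanism and does not analyze Google’s actual markup: the resource references and both host names (`www.example.com`, `mirror.example.net`) are hypothetical, and `resolve_all` is a helper name introduced here.

```python
from urllib.parse import urljoin

# Hypothetical resource references as they might appear in a page's source.
references = [
    "/images/logo.png",                      # root-relative
    "styles/main.css",                       # document-relative
    "//cdn.example.com/app.js",              # protocol-relative
    "https://www.example.com/fonts/a.woff",  # absolute
]


def resolve_all(page_url, refs):
    """Resolve each reference against page_url, as a browser would."""
    return [urljoin(page_url, ref) for ref in refs]


# The same source code, served from its original host and from a copy.
original = resolve_all("https://www.example.com/", references)
copy = resolve_all("http://mirror.example.net/copy/index.html", references)

for ref, before, after in zip(references, original, copy):
    status = "same" if before == after else "DIFFERENT"
    print(f"{ref:40} {status}: {after}")
```

Only the absolute reference resolves to the same address in both cases. The relative ones point at the copy’s host, where the stylesheets and scripts may not exist, which is one plausible reason a downloaded page renders differently from the original.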
The Internet Archive uses the Heritrix web crawler to crawl Google’s website daily, among other sites, and to download its source code in order to preserve the digital memory of its homepage, but not its content. This explains why its representation differs from that of the original version. The same effect is illustrated by the Google spoofing experiment, which can be tested at the following web address: http://www.google.es [Created on 2015-12-22].
Fig. 4. The spoofing technique works with original Google queries and results; only the Complutense University logo reveals that the Google website is not authentic
When a search query is performed, the search engine results page appears, displaying content identical to that provided by the original Google page, as can be observed in the following video.
Google Web Spoofing Experiment
Test: http://mblazquez.es/wp-content/uploads/google.php
Video: https://youtu.be/bdwGgOYDCFo
It can be concluded that it is possible to impersonate Google’s website and, consequently, to access its content through its search result pages. This demonstrates that even advanced information systems can be vulnerable to this type of threat. It also highlights a unique opportunity to advance documentary work: leveraging the primary reference knowledge base to organize web information, developing new information services based on content aggregation, and exploring many other possibilities yet to be discovered and investigated.