Is it possible to impersonate Google’s website? Is information in search engines compromised? Is data mining possible on Google? What are the limits of these practices? These are some of the questions that the title of this article may raise. Web spoofing is the technique used to impersonate a web page in order to obtain information from its users or from the impersonated site. It generally involves downloading the website’s source code, modifying it, and republishing it as a phantom page that mimics the original. Beyond the dangers posed by its malicious use, spoofing can also serve scientific purposes, as is the case here. Imagine being able to query Google en masse to generate our own map of the web. Consider the idea that information professionals could build their own databases from information retrieved from specialized sources and resources. Without appropriate tools, carrying out such plans would likely take many years of effort. The techniques used in web spoofing, however, could allow information professionals to work with Big Data in earnest. Would it be possible to spoof Google in order to query its content and obtain the information we need in bulk and in filtered form? The experiment conducted on Google shows that it is.
Web Scraping and Web Crawlers
Before explaining the “Google Web-Spoofing” experiment, it is necessary to understand the role played by web scraping and web crawlers. Web scraping is the technique used to download information from a website. The nature and type of the downloaded information varies: it may include links, text, and headlines, or extend to the complete source code. The process is equivalent to what an attentive human user could do by hand, except that it is preprogrammed and runs automatically. Web crawling programs, or web crawlers, use web scraping to obtain the links from which they build a map of the web for indexing and later retrieval. For this reason, knowledge of these techniques and systems is strategic not only for impersonating a website by modifying downloaded web content, but also for placing the construction of knowledge bases in the hands of information professionals.
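The two ideas above can be illustrated with a minimal Python sketch. It is not the tooling used in the experiment: the class and function names, the example.test URLs, and the in-memory sample site are all assumptions made for illustration, and no network request is performed. The scraper extracts the title, headlines, and links of a page; the crawler follows those links breadth-first to build a small map of the site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website": URL -> HTML source (no network access).
SAMPLE_SITE = {
    "http://example.test/": (
        "<html><head><title>Home</title></head><body><h1>Welcome</h1>"
        '<a href="/about">About</a> <a href="news.html">News</a></body></html>'
    ),
    "http://example.test/about": (
        '<title>About</title><h2>Who we are</h2><a href="/">Home</a>'
    ),
    "http://example.test/news.html": (
        '<title>News</title><h1>Latest <a href="/about">team</a> news</h1>'
    ),
}


class PageScraper(HTMLParser):
    """Collects the title, headlines (h1-h3) and absolute links of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.headlines = []
        self.links = []
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2", "h3"):
            self._current = tag
            if tag != "title":
                self.headlines.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current is not None:
            self.headlines[-1] += data


def scrape(url, html):
    """Parse one page and return the populated scraper."""
    parser = PageScraper(url)
    parser.feed(html)
    parser.close()
    return parser


def build_web_map(start_url, fetch, max_pages=10):
    """Breadth-first crawl; returns {url: [outgoing links]}."""
    web_map = {}
    queue = deque([start_url])
    while queue and len(web_map) < max_pages:
        url = queue.popleft()
        if url in web_map:
            continue
        html = fetch(url)  # fetch() returns None for unknown pages
        links = scrape(url, html).links if html is not None else []
        web_map[url] = links
        queue.extend(link for link in links if link not in web_map)
    return web_map


if __name__ == "__main__":
    for page_url, outgoing in build_web_map(
        "http://example.test/", SAMPLE_SITE.get
    ).items():
        print(page_url, "->", outgoing)
```

Passing the fetch function as a parameter keeps the crawl logic separate from the download step, so the dictionary lookup used here could be swapped for a real HTTP client, subject to the target site's terms of use and robots.txt.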
The Google Web-Spoofing Experiment
To demonstrate that it is possible to manipulate Google’s information, an impersonation experiment was designed. It consists of reproducing the search engine’s homepage and at least the first page of results returned when a user performs any query. The resulting appearance can be observed and compared in Figure 1.
Figure 1. Google vs. Google: which is which?
Despite minor differences, the two Google homepages are very similar, and it is difficult to determine which website is the original. In fact, both designs are real and genuine. The screen on the right displays the “Google Toolbar,” the black bar of shortcuts to major applications and services. It also shows the download prompt for the Google Chrome browser, the advanced search options, and the language tools. The design on the left lacks some of these elements yet keeps a similar appearance; it corresponds to the current view of the search engine in most web browsers. The question therefore arises: which is the original version of Google?
Original Version of Google
The latest update to Google’s design removes the shortcut bar and simplifies access to applications and services, as illustrated in Figure 2. The microphone icon for voice queries is also visible in the search box. These features make it possible to distinguish the original version of Google from any other variant. This can be verified by accessing the website http://www.google.es [Accessed on 2016-01-10].
Figure 2. Original Google website
Impostor Version of Google
Figure 3 shows Google’s alternative design. This appearance is displayed when the search engine is opened for the first time in a web browser, or when the parameter “noj=1” is set, as can be verified at the URL “https://www.google.es/?noj=1”. The alternative design also appears when Google’s source code is downloaded and served from a different domain or hosting platform, because not all styles and functions remain correctly linked under the new URL. As a result, Google falls back to its secondary design. The same situation occurs in the Internet Archive’s “Wayback Machine” initiative, which collects copies of the web’s most important pages: one of its most recent copies of Google has the appearance shown in the following screenshot.
Figure 3. Spoofed Google website
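One plausible reason why styles and functions stop being linked when a page is served from another host is relative URL resolution. The Python sketch below illustrates it under stated assumptions: the HTML fragment, its resource paths, and the mirror host mirror.example.test are invented for the example and are not taken from Google’s actual source code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

ORIGINAL_BASE = "https://www.google.es/"
MIRROR_BASE = "http://mirror.example.test/copy/"  # hypothetical hosting location

# Hypothetical fragment of downloaded source; the resource paths are invented.
DOWNLOADED_HTML = """
<link rel="stylesheet" href="/css/home.css">
<script src="js/search.js"></script>
<img src="https://static.example.test/logo.png">
"""


class ResourceFinder(HTMLParser):
    """Collects stylesheet, script and image references from a page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <link> points to its resource with href; <script> and <img> use src.
        ref = attrs.get("href") if tag == "link" else attrs.get("src")
        if ref:
            self.resources.append(ref)


def find_resources(html):
    finder = ResourceFinder()
    finder.feed(html)
    finder.close()
    return finder.resources


def relocated_references(refs, original_base, mirror_base):
    """Return the references that point somewhere else once the page moves."""
    return [
        ref
        for ref in refs
        if urljoin(original_base, ref) != urljoin(mirror_base, ref)
    ]


if __name__ == "__main__":
    refs = find_resources(DOWNLOADED_HTML)
    for ref in relocated_references(refs, ORIGINAL_BASE, MIRROR_BASE):
        print(ref, "->", urljoin(MIRROR_BASE, ref))
```

In this sketch only the absolute image URL keeps pointing to the same resource. The two relative references resolve against the mirror host, where the files do not exist unless they were copied too, so a copy that does not rewrite them to absolute URLs loses part of its styling and scripts.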
The Internet Archive uses the Heritrix web crawler to crawl Google’s website daily, among other sites, and to download its source code in order to preserve the digital memory of its homepage, but not its content. This explains why the archived representation differs from the original version. The same effect appears in the Google spoofing experiment, which can be tested at the following web address: http://www.google.es [Created on 2015-12-22].
Figure 4. The spoofing technique works with original Google queries and results; only the Complutense University logo reveals that the Google website is not authentic
When a search query is performed, the search engine’s results page appears, containing content identical to that provided by Google’s original page, as can be observed in the following video.
Google Web-Spoofing Experiment
It can be concluded that Google’s website can be impersonated and that its content can consequently be accessed through its search result pages. This shows that even advanced information systems can be vulnerable to this type of threat. It also points to a unique opportunity for information professionals: using the leading reference knowledge base to organize web information, develop new information services based on content aggregation, and explore many other possibilities yet to be discovered and investigated.