Nick Monaco

Bots



with an HTTP call. Users submit an HTTP call every time they type a webpage’s URL into a browser and press enter, or click on a link on the internet. One of the core features of HTML – the one that enables the World Wide Web to exist as a network of HTML pages – is the ability to embed hypertext, or “links,” to outside documents within a webpage. Crawler bots work by accessing a website through an HTTP call, collecting the hyperlinks embedded within the website’s HTML code, then visiting those hyperlinks using further HTTP calls. This process is repeated over and over again to map and catalogue web content. Along the way, crawler bots can be programmed to download the HTML underlying every site they visit, or to process facts about those sites in real time (such as whether a site appears to be a news outlet or an e-commerce site).
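The fetch, collect, and revisit loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular search engine's crawler works: the names `LinkCollector`, `fetch_over_http`, and `crawl`, and the `max_pages` limit, are assumptions made for this example, and only the Python standard library is used.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_over_http(url):
    """Make one HTTP call and return the page's HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_over_http, max_pages=50):
    """Visit pages breadth-first, recording which URLs each page links to.

    Returns a dict mapping every visited URL to its list of outgoing links.
    `max_pages` is an illustrative safety limit, not part of any standard.
    """
    catalogue = {}
    queue = deque([start_url])
    while queue and len(catalogue) < max_pages:
        url = queue.popleft()
        if url in catalogue:
            continue  # already visited
        try:
            html = fetch(url)
        except Exception:
            # Unreachable or missing page: record it with no links and move on.
            catalogue[url] = []
            continue
        collector = LinkCollector()
        collector.feed(html)
        # Resolve relative links against the current page and drop #fragments.
        links = [urldefrag(urljoin(url, href)).url for href in collector.links]
        catalogue[url] = links
        queue.extend(link for link in links if link not in catalogue)
    return catalogue
```

Passing the `fetch` function as a parameter lets the same loop run against a small in-memory set of pages for experimentation, without touching the live web. Note that this sketch does not yet check whether a site permits crawling.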

      Initially, these bots crawled the web and took notes on all the URLs they visited, assembling this information in a database known as a “Web Directory” – a place users could visit to see what websites existed on the web and what they were about. Quickly, advertisers and investors poured funds into these proto-search engines, realizing how many eyes would see them per day as the internet continued to grow (Leonard, 1996).

      We have already seen that bots can be used for either good or bad ends, and World Wide Web bots were no different. Originally used as a solution to the problem of organizing and trawling through vast amounts of information on the World Wide Web, bots were quickly adapted for more devious purposes. As the 1990s went on and the World Wide Web (and other online communities like Usenet and IRC) continued to grow, entrepreneurial technologists realized that there was a captive audience on the other end of the terminal. This insight led to the birth of the spambot: an automated online tool for promoting commercial products and advertisements at scale.

      Usenet was a precursor to more widespread spambot swarms on the internet at large, especially email (Ohno, 2018). Incidents like the botwars on Usenet news groups and IRC servers had, by the late 1990s, made it all too clear that bots would not be only a positive force on the internet. Negative uses of bots (spreading spam, crashing servers, denying content and services to humans, and posting irrelevant content en masse, just to name a few) could easily cause great harm – perhaps most damagingly, crawling websites to gather private or sensitive information.

      To solve the problem of bots crawling sensitive websites, a Dutch engineer named Martijn Koster developed the Robot Exclusion Standard (Koster, 1994, 1996). The Robot Exclusion Standard (RES) is a simple convention that functions as a digital “Do Not Enter” sign. A site that follows the convention publishes a “robots.txt” file that explains what content the site allows bots to access. Some sites allow bots to access any part of their domain, others allow access to some (but not all) parts of the website, and still others disallow bot access altogether. A site’s robots.txt file can be found by adding “/robots.txt” directly after the site’s domain name. For instance, you can access Facebook’s instructions for crawler bots at facebook.com/robots.txt. As you would expect, this file disallows nearly all forms of crawling on Facebook’s platform, since such crawling would violate users’ privacy, as well as the platform’s terms of service.
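To make the convention concrete, here is a small sketch of how a well-behaved bot might consult a robots.txt file before crawling, using the `urllib.robotparser` module from Python's standard library. The file contents, the bot names "FriendlyBot" and "BadBot", and the example.com URLs are all invented for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: every bot is kept out of /private/,
# and one bot (here called "BadBot") is barred from the whole site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a queryable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(SAMPLE_ROBOTS_TXT)

# A compliant bot asks before each visit whether the URL is open to it.
print(rules.can_fetch("FriendlyBot", "https://example.com/index.html"))    # True
print(rules.can_fetch("FriendlyBot", "https://example.com/private/data"))  # False
print(rules.can_fetch("BadBot", "https://example.com/index.html"))         # False
```

Against a live site, the same parser can be pointed at the published file with `set_url()` followed by `read()` instead of `parse()`. Nothing in the file itself prevents access: the check only has an effect if the bot's author chooses to run it.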

      Other spambots did not even follow the letter of the law. For example, a bot known as ActiveAgent ignored the RES altogether, scraping any website it could find for email addresses, regardless of the site’s policies on bot access. The anonymous developer behind ActiveAgent had a different business model, though. Rather than selling the email addresses the bot collected, the developer sold the bot’s source code to aspiring spammers for $100. Buyers could then modify this code for their own purposes, sending out spam emails with whatever message or product they wanted (Leonard, 1997, pp. 140–144). Thanks in part to malicious developers like the one behind ActiveAgent, new spamming techniques quickly multiplied as the web grew. Today, spambots and spamming techniques are still evolving and thriving. Estimates vary greatly, but some firms put the share of all email that is spam as high as 84 percent, as of October 2020 (Cisco Talos Intelligence, 2020).

      Clearly, the RES is not an absolute means of shutting down crawler bot activity online – it’s an honor system that presumes good faith on the part of bot developers, who must actively decide to make each bot honor the convention and encode these values into the bot’s programming. Despite these imperfections, the RES has been widely adopted and, for that reason, it continues to underlie bot governance online to this day. It is an efficient way to let bot designers know when they are violating a site’s terms of service and possibly the law.

      The user-friendly and user-centric web 2.0 had its own problems. Just as advertisers had realized in the 1990s that the World Wide Web was a revolutionary new opportunity for marketing (and sometimes spam), in the 2000s governments and activists began to realize that the new incarnation of the web was a powerful place to spread political messages. In this environment, political bots, astroturfing, and computational propaganda quickly proliferated, though it would take decades for the wider public to realize it (Zi et al., 2010). We’ll examine these dynamics in greater depth in our chapters on political bots and commercial bots.