with an HTTP call. Users submit an HTTP call every time they type a webpage’s URL into a browser and press enter, or click on a link. One of the core features of HTML – the one that enables the World Wide Web to exist as a network of HTML pages – is the ability to embed hyperlinks, or “links,” to outside documents within a webpage. Crawler bots work by accessing a website through an HTTP call, collecting the hyperlinks embedded within the website’s HTML code, then visiting those hyperlinks using another HTTP call. This process is repeated over and over again to map and catalogue web content. Along the way, crawler bots can be programmed to download the HTML underneath every website, or to process facts about those sites in real time (such as whether a site appears to be a news outlet or an e-commerce site).
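The fetch, collect, and revisit loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular search engine's crawler was built: the `crawl` function, the `fetch` parameter, and the in-memory `FAKE_WEB` pages are all assumptions made for the example, with a dictionary lookup standing in for the real HTTP call so that the sketch runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: [outgoing links]} for visited pages.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page's HTML as a string, or None if the page cannot be retrieved.
    """
    seen = {start_url}
    queue = deque([start_url])
    catalogue = {}
    while queue and len(catalogue) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # stands in for the HTTP call
        if html is None:
            continue
        parser = LinkCollector()
        parser.feed(html)
        # Resolve relative links such as "/news" against the current page.
        links = [urljoin(url, href) for href in parser.links]
        catalogue[url] = links
        for link in links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return catalogue


# Toy in-memory "web" (made-up pages) so the sketch runs offline.
FAKE_WEB = {
    "http://example.com/": '<a href="/news">News</a> <a href="/shop">Shop</a>',
    "http://example.com/news": '<a href="/">Home</a>',
    "http://example.com/shop": '<a href="/news">News</a>',
}

catalogue = crawl("http://example.com/", FAKE_WEB.get)
for page, links in catalogue.items():
    print(page, "->", links)
```

The `seen` set is what keeps the bot from looping forever when pages link back to each other, and `max_pages` is an arbitrary cap chosen for the example.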
Initially, these bots crawled the web and took notes on all the URLs they visited, assembling this information in a database known as a “Web Directory” – a place users could visit to see what websites existed on the web and what they were about. Advertisers and investors quickly poured funds into these proto-search engines, realizing how many people would see them each day as the internet continued to grow (Leonard, 1996).
Though Google eventually became the dominant search engine for navigating the web, the 1990s saw a host of corporate and individual search engine start-ups, all of which used bots to index the web. The first of these was Matthew Gray’s World Wide Web Wanderer in 1993. The next year, Brian Pinkerton wrote WebCrawler, and Michael Mauldin created Lycos (Latin for “wolf spider”), both of which were even more powerful spiders than the World Wide Web Wanderer. Other search engines, like AltaVista and (later) Google, also employed bots to perfect the art of searching for and organizing information on the web (Indiana University Knowledge Base, 2020; Leonard, 1997, pp. 121–124). The indexable internet – that is, publicly available websites on the World Wide Web that allow themselves to be visited by crawler bots and be listed in search engine results – is known as the “clear web.”
Spambots and the development of the Robot Exclusion Standard
We have already seen that bots can be used for either good or bad ends, and World Wide Web bots were no different. Originally used as a solution to the problem of organizing and trawling through vast amounts of information on the World Wide Web, bots were quickly adapted for more devious purposes. As the 1990s went on and the World Wide Web (and other online communities like Usenet and IRC) continued to grow, entrepreneurial technologists realized that there was a captive audience on the other end of the terminal. This insight led to the birth of the spambot: online automated tools to promote commercial products and advertisements at scale.
One of the very first spambots was on Usenet. In April 1994, two lawyers, Laurence Canter and Martha Siegel, contracted a programmer to help promote an advert for their law firm’s assistance in the US Green Card Lottery. The programmer decided to use automation to reach as many users as possible. His bot – considered the first spambot on the modern internet – posted the ad to 6,000 newsgroups in under ninety minutes. The incident provoked a strongly negative reaction from the Usenet community, and one user responded by building a cancelbot that removed all of the spambot’s posts from the targeted newsgroups (Leonard, 1997, pp. 165–167).
Usenet was a precursor to more widespread spambot swarms on the internet at large, especially email (Ohno, 2018). Incidents like the botwars on Usenet newsgroups and IRC servers had, by the late 1990s, made it all too clear that bots would not be only a positive force on the internet. Negative uses of bots (spreading spam, crashing servers, denying content and services to humans, and posting irrelevant content en masse, just to name a few) could easily cause great harm. Perhaps the most damaging of these uses was crawling websites to gather private or sensitive information.
To solve the problem of bots crawling sensitive websites, a Dutch engineer named Martijn Koster developed the Robot Exclusion Standard (Koster, 1994, 1996). The Robot Exclusion Standard (RES) is a simple convention that functions as a digital “Do Not Enter” sign. A site that follows the convention publishes a “robots.txt” file at the root of its domain that explains what content the site allows bots to access. Some sites allow bots to access any part of their domain, others allow access to some (but not all) parts of the website, and still others disallow bot access altogether. A site’s robots.txt file, if it has one, can be found by adding “/robots.txt” to the end of the site’s root URL. For instance, you can access Facebook’s instructions for crawler bots at facebook.com/robots.txt. As you would expect, this file disallows nearly all forms of crawling on Facebook’s platform, since this would violate users’ privacy, as well as the platform’s terms of service.
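To make the convention concrete, here is a minimal sketch that uses Python’s standard-library `urllib.robotparser` module to read a robots.txt file and ask whether a given bot may visit a given URL. The rules, the bot names (`FriendlyBot`, `BadBot`), and the `example.com` URLs are invented for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every bot is kept out of /private/,
# and one named bot is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Ask the parsed rules what each (hypothetical) bot is allowed to fetch.
checks = [
    ("FriendlyBot", "http://example.com/index.html"),
    ("FriendlyBot", "http://example.com/private/data.html"),
    ("BadBot", "http://example.com/index.html"),
]
for agent, url in checks:
    verdict = "allowed" if rules.can_fetch(agent, url) else "disallowed"
    print(f"{agent} -> {url}: {verdict}")
```

Against a live site, a bot would typically call `set_url()` and `read()` on the parser to download the file instead of passing the rules in as text.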
The late 1990s saw several high-profile examples of controversial bots that followed these standards while arguably violating their intentions, and others that proudly flouted them. RoverBot, created in 1996, was one of the former: a crawler that retrieved a set of websites relating to a pre-specified topic and scraped email addresses from them. The company that built RoverBot then sold these lists of email addresses to paying customers, who used them to send out spam advertisements. While RoverBot certainly had its detractors, the firm behind it insisted that it followed rules (such as the RES) while scraping the web.
Other spambots did not even follow the letter of the law. For example, a bot known as ActiveAgent ignored the RES altogether, scraping any website it could find for email addresses, regardless of the site’s policies on bot access. The anonymous developer behind ActiveAgent had a different business model, though. Rather than selling the email addresses the bot collected, the developer sold its source code to aspiring spammers for $100. Buyers could then modify this code for their own purposes, sending out spam emails with whatever message or product they wanted (Leonard, 1997, pp. 140–144). Thanks in part to malicious developers like the one behind ActiveAgent, new spamming techniques quickly multiplied as the web grew. Today, spambots and spamming techniques are still evolving and thriving. Estimates vary greatly, but some firms put the share of all email that is spam as high as 84 percent as of October 2020 (Cisco Talos Intelligence, 2020).
Clearly, the RES is not an absolute means of shutting down crawler bot activity online – it’s an honor system that presumes good faith on the part of bot developers, who must actively decide to make each bot honor the convention and encode these values into the bot’s programming. Despite these imperfections, the RES has been widely adopted and, for that reason, it continues to underlie bot governance online to this day. It is an efficient way to let bot designers know when they are violating a site’s terms of service and possibly the law.
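The honor-system nature of the RES is easy to see in code: the check lives inside the bot itself, so nothing on the website’s side enforces it. The sketch below is purely illustrative. The functions `raw_fetch` and `polite_fetch`, the bot name `ExampleBot`, and the rules are all hypothetical, and `raw_fetch` fakes the page download rather than making a real HTTP call.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for a made-up site: keep all bots out of /members/.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /members/"])


def raw_fetch(url):
    """Stand-in for an HTTP call; a real bot would download the page here."""
    return f"<html>contents of {url}</html>"


def polite_fetch(url, user_agent="ExampleBot"):
    """Fetch a page only if the site's rules permit it; otherwise return None."""
    if not rules.can_fetch(user_agent, url):
        return None  # the developer chose to honor the "Do Not Enter" sign
    return raw_fetch(url)


members_url = "http://example.com/members/list.html"
honored = polite_fetch(members_url)  # a compliant bot declines to fetch
ignored = raw_fetch(members_url)     # a non-compliant bot skips the check
print(honored)
print(ignored)
```

A bot that calls `raw_fetch` directly retrieves the page all the same, which is the gap that a bot ignoring the RES exploits.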
Social media and the dawn of social bots
Social media supercharged bot evolution in the late 2000s. During this period, the cost of broadband internet declined, connectivity increased, and computing power grew. A growing number of people began to spend more and more time on social media sites, producing their own content. The entire web began to evolve, shifting from a slow, company-driven, rocky experience to a smoother, sleeker, and user-friendly one in which user-generated content took the foreground. This new user-centric version of the internet came to be known as the “web 2.0” (O’Reilly, 2005).
The user-friendly and user-centric web 2.0 had its own problems. Just as advertisers had realized in the 1990s that the World Wide Web was a revolutionary new opportunity for marketing (and sometimes spam), in the 2000s governments and activists began to realize that the new incarnation of the web was a powerful place to spread political messages. In this environment, political bots, astroturfing, and computational propaganda quickly proliferated, though it would take years for the wider public to realize it (Zi et al., 2010). We’ll examine these dynamics in greater detail in our chapters on political bots and commercial bots.
In every case, online environments that are welcoming to bot innovations – Usenet, IRC, or MUD-gaming platforms in the late 1980s and early 1990s, or Twitter in the late aughts – have consistently been strong drivers of bot evolution. The design of these