There are harmless bots that crawl the web and announce what they're doing as they go about their business - and then there are bots that don't. Google, Facebook, Yahoo, and others in the online ad business are teaming up to stamp out the latter under a new pilot program with the Trustworthy Accountability Group (TAG).
At launch, Google will be the major contributor to TAG's war on ad fraud, offering up its own datacenter blacklist to ward off non-human ad traffic emanating from networks across the globe. Google's list aims to filter so-called "bad bots" that aren't detected by the IAB/ABC International Spiders and Bots List, an ad-industry initiative launched in 2006 to combat the estimated 40 percent of ad impressions suspected of coming from bots rather than humans.
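The mechanics of a datacenter blacklist are straightforward: an ad server checks each request's source IP against a list of known datacenter address ranges and discards matches. The sketch below illustrates the idea; the IP ranges shown are placeholder documentation addresses, since the actual Google/TAG blacklist is not public.

```python
import ipaddress

# Hypothetical sample ranges (RFC 5737 documentation blocks), standing in
# for real datacenter CIDR blocks on the blacklist.
DATACENTER_BLACKLIST = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blacklisted(source_ip: str) -> bool:
    """Return True if the request's source IP falls in a blacklisted datacenter range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in DATACENTER_BLACKLIST)

# An ad server would drop or discount impressions from blacklisted ranges:
print(is_blacklisted("203.0.113.42"))  # True
print(is_blacklisted("192.0.2.1"))     # False
```

In practice such lists are keyed to hosting providers' published IP allocations and updated continuously, which is why pooling data across companies, as TAG proposes, makes the filter more effective.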
Others contributing to the project include Dstillery, Facebook, MediaMath, Quantcast, Rubicon Project, The Trade Desk, TubeMogul, and Yahoo.
"Well-behaved bots announce that they're bots as they surf the web by including a bot identifier in their declared User-Agent strings. The bots filtered by this new blacklist are different. They masquerade as human visitors by using User-Agent strings that are indistinguishable from those of typical web browsers," Vegard Johnsen, a Google ad traffic quality product manager, wrote in a blog post.
Distil Networks, a bot-filtering company that is not participating in the TAG initiative, estimated in May that only 41 percent of web traffic originates from humans, with 23 percent coming from "bad bots" and 36 percent from good bots.
The company noted that one of the biggest sources of bad bots was Amazon, thanks to the low cost of its EC2 cloud. Other ISPs and datacenter operators cited as major bad-bot originators included Verizon Business, Level 3, Comcast Cable, and IBM's SoftLayer.
Meanwhile, online publishers are the hardest hit by bad bots, with roughly one-third of traffic to the category coming from them.
But as Johnsen notes, "unscrupulous publishers" are often part of the problem, employing dodgy techniques such as running software tools at datacenters to inflate traffic to their sites and create fake impressions and clicks.
Datacenter bots are just one type of non-human traffic, but they would have a major effect on click-through rates for ads served through Google's DoubleClick if the company didn't apply its datacenter filter, which in May screened out 8.9 percent of all clicks on DoubleClick Campaign Manager.
Two examples of bad-bot software hosted in datacenters are UrlSpirit and HitLeap. In May there were over 6,500 datacenter installations of UrlSpirit, while in mid-June there were 4,800 HitLeap installations running on datacenter virtual machines. UrlSpirit generated about half a billion fraudulent ad requests a month; HitLeap generated about a billion.
Describing the UrlSpirit scheme, Johnsen said: "Participating publishers install the UrlSpirit application on Windows machines and they each submit up to three URLs through the application's interface. Submitted URLs are then distributed to other installed instances of the application, where Internet Explorer is used to automatically visit the list of target URLs. Publishers who have not installed the application can also leverage the network of installations by paying a fee."
The other form of problem datacenter bot Johnsen noted was one that doesn't aim to defraud advertisers but nonetheless impacts ad campaigns: for example, tools run by an unnamed advertising-analytics company whose scrapers were responsible for 65 percent of automated datacenter clicks in May.