Google hopes to standardize robots.txt by going open source

Google wants to make the Robots Exclusion Protocol an Internet standard.
Written by Charlie Osborne, Contributing Writer

Google is releasing robots.txt to the open-source community in the hope that the system will, one day, become a stable internet standard.

On Monday, the tech giant outlined the move to make the Robots Exclusion Protocol (REP) -- better known as robots.txt -- open-source, alongside its matching C++ library. 

REP is a way for webmasters to control the behavior of automated clients, such as crawlers, that attempt to visit a website. Its original creator, Martijn Koster, found that his website was being overwhelmed by crawlers and, in a bid to reduce server strain, developed the initial standard in 1994.

Directives are written into a text file that tells crawlers how to behave and whether or not they are permitted to visit a domain at all.
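To illustrate how such a file is read, here is a minimal sketch using Python's standard-library `urllib.robotparser` module rather than Google's C++ library. The file contents, the crawler names (`ExampleBot`, `BadBot`) and the `example.com` URLs are made up for illustration, and other parsers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

# Parse the rules from memory instead of fetching them over the network.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str) -> bool:
    """Return True if the parsed rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/index.html"),
        ("ExampleBot", "https://example.com/private/report.html"),
        ("BadBot", "https://example.com/index.html"),
    ]:
        print(agent, url, allowed(agent, url))
```

In this sketch, any crawler may fetch pages outside `/private/`, while the crawler identifying itself as `BadBot` is barred from the whole site.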

However, REP never became an official standard, so since the 1990s the protocol has been interpreted in different ways and has not been updated for modern use cases.

"This is a challenging problem for website owners because the ambiguous de-facto standard made it difficult to write the rules correctly," Google says. "We wanted to help website owners and developers create amazing experiences on the internet instead of worrying about how to control crawlers."

Google has now created draft REP documentation and has submitted its proposal to the Internet Engineering Task Force (IETF), an organization which promotes voluntary Internet standards. 

The draft does not change the rules originally established in 1994 by Koster but does expand robots.txt parsing and matching for modern websites -- such as the inclusion of FTP and CoAP alongside HTTP. 

In addition, Google has proposed that the first 500 kibibytes of a robots.txt file should be parsed in order to reduce server load, and a maximum caching time of 24 hours could also be implemented to prevent websites from being swarmed with indexing requests. 
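A crawler could honor both proposed limits roughly as follows. This is a sketch under stated assumptions, not Google's implementation: `CachedRobots`, `truncate_body` and the injected `fetch` and `clock` callables are hypothetical names, and only the two numbers (500 kibibytes and 24 hours) come from the proposal described above.

```python
import time
from urllib.robotparser import RobotFileParser

MAX_ROBOTS_BYTES = 500 * 1024      # 500 kibibytes, the proposed parsing limit
MAX_CACHE_SECONDS = 24 * 60 * 60   # 24 hours, the proposed maximum caching time


def truncate_body(body: bytes) -> bytes:
    """Keep only the first 500 KiB of a robots.txt body."""
    return body[:MAX_ROBOTS_BYTES]


class CachedRobots:
    """Hypothetical helper that re-fetches robots.txt at most once per 24 hours."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch    # assumed: a callable returning the raw file as bytes
        self._clock = clock    # injectable so the cache can be tested without waiting
        self._parser = None
        self._fetched_at = None

    def parser(self) -> RobotFileParser:
        now = self._clock()
        expired = (
            self._parser is None
            or now - self._fetched_at >= MAX_CACHE_SECONDS
        )
        if expired:
            body = truncate_body(self._fetch())
            fresh = RobotFileParser()
            # errors="replace" tolerates a multi-byte character cut by truncation.
            fresh.parse(body.decode("utf-8", errors="replace").splitlines())
            self._parser = fresh
            self._fetched_at = now
        return self._parser
```

Passing the fetch function and the clock in as arguments keeps the sketch free of network access, so the caching behavior can be checked in isolation.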

Google is currently seeking feedback on the draft rules. 

"As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right," Google added. 
