Clicks: where did you think the data was coming from?


The furore over whether Bing should have surgically excised the Google clickstream from the data it gets about what IE and Bing Bar users do online (and that link is an interesting perspective from a Bing developer) - or perhaps have engineered in some checking on results going into the index based on only a single clickstream 'signal' - raises some interesting points about what counts as fair competition, what counts as unfair behaviour, and how machine learning is dominating development (at Microsoft, Google and elsewhere).

If Bing shouldn't be able to use information about what users do on the Google site (which might seem a reasonable prohibition at first thought), should it be able to use the results of your pattern of behaviour on Blekko, on Ask, on the BBC or on Amazon? Bing and Google both know things about where some products are on the Amazon site that the Amazon search engine doesn't, as I've found when checking the Amazon UK selling price for some devices: a Google or Bing search for the product name plus the keyword phrase 'Amazon UK' will give you the Amazon UK page for some products the search box on Amazon UK just can't find. I find that really useful.

We give our social graph to Facebook and LinkedIn and Twitter for free, in exchange for a convenient place to have conversations, and they use it themselves to develop new services - and sometimes they share it or sell it. Twitter thinks the 'firehose' of tweets in transit is both public and something it's happy to share; any search engine can negotiate for or buy access to use it as a way of understanding links and content and semantics and anything else they think they can learn. But if you restrict search engines and services to only getting information from sites they've negotiated an agreement with, searching Amazon and Twitter and the BBC from Bing might carry on improving, but a million smaller sites would fall by the wayside.

And the precedent of saying that a site can keep the anonymised but publicly available details of how users use it for its own information and development alone has other implications for all sorts of services we've come to expect to be freely available. How much of the metadata of the way a site is built and used should be proprietary? There's no benefit to a hotel site that has a French version of its English pages in having Google scrape the parallel pages as the basis of its machine learning-based translation tools between French and English; indeed it might help a competing hotel that hasn't paid for French translations, because it knows visitors can use the Google service. But then that French/English hotel site probably isn't complaining that the translation service lets Italian visitors read a machine translation of its pages, and the Italian/English sites the service has learned from get the benefit of machine translations into Greek, and so on…

The CAPTCHAs you have to type in to comment on blogs like this? A lot of them are used to check the OCR'd manuscripts going into Google Books. Google learns from scanning your email in Gmail, and from the click patterns on all the sites that use Google Analytics; it's more likely to be using that kind of information to place ads on pages than to place results on search pages, but it's still using it.

Just like Microsoft, which has been using telemetry for years (probably one reason the Bing team sounds taken aback to be called on using it; telemetry is practically a religion at Microsoft). Microsoft has long used what it learns from the anonymous, opt-in Customer Experience Improvement Program (CEIP) about which commands Office users click the most: Paste, Save and Copy are the most common in Word, followed by Undo - and Paste is so frequently followed by Undo that the Paste Options popup was designed to save you having to undo.

Some major features in Office have come out of telemetry, according to Steven Sinofsky (who used to run the Office team) - as well as more minor changes. "We learned that a very significant amount of time the first suggestion in the spelling dictionary was the right correction (hence autocorrect). We learned that no one ever read the tip of the day (“Don’t run with scissors”)." Lots of applications have autocorrection now, from OpenOffice to Google Docs; if any of them were inspired by the Office team's discovery that spell checking was good enough to be useful, that's your clickstream data out there having an influence.

Telemetry quite literally made a lot of Windows 7 what it is, as Sinofsky explained repeatedly and publicly on the Engineering Windows 7 blog - and summed up nicely at PDC 2009. "Anytime you plug a device into Windows, we can have the opportunity to get diagnostics to learn what device is plugged in, what drivers were loaded, did the drivers come from you or from a local machine, 32- or 64-bit, was the installation of those successful? ...Another element of telemetry is what we call the software quality monitor... SQM is our way of understanding what features of Windows or any software that Microsoft makes are you using. What are the buttons you're clicking on, are you using keyboard accelerators or the sequence of events. Well, with all of these telemetry items, they'll all respect your privacy, they're all voluntary, and they're all opt-in... But it turns out that over 80% of our customers voluntarily opt-in to sending us this information." The 100 million SQM sessions a month that beta users generated had a huge impact. Windows 7 is Microsoft's most popular OS in a very long time in large part because it does make it easier to do things the way people actually work - and because the telemetry gathered from beta users let the Windows team work on performance until users on PCs out in the world were seeing the Start menu open in the target 50-100 milliseconds.

For IE 9, as Dean Hachamovitch put it last year, "We use many, many data sources from customers to inform what we build and how we build it. The Connect database is one of several sources; we have a SQM database, telemetry, error reporting - all these different sources of data. When you connect data from hundreds of millions of users around what they actually do and how they actually do it, that is extremely powerful." It's not just finding which APIs Web sites use that IE 8 didn't support and adding them, or looking at sites to see which browser subsystems need to speed up to make them load faster. IE 9 also takes the anonymised list of which executable files are being downloaded through the browser and uses it as part of building a reputation service for applications, so you only see warnings for downloads that are likely to be dangerous.

Similarly, the specific malware that the Malicious Software Removal Tool looks for each month is based on what's been showing up in the telemetry from systems running Windows Defender and Security Essentials. Microsoft shares that information with the security vendors in the Microsoft Virus Information Alliance, but many security vendors gather their own telemetry. ZoneAlarm is free because it gathers information about security problems, which makes Check Point more valuable to paying customers. The spam button you click in your email package doesn't just get the message out of your inbox; it can end up marking the sender as a spammer on a blocklist that I can use on my mail server.
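That blocklist step is simpler than it sounds: a DNSBL publishes listed senders as ordinary DNS records, so a mail server reverses the octets of the connecting IP address, appends the blocklist zone and does a standard lookup; any answer at all means the sender is listed. A minimal sketch in Python, assuming the widely used zen.spamhaus.org zone (any DNSBL zone works the same way):

```python
import socket

def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the DNSBL lookup name: 203.0.113.7 -> 7.113.0.203.zen.spamhaus.org"""
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip, zone="zen.spamhaus.org"):
    """Return True if the blocklist has any A record for this IP (i.e. it is listed)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True   # an answer means the sender has been reported
    except socket.gaierror:
        return False  # NXDOMAIN: not on the list
```

A real mail server would run this check at SMTP connection time and treat a hit as grounds to reject the message or raise its spam score, which is exactly how one user's spam-button click ends up protecting someone else's inbox.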

Got your iPhone turned on as you walk around? You're probably sending location and cellular network data back to Apple; "We may collect information such as occupation, language, zip code, area code, unique device identifier, location, and the time zone where an Apple product is used so that we can better understand customer behaviour and improve our products, services, and advertising," as Apple's privacy policy puts it. Nokia phones and BlackBerrys collect anonymised location information to use in Ovi Maps and RIM's ETA service respectively; Vodafone takes the aggregated movements of phone users and turns them into data about traffic speeds. (There's ongoing debate about whether Google gathers too much information from phones and Street View cars when it uses your location for Google Maps on your phone.) Treating phones as 'sensors' like weather stations is part of the basis for what's been called the Internet of Things; it's an extension of the fact that looking at what people in general do is a good way of understanding both what people in general want and what's going on in the world.

Collectively gathering, anonymising and learning from user behaviour is what many of the technology tools we use every day are built on. The assumption is that we own our online behaviour, not that the sites we visit own it. We want it anonymised and we want our privacy protected, but we give it to vendors for free - knowingly or unknowingly, depending on whether we've bothered to read the licence and privacy agreements - and in return we get a service (or sometimes just the opportunity to buy a service, which doesn't feel as fair). The very best thing that could come out of this heated discussion - along with the ongoing discussion of Web search quality that it disrupted - would be a wider awareness of the information users contribute to technology companies, some discussion about who gets to control it, and whether we want to give up the services that use machine learning to extract information from that data (because while it's valuable data, it's only data - not information or knowledge).
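What 'gathering and anonymising' typically looks like in practice is something like this: replace raw identifiers with a salted one-way hash before anything is stored, then keep only aggregate counts, and drop any item seen by fewer than some threshold of distinct users so no individual stands out. A minimal sketch - the salt, the threshold k and the (user, URL) event format are all my illustrative assumptions, not any particular vendor's pipeline:

```python
import hashlib

SALT = b"rotate-me-regularly"  # assumption: a secret salt, rotated periodically

def pseudonymise(user_id):
    # One-way hash so raw identifiers never reach the analytics store.
    return hashlib.sha256(SALT + user_id.encode()).hexdigest()[:16]

def aggregate_clicks(events, k=5):
    # events: iterable of (user_id, url) pairs. Count distinct users per URL
    # and suppress anything below the threshold k, so rare behaviour that
    # could identify an individual never appears in the output.
    seen = {}
    for user, url in events:
        seen.setdefault(url, set()).add(pseudonymise(user))
    return {url: len(users) for url, users in seen.items() if len(users) >= k}
```

With six users visiting one page and a single user visiting another, only the first page survives aggregation; the lone visitor's behaviour is suppressed rather than reported.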

Right now, anything that anyone does online is up for tracking, as long as their privacy is protected. If that rule is going to change, it has to change for everyone.

Mary Branscombe
