Big data and the big privacy problem

Written by Simon Bisson, Contributor and Mary Branscombe, Contributor May 1, 2011 at 2:18 p.m. PT

If privacy were dead (as a number of technology executives in whose interest it is for us not to care about privacy have opined), there wouldn't have been much fuss over the most recent time researchers discovered that iPhones - like pretty much every other phone in the world - track your location and use it to build up maps and traffic information. Often that's where those handy green and red traffic lines on the maps come from: detecting how many people are driving and how fast (because digging up the roads to put sensors in is really expensive compared to scrubbing the identity off the incoming flood of location data from phones). Dash - now owned by RIM - used to boast that it put the new road on its map the day the Google off-ramp opened in Silicon Valley, based on just a handful of drivers using its devices, rather than having to wait a month to get an update. Better information faster; that's what we all want, so what's the problem?

In the case of the iPhone location trail, the problem is, one, that the average user didn't dig through umpteen pages of the iTunes EULA to see that Apple was asking for this information (Microsoft identity architect Kim Cameron has been particularly scathing about the impenetrability of the iTunes EULA on his personal blog for the last year) and, two, that leaving the location traces on the phone in a poorly protected file seems like an invitation to snoopers (legal and illegal) to grab information directly from the phone. That's less data mining and more data strip mining… The news that Android and Windows Phone also collect and anonymise location information isn't a surprise (and Microsoft yet again surprises critics by turning out to have the strongest privacy policy - and it hardly matters whether it's poacher turned gamekeeper syndrome, an honest belief that privacy matters, or a cynical belief that privacy matters and is a good way to compete). But while location gets a lot of attention because it's such a personal, private thing, there are a lot more sets of data out there that the industry needs to be having a conversation about.

One of those conversations is going on in the US Supreme Court right now, in a case about whether data mining companies can sell (anonymised) information they've gathered about the prescription of brand name drugs (instead of cheaper generics) back to the pharmaceutical companies. The Supreme Court is looking at how the case affects commercial free speech and whether the State of Vermont is just trying to stop drug marketing, but the decision could set a precedent for wider issues about turning masses of data into useful information, something that was a key part of the agenda at the Data 2.0 conference a few weeks ago.

Thanks to GPU computing, grids and commoditised high-performance computing, we can process in minutes or hours what used to take months and years. That's a huge benefit, in medicine and other areas. A few years ago experts suggested we'd reached peak oil and there were no major new 'elephant' fields to find; most of the elephant fields found since then have come by examining the survey data gathered years ago with faster computing techniques. The recent resurgence of fundamental research in AI and computer vision has been driven by the fact that a cheap graphics card or four can give you the power of a hugely expensive Unix workstation; Nvidia's conferences are split between the hard core gamers and the hard core researchers these days. And Google, Microsoft and - to a lesser extent - Apple are building services based on machine learning driven by huge data sets. Those range from the Internet itself to your smartphone's location history, your search history, hours of voice recordings or pages of handwriting or thousands of scanned books or information from sensors recently nicknamed the Internet of Things. The way Kinect can tell what's your hand, what's your hip and what you're shouting over a game at full volume? The uncannily accurate spelling correction and word prediction on Windows Phone? Google translation of Web pages? The immediate spelling correction in Google Wave? The location information in Google Maps (and that fake village in Lancashire a few miles from where I grew up)? Machine learning.

The technique of taking vast amounts of data and feeding it into a system that uncovers patterns and correlations isn't new (the mathematics that underlies it goes back to George Boole in the 1840s); the power to do it quickly, the accessible data sets to feed it and the source of those data sets are. When Microsoft developed the handwriting algorithms for the tablet PC a decade ago, they got handwriting samples from thousands of volunteers who knew what they were for. When you use Gmail, you probably do know that your email is being mined to teach Google Ads about language and ideas so it knows what ads to show you. When you drive around with a TomTom GPS, you probably don't know that aggregate traffic patterns and speeds were being sold as a data set that the police in the Netherlands bought and used to set speed cameras on roads that are both dangerous and routinely driven at over the legal speed limit. It's the same issue as Bing using the 'clickstream' of where you go and what you click and ending up replicating a small percentage of (false and deliberately-created) Google results: who has the rights to the information that comes out of the aggregation of data?

And what about when it's not actually that anonymous? Again and again at Data 2.0, companies based on aggregating and selling information talked about what they were doing - and found they were talking about privacy. Visit a Web site that uses the Triggit ad system and it grabs your IP address and looks up what it knows about you - including where the Quova IP geolocation service thinks you are - and decides who you as a user are valuable to. Are you someone Amazon will want to show an ad to? "In about 120 milliseconds," said Triggit CEO Zach Coelius, "real-time in the background there's a marketplace and they bid in the auction to assign this ad to you." Martin Wesley of BrightTag - which handles the tracking tags that help advertisers follow you from site to site in a way that lets Web sites choose what data is collected and which marketing partner gets to see it - sounded a cautionary note with a reference all the way back to 1999 when the DoubleClick ad service bought Abacus (before Google bought DoubleClick) and privacy advocates worried about all that personal data being used to serve ads. "Make sure your privacy policy lines up with what you're doing with the data. Everyone in the industry needs to handle this with care or this will set the industry back."

Is it my data in the first place, someone asked? How do I get my cut? That's hard, said Miten Sampat of Quova, and besides, you're already getting rewarded. "The answer is yes, it would be nice if there was a way for consumers to be compensated for opting in giving data to the ecosystem, it's just a hard problem to solve right now. There's so much confusion about what opting out is... This vision will come from somebody who creates a plugin that fits in a browser that understands all the data I'm sharing. But what you're getting already is free content - that really is what you're getting in exchange."

That could be creepy, said Sam Ramji of API aggregator platform Apigee (creepy is the line Google tries to get right up to but not to cross, as Larry Page put it a few months ago). "The challenge is the boundary of personalisation and privacy. It's wonderful to have a personalised web experience, it's even better to have a personalised app experience but I'm worried about what's happening to the data. I haven't found a way to say to Facebook I want you to be able to express this info about me to my app without telling them who I am. That's pretty creepy; I don’t want to live in that world. How can we enrich the database with our data and prevent the creepy factor or privacy violations creeping in as we try to create a personalised experience?"

Terry Jones of cloud data service Fluidinfo wanted users to be involved: "We as normal people are using apps and they're storing data on our behalf. They shouldn’t have the last word on our data; we should be able to add it or edit it…" Interestingly, the Cabinet Office recently launched the Better Choices, Better Deals strategy for mining data about what's a good value service (think information to help you switch electricity provider, on steroids, for multiple services), promising a new service called ‘mydata’ "which will enable consumers to access, control and use data currently held about them by businesses". And there's a privacy-first social network called Connect.Me launching soon that promises to let you choose who gets to see what information about you (as well as letting you vouch for the identity of people you know personally). The question is whether people care enough - when the latest news scare story has passed - to curate their own data.

Andreas Weigend, Amazon's former chief scientist, doesn't think so - and he started his presentation by showing his Stasi record. "It's a myth that people are interested in privacy. Look anywhere you want to look and it's maybe just some politicians interested in privacy. When you give people an opportunity to share, they will share." But not only is it in Amazon's interest to get you to share, that attitude leaves out all the issues about whether people know what they're sharing and with whom.

Just a few weeks before the whole smartphone location issue blew up, Jay Adelson, CEO of location programming service SimpleGeo, told the Data 2.0 conference something that now sounds slightly naive. "In general most of the users who have access [to location permissions in apps] are aware of what's being collected. I'm not sure we’ve really run into abuse of that data."

And Chris Palmer of the Electronic Frontier Foundation said bluntly: "To get to a unique id, if I have your birthday and zip code I know who you are - the end."
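
Palmer's point is easy to check on any data set that has had the names removed but the birthday and zip code left in. The following is a minimal sketch in Python, not anything shown at the conference: the records are entirely made up, the function names are hypothetical, and the measure it reports (the size of the smallest group sharing the same birthday and zip code, usually called k-anonymity) is a standard idea from the privacy literature rather than something Palmer named.

```python
from collections import Counter

# Entirely made-up "anonymised" records: names removed, but birthday and
# zip code (the two fields Palmer mentions) left in place.
RECORDS = [
    {"birthday": "1975-03-14", "zip": "94103", "purchase": "allergy pills"},
    {"birthday": "1975-03-14", "zip": "94110", "purchase": "running shoes"},
    {"birthday": "1982-11-02", "zip": "94103", "purchase": "baby monitor"},
    {"birthday": "1982-11-02", "zip": "94103", "purchase": "guitar strings"},
    {"birthday": "1990-07-21", "zip": "05401", "purchase": "cookbook"},
]

QUASI_IDENTIFIERS = ("birthday", "zip")


def group_sizes(records, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of the given fields."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def unique_records(records, keys=QUASI_IDENTIFIERS):
    """Return the records whose combination of fields appears exactly once."""
    sizes = group_sizes(records, keys)
    return [r for r in records if sizes[tuple(r[k] for k in keys)] == 1]


def k_anonymity(records, keys=QUASI_IDENTIFIERS):
    """Size of the smallest group; 1 means at least one person stands alone."""
    return min(group_sizes(records, keys).values())


if __name__ == "__main__":
    exposed = unique_records(RECORDS)
    print(f"{len(exposed)} of {len(RECORDS)} records are unique on {QUASI_IDENTIFIERS}")
    print(f"k-anonymity of the data set: {k_anonymity(RECORDS)}")
```

Any record that ends up alone in its group can be matched to a name by anyone who knows that person's birthday and zip code from another source, which is why stripping the name field does not by itself make data anonymous.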

Anonymising data, adding value: the boundaries aren't clear so Palmer suggested some useful principles. "The challenge is to make sure data mining doesn't become data strip mining - that we don't burn down the forest to make a lot of money quick but with no long term value. In a lot of business models today, the issue is that the value proposition is vague... Everyone is skimming off that ambiguity; minimising customer surplus and maximising their own. Without trust, without trustworthy behaviour, it's strip mining. If you can't say what you do for a living in one sentence, it's probably illegal. If you can't say to the consumer what it is you do in a way they can understand - maybe you should reconsider what you do."

Mary Branscombe
