Search engine data not anonymous enough for researchers

Academics decide not to use data that makes it too easy to identify specific individuals.
Written by ZDNET Editors, Contributor

When it comes to data mining, academic research has long been the poor stepchild of the giant search engines that own the data. Data has become so commercialized that even when researchers get fresh data, privacy questions arise. The New York Times reports that data about people's behavior mined from corporate search engines such as AOL can reveal an individual's identity, undermining the anonymity the research depends on.

Jon Kleinberg, a professor of computer science, downloaded the query logs when they were newly released on a publicly accessible Web site late last month. He decided against using them after a brouhaha erupted over possible privacy breaches.

“Now it’s sitting there, in cold storage,” said Prof. Kleinberg, who works on algorithms for understanding the structure of the Web and searching it. “The number of things it reveals about individual people seems much too much. In general, you don’t want to do research on tainted data.”

The privacy controversy has the academic community arguing over how to use raw data. Data on people's behavior is plentiful at large search engines, but it remains under corporate lock and key and is only occasionally made available to academic researchers.

After it was determined that researchers could identify some of the subjects, AOL quickly withdrew the data from its research Web site, but not before it had been downloaded, reposted and made searchable at a number of Web sites.

For the last 10 years, academia has had to make do with one research data set from Excite and one from Alta Vista, both of which have long outlived their usefulness.

“The way people use search engines now is totally different,” Kleinberg said. “Partly because what you expected to get out of a search engine back then was much less, so people didn’t try anything too fancy.”

One solution could lie in more stringent “scrubbing” of the data in a way that does not diminish its quality as a research tool. For example, numbers that carry identifying information, like Social Security numbers and ZIP codes, could be replaced with zeros, and a phrase such as “New York” could be replaced with a code such as “X17.”

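To make the scrubbing idea concrete, here is a minimal Python sketch of that kind of substitution. It is an illustration only, not the method used by AOL or proposed by any researcher quoted here: the function name `scrub_query`, the `PLACE_CODES` table and the two regular expressions are assumptions chosen for the example. The ZIP pattern is deliberately crude and will zero out any five-digit number.

```python
import re

# Hypothetical lookup table: identifying place names mapped to opaque codes.
PLACE_CODES = {"new york": "X17"}

# Assumed formats: U.S. Social Security numbers written as 123-45-6789,
# and ZIP codes written as 12345 or 12345-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def _zero_digits(match):
    """Replace every digit in the matched text with 0, keeping punctuation."""
    return re.sub(r"\d", "0", match.group(0))


def scrub_query(query):
    """Return a copy of a search query with identifying tokens masked."""
    scrubbed = SSN_PATTERN.sub(_zero_digits, query)
    scrubbed = ZIP_PATTERN.sub(_zero_digits, scrubbed)
    for place, code in PLACE_CODES.items():
        # Match the place name as a whole phrase, ignoring case.
        pattern = r"\b" + re.escape(place) + r"\b"
        scrubbed = re.sub(pattern, code, scrubbed, flags=re.IGNORECASE)
    return scrubbed


print(scrub_query("pizza delivery New York 10001"))
```

Keeping the length and shape of the masked numbers preserves some structure for research while removing the values themselves. Pattern-based scrubbing of this kind cannot by itself guarantee anonymity, since a query can identify a person through its wording alone.
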
Professor Kleinberg said he hoped that over time, the AOL incident would lead to “a richer, more informed discussion about what it means to create data sets that are clean and anonymized.”