Yahoo opens up 13.5TB machine learning dataset for academic research

Yahoo is publishing the dataset to encourage innovation, particularly in how data from machine learning technologies can be repurposed.
Written by Rachel King, Contributor

Yahoo is releasing what it claims is the largest machine learning dataset ever made publicly available to the academic research community.

Suju Rajan, director of research at Yahoo Labs, said in prepared remarks that the search company is publishing the dataset to encourage innovation, particularly in how data from machine learning technologies can be repurposed.

"Many academic researchers and data scientists don't have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies," Rajan remarked.

Dubbed the Yahoo News Feed dataset, the collection is actually just a sample set of anonymized user interactions from approximately 20 million users tuning into a variety of Yahoo properties, including Yahoo Finance, Sports, Movies, Real Estate and the general homepage as well as News.

At 13.5 terabytes (or 13,500 gigabytes) of uncompressed data, the pool covers a swath of more than 110 billion events between February and May 2015 alone.

The user data is anonymized. The fields available to researchers include age range, gender, and generalized geographic data, along with timestamps, items, titles, summaries and key phrases for articles and other accessed content, as well as the device or channel used for viewing.

Yahoo has already enlisted a few academic partners to tap into the dataset.

The Jacobs School of Engineering at the University of California, San Diego plans to use the data in hopes of improving ongoing research in machine learning, artificial intelligence and big data applications.

"Access to datasets of this size is essential to design and develop machine learning algorithms and technology that scales to truly 'big' data," explained Gert Lanckriet, a professor in the department of electrical and computer engineering at UC San Diego, in Thursday's announcement.

Researchers can access the dataset through Webscope, the online library for Yahoo Labs' data-sharing program.
