Yahoo opens up 13.5TB machine learning dataset for academic research

Yahoo is publishing the dataset with the goal of encouraging innovation, especially in how data from machine learning technologies can be turned around and used for new purposes.

Yahoo is releasing what it bills as the largest machine learning dataset ever made publicly available to the academic research community.

Suju Rajan, director of research at Yahoo Labs, said in prepared remarks that the search company is publishing the dataset to encourage innovation, especially in how data from machine learning technologies can be turned around and used for new purposes.

"Many academic researchers and data scientists don't have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies," Rajan remarked.

Dubbed the Yahoo News Feed dataset, the collection is actually just a sample set of anonymized user interactions from approximately 20 million users tuning into a variety of Yahoo properties, including Yahoo Finance, Sports, Movies, Real Estate and the general homepage as well as News.

At 13.5 terabytes (or 13,500 gigabytes) of uncompressed data, the pool covers a swath of more than 110 billion events between February and May 2015 alone.

The user data, to reiterate, is anonymized. The metrics available to researchers include age range, gender and generalized geographic data, along with time stamps, items, titles, summaries and key phrases for articles and other accessed content, as well as the device or channel used for viewing.

Yahoo has already enlisted a few academic partners to tap into the dataset.

The Jacobs School of Engineering at the University of California, San Diego plans to use the data in the hope of improving ongoing research in machine learning, artificial intelligence and big data applications.

"Access to datasets of this size is essential to design and develop machine learning algorithms and technology that scales to truly 'big' data," explained Gert Lanckriet, a professor in the department of electrical and computer engineering at UC San Diego, in Thursday's announcement.

Researchers can access the dataset through Yahoo Labs' online Webscope library, part of its data-sharing program.
