Yahoo opens up 13.5TB machine learning dataset for academic research

Yahoo is publishing the dataset with the goal of encouraging innovation, especially in how data from machine learning technologies can be repurposed.
Written by Rachel King

Yahoo is releasing what it says is the largest machine learning dataset ever made publicly available to the academic research community.

Suju Rajan, director of research at Yahoo Labs, said in prepared remarks that the search company is publishing the dataset with the goal of encouraging innovation, especially in how data from machine learning technologies can be repurposed.

"Many academic researchers and data scientists don't have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies," Rajan remarked.

Dubbed the Yahoo News Feed dataset, the collection is only a sample of anonymized user interactions from approximately 20 million users across a variety of Yahoo properties, including Yahoo News, Finance, Sports, Movies, Real Estate and the general homepage.

At 13.5 terabytes (or 13,500 gigabytes) of uncompressed data, the pool covers more than 110 billion events recorded between February and May 2015.
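To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above and assumes decimal units (1 TB = 1,000 GB), which is what the article's own conversion implies. The variable names are illustrative, and because "more than 110 billion" and "approximately 20 million" are rounded figures, the results are estimates rather than official statistics from Yahoo.

```python
# Figures quoted in the article; all rounded, so results are approximate.
TOTAL_TERABYTES = 13.5   # uncompressed size of the dataset
TOTAL_EVENTS = 110e9     # "more than 110 billion" events (a lower bound)
NUM_USERS = 20e6         # "approximately 20 million users"

# Assumption: decimal units, matching the article's "13,500 gigabytes".
BYTES_PER_TB = 10**12
BYTES_PER_GB = 10**9

total_bytes = TOTAL_TERABYTES * BYTES_PER_TB
total_gigabytes = total_bytes / BYTES_PER_GB

# Since the event count is a lower bound, this is an upper bound on
# the average uncompressed size of one event.
avg_bytes_per_event = total_bytes / TOTAL_EVENTS

# Likewise a lower bound on the average number of events per user.
avg_events_per_user = TOTAL_EVENTS / NUM_USERS

print(f"Total size: {total_gigabytes:,.0f} GB")
print(f"Average size per event: at most about {avg_bytes_per_event:.0f} bytes")
print(f"Average events per user: at least about {avg_events_per_user:,.0f}")
```

In other words, each recorded interaction averages little more than a hundred bytes, and the sample works out to several thousand events per user over the four-month window.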

Yahoo reiterated that the user data is anonymized. The fields available to researchers include age range, gender, and generalized geographic data, along with time stamps, items, titles, summaries and key phrases for articles and other accessed content, as well as the device or channel used for viewing.

Yahoo has already enlisted a few academic partners to tap into the dataset.

The Jacobs School of Engineering at the University of California, San Diego plans to use the data in the hope of improving ongoing research in machine learning, artificial intelligence and big data applications.

"Access to datasets of this size is essential to design and develop machine learning algorithms and technology that scales to truly 'big' data," explained Gert Lanckriet, a professor in the department of electrical and computer engineering at UC San Diego, in Thursday's announcement.

Researchers can access the dataset through Yahoo Labs' online Webscope library, part of its data-sharing program.
