Pinterest logs 20 terabytes of new data each day

The budding social network wants to remind people that underneath all of the DIY wedding tips, it, too, is a big data company.

Judging a book by its cover (or a social network by its interface), Pinterest might look like a simple repository for images of frilly white ballgowns and endless vegan casserole recipes.

But the fast-growing social media company wants to make it clear that underneath it all, it, too, is a big data company.

And like many enterprise tech stalwarts, Pinterest has demonstrated an interest of its own in open source, especially Hadoop.

Pinterest data engineer Mohammad Shahangian outlined the digital scrapbook's data infrastructure in a blog post on Thursday morning, highlighting how the Hadoop backbone surfaces relevant content and keeps the pinning momentum going:

Hadoop enables us to put the most relevant and recent content in front of users through features such as Related Pins, Guided Search, and image processing. It also powers thousands of daily metrics and allows us to put every user-facing change through rigorous experimentation and analysis.

In order to build big data applications quickly, we have evolved our single cluster Hadoop infrastructure into a ubiquitous self-serving platform.

Acknowledging that Hadoop is not "plug-and-play technology," Shahangian described further how Pinterest engineers have employed "a wide range of home-brewed, open source and proprietary solutions to meet each requirement" in building a personalized discovery engine.

Here's a snapshot of just how much data is being generated through that engine powering Pinterest:

  • Pinterest logs 20 terabytes of new data daily.
  • It stores approximately 10 petabytes of data in Amazon's Simple Storage Service (S3).
  • Its six standing Hadoop clusters comprise more than 3,000 nodes.
  • Developers generate more than 20 billion log messages and process nearly a petabyte of data with Hadoop each day.
  • The current Hadoop setup (alongside some experimentation with managed Hadoop clusters) supports more than 100 regular MapReduce users, who run over 2,000 jobs each day through Qubole's web interface, ad hoc jobs, and scheduled workflows.
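To make the numbers above concrete, here is a toy sketch of the kind of MapReduce-style log aggregation such a setup runs, written as plain Python rather than actual Hadoop code. The tab-separated log format and event names are invented for illustration; this is not Pinterest's pipeline, just the map-then-reduce pattern the figures describe, applied to counting log messages per event type.

```python
# Toy MapReduce-style log aggregation (illustrative only; the log format
# and event names are hypothetical, not Pinterest's actual schema).

from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_phase(log_lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (event_type, 1) pair for each tab-separated log line."""
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            yield fields[0], 1


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reducer: sum the counts per key (the shuffle/sort step is implicit)."""
    counts: Dict[str, int] = defaultdict(int)
    for key, n in pairs:
        counts[key] += n
    return dict(counts)


logs = [
    "pin_create\tuser123\tboard42",
    "pin_repin\tuser456\tboard7",
    "pin_create\tuser789\tboard42",
]
print(reduce_phase(map_phase(logs)))  # {'pin_create': 2, 'pin_repin': 1}
```

At Pinterest's scale the same pattern runs across thousands of nodes, with Hadoop handling the distribution, shuffling, and fault tolerance that this single-process sketch omits.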

Although the San Francisco-headquartered business hasn't revealed official user counts, reports peg the platform at between 40 million and 60 million monthly active users and growing.

Shahangian did note, however, that users have saved more than 30 billion pins to the site to date.

Image via Pinterest