Pinterest logs 20 terabytes of new data each day

Summary: The budding social network wants to remind people that underneath all of the DIY wedding tips, it, too, is a big data company.


Judging a book by its cover (or a social network by its interface), Pinterest might look like a simple repository for images of frilly white ballgowns and endless vegan casserole recipes.

But the budding social media company wants to make it clear that underneath it all, it, too, is a big data company. 

And like many enterprise tech stalwarts, Pinterest has demonstrated an interest of its own in open source, especially Hadoop.

Pinterest data engineer Mohammad Shahangian outlined the digital scrapbook's data infrastructure in a blog post on Thursday morning, highlighting how the Hadoop backbone surfaces relevant content and keeps the pinning momentum going:

Hadoop enables us to put the most relevant and recent content in front of users through features such as Related Pins, Guided Search, and image processing. It also powers thousands of daily metrics and allows us to put every user-facing change through rigorous experimentation and analysis.

In order to build big data applications quickly, we have evolved our single cluster Hadoop infrastructure into a ubiquitous self-serving platform.

Acknowledging that Hadoop is not "plug-and-play technology," Shahangian described further how Pinterest engineers have employed "a wide range of home-brewed, open source and proprietary solutions to meet each requirement" in building a personalized discovery engine.

Here's a snapshot of just how much data is being generated through that engine powering Pinterest:

  • It logs 20 terabytes of new data daily.
  • It stores approximately 10 petabytes of data in Amazon's Simple Storage Service (S3).
  • Pinterest runs six standing Hadoop clusters comprising more than 3,000 nodes.
  • Developers generate more than 20 billion log messages and process nearly a petabyte of data with Hadoop each day.
  • The current Hadoop setup (with managed Hadoop clusters also under evaluation) supports more than 100 regular MapReduce users, who run over 2,000 jobs each day through Qubole's web interface, ad-hoc jobs, and scheduled workflows.
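For readers unfamiliar with the model those 2,000 daily jobs follow, here is an illustrative sketch (not Pinterest's actual code) of MapReduce simulated in pure Python: a mapper emits (key, value) pairs from raw log lines, a shuffle phase groups them by key, and a reducer aggregates each group. The log format shown is hypothetical.

```python
# Minimal MapReduce simulation: count log events per type.
# The "timestamp event_type payload" log format is a made-up example.
from collections import defaultdict

def mapper(log_line):
    # Emit (event_type, 1) for each well-formed log line.
    fields = log_line.split()
    if len(fields) >= 2:
        yield fields[1], 1

def reducer(key, values):
    # Aggregate all counts emitted for one key.
    return key, sum(values)

def run_job(log_lines):
    # Shuffle phase: group mapper output by key.
    groups = defaultdict(list)
    for line in log_lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce phase: one reducer call per key.
    return dict(reducer(k, v) for k, v in groups.items())

logs = [
    "2014-06-12T10:00:01 pin_created user=1",
    "2014-06-12T10:00:02 pin_repinned user=2",
    "2014-06-12T10:00:03 pin_created user=3",
]
print(run_job(logs))  # {'pin_created': 2, 'pin_repinned': 1}
```

In a real Hadoop deployment, the map and reduce functions run in parallel across the cluster's nodes and the shuffle happens over the network; the programming model, however, is the same.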

Although the San Francisco-headquartered business hasn't revealed official user counts, reports say that the platform serves between 40 million and 60 million monthly active users and counting.

But Shahangian touted that there are more than 30 billion pins on the site to date.


