Judging a book by its cover (or a social network by its interface), Pinterest might look like a simple repository for images of frilly white ballgowns and endless vegan casserole recipes.
But the budding social media company wants to make it clear that, underneath it all, it is a big data company too.
Pinterest data engineer Mohammad Shahangian outlined the digital scrapbook's data infrastructure in a blog post on Thursday morning, highlighting how the Hadoop backbone surfaces relevant content and keeps the pinning momentum going:
Hadoop enables us to put the most relevant and recent content in front of users through features such as Related Pins, Guided Search, and image processing. It also powers thousands of daily metrics and allows us to put every user-facing change through rigorous experimentation and analysis.
In order to build big data applications quickly, we have evolved our single cluster Hadoop infrastructure into a ubiquitous self-serving platform.
Acknowledging that Hadoop is not "plug-and-play technology," Shahangian described further how Pinterest engineers have employed "a wide range of home-brewed, open source and proprietary solutions to meet each requirement" in building a personalized discovery engine.
Here's a snapshot of just how much data is being generated through that engine powering Pinterest:
- It logs 20 terabytes of new data daily.
- It stores approximately 10 petabytes of data in Amazon's Simple Storage Service (S3).
- Pinterest runs six standing Hadoop clusters comprising over 3,000 nodes.
- Developers generate more than 20 billion log messages and process nearly a petabyte of data with Hadoop each day.
- The current Hadoop setup (alongside some experimentation with managed Hadoop clusters) serves over 100 regular MapReduce users, who run more than 2,000 jobs daily through Qubole's web interface, ad-hoc jobs, and scheduled workflows.
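Pinterest hasn't published the code behind those jobs, but the basic shape of the MapReduce work described above — turning billions of raw log messages into daily metrics — can be sketched in a few lines. The log format and event names below are hypothetical, purely for illustration:

```python
from collections import defaultdict

# Hypothetical log lines; Pinterest's real log schema is not public.
LOG_LINES = [
    "2014-06-05T10:00:01 repin board=recipes",
    "2014-06-05T10:00:02 search query=wedding",
    "2014-06-05T10:00:03 repin board=travel",
]

def map_phase(line):
    """Map step: emit a (event_type, 1) pair for each log line."""
    timestamp, event_type, _payload = line.split(" ", 2)
    yield (event_type, 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each key, as a reducer would."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

pairs = [pair for line in LOG_LINES for pair in map_phase(line)]
print(reduce_phase(pairs))  # -> {'repin': 2, 'search': 1}
```

In a real Hadoop deployment the map and reduce functions run in parallel across the cluster, with the framework shuffling each key's pairs to the same reducer; the toy version above just runs both phases in-process to show the data flow.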
Shahangian also touted that the site holds more than 30 billion pins to date.
Image via Pinterest