Facebook opens up about new infrastructure project: Apache Giraph

The engineering team said it is moving ahead with Apache Giraph because it scales "at an incredibly high rate."


Facebook has unveiled its version of Apache Giraph, touted to be the social network's next big infrastructure project.

Initially launched in 2012, Apache Giraph is an open source project billed as able to unleash "the potential of structured datasets at a massive scale."

The engineering team added that it is moving ahead with Apache Giraph to analyze Facebook's Social Graph because it scales "at an incredibly high rate."

For example, Facebook says it can use k-means to cluster a monthly active user data set of one billion input vectors, each with 100 features, into 10,000 centroids in less than 10 minutes per iteration.
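To give a sense of what one "iteration" involves, here is a minimal single-machine sketch of a k-means (Lloyd's algorithm) iteration in plain Python. It is an illustration only, not Facebook's or Giraph's implementation, and the function names are hypothetical. A naive assignment step compares every vector with every centroid, so at the sizes quoted above that is roughly 10^9 × 10^4 = 10^13 distance computations over 100 features each, which is why the work has to be spread across many machines.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_iteration(vectors, centroids):
    """Run one Lloyd iteration and return the updated centroids.

    Step 1: assign each vector to its nearest centroid.
    Step 2: move each centroid to the mean of the vectors assigned to it.
    """
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)

    for v in vectors:
        # Naive nearest-centroid search: one distance per centroid.
        nearest = min(
            range(len(centroids)),
            key=lambda i: squared_distance(v, centroids[i]),
        )
        counts[nearest] += 1
        for d in range(dims):
            sums[nearest][d] += v[d]

    new_centroids = []
    for i, old in enumerate(centroids):
        if counts[i] == 0:
            # No vectors were assigned: leave this centroid where it was.
            new_centroids.append(list(old))
        else:
            new_centroids.append([s / counts[i] for s in sums[i]])
    return new_centroids


if __name__ == "__main__":
    points = [[0, 0], [0, 2], [10, 10], [10, 12]]
    print(kmeans_iteration(points, [[0, 0], [10, 10]]))
```

In a distributed setting the per-centroid sums and counts are the natural quantities to combine across workers, since partial sums from different machines can simply be added together before the division.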

Avery Ching, a software engineer at Facebook, explained further in a blog post that the team wanted "a programming framework to express a wide range of graph algorithms in a simple way and scale them to massive datasets."

"We ended up choosing Giraph for several compelling reasons. Giraph directly interfaces with our internal version of HDFS (since Giraph is written in Java) and talks directly to Hive. Since Giraph runs as a MapReduce job, we can leverage our existing MapReduce (Corona) infrastructure stack with little operational overhead. With respect to performance, at the time of testing Giraph was faster than the other frameworks - much faster than Hive. Finally, Giraph’s graph-based API, inspired by Google’s Pregel and Leslie Valiant’s bulk synchronous parallel computing model, supports a wide array of graph applications in a way that is easy to understand. Giraph also adds several useful features on top of the basic Pregel model that are beyond the scope of this article, including master computation and composable computation."
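For readers unfamiliar with the Pregel-style, bulk synchronous parallel model mentioned in the post, the following is a rough single-process sketch in Python. It is not Giraph's Java API. The function name and data layout are assumptions made for illustration. Each vertex holds a value, and in every "superstep" a vertex reads the messages sent to it in the previous superstep, updates its value, and sends messages to its neighbors. The run stops when no messages are in flight. Here the vertices propagate the largest value in the graph.

```python
def propagate_max(edges, values, max_supersteps=50):
    """Vertex-centric maximum-value propagation in BSP style.

    edges:  dict mapping each vertex id to a list of out-neighbor ids
            (every vertex, including ones with no out-edges, must be a key).
    values: dict mapping each vertex id to its initial numeric value.
    """
    values = dict(values)  # work on a copy

    # Superstep 0: every vertex sends its value to its out-neighbors.
    inbox = {v: [] for v in edges}
    for v, neighbors in edges.items():
        for n in neighbors:
            inbox[n].append(values[v])

    supersteps = 1
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in edges}
        for v, messages in inbox.items():
            if not messages:
                continue  # no incoming messages: vertex stays halted
            best = max(messages)
            if best > values[v]:
                # Value improved: adopt it and tell the neighbors.
                values[v] = best
                for n in edges[v]:
                    outbox[n].append(best)
        # Synchronization barrier: messages become visible next superstep.
        inbox = outbox
        supersteps += 1
    return values


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(propagate_max(graph, {"a": 3, "b": 6, "c": 1}))
```

In a real deployment the vertices would be partitioned across many workers and the barrier between supersteps would be a cluster-wide synchronization point, but the per-vertex logic keeps this same "receive, update, send" shape.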

Giraph version 1.0.0 is already available to download through an Apache mirror.

Reps for the world's largest social network reiterated on Wednesday that graphs "are central to Facebook."

Facebook has stressed this for months, especially through a number of deep dive sessions with the media and engineering teams held at the company's Menlo Park headquarters.

For reference, the two main "graphs" are the Social Graph, which covers people and their connections, and the Open Graph, which is designed to let developers link objects in apps with user actions.

Chart via The Facebook Engineering Blog