Facebook readies HydraBase to cut HBase downtime to five minutes a year

Facebook is deploying an upgrade that should help boost the resilience of its Apache HBase system.

On a quest to cut Messages downtime to five minutes a year, Facebook is preparing to roll out HydraBase, new tech that should make its widely used Apache HBase system more resilient to region server failures.

Facebook's infrastructure engineers announced the new system in a blog post on Thursday, flagging uptime improvements on the way for dozens of Facebook services and products that are built on HBase, a database management system that runs on top of the Hadoop Distributed File System (HDFS).

Improving HBase's resilience is a pretty important task for Facebook, given how widely it's used. Products and services Facebook has built on HBase include Messages, online analytics processing workloads, an internal monitoring system, the new Nearby Friends feature, search indexing, streaming data analysis, and data scraping for its internal data warehouses.

Facebook engineers are currently testing HydraBase and plan to roll it out in phases to the company's production clusters. Once it is deployed across multiple data centres, they reckon it could increase HBase availability from 99.99 percent to 99.999 percent, meaning a maximum of about five minutes of downtime a year.
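To see where the five-minute figure comes from, the arithmetic can be sketched in a few lines of Python. This is an illustrative calculation only: the function name and the choice of a 365.25-day year are assumptions made here, not something taken from Facebook's post.

```python
# Illustrative arithmetic: convert an availability percentage into the
# maximum downtime it allows per year. The 365.25-day year is an assumption.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def max_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


four_nines = max_downtime_minutes(99.99)    # roughly 52.6 minutes a year
five_nines = max_downtime_minutes(99.999)   # roughly 5.3 minutes a year

print(f"99.99%  -> {four_nines:.1f} minutes/year")
print(f"99.999% -> {five_nines:.1f} minutes/year")
```

In other words, moving from four nines to five nines shrinks the yearly downtime budget by a factor of ten, from a little under an hour to a little over five minutes.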

What a HydraBase deployment might look like. Image: Facebook

The main improvement HydraBase brings is distributing the role of the standby servers that lie in wait for when one of its region servers crashes.

Currently, as Facebook's engineers explain: "When a region server fails, all the regions hosted by that region server will migrate to another region server, providing automatic failover."

While it is automatic, the process of migrating files in HBase causes delays. HydraBase tackles this by "decoupling logical and physical replication".

"Instead of having each region being served by a single region server, in HydraBase, each region is hosted by a set of region servers. When a region server fails, there are standby region servers ready to service those regions," Facebook's engineers write.

"These standby region servers can be spread across different racks or even data centres, providing availability across different failure domains. The set of region servers serving each region form a quorum. Each quorum has a leader that services read and write requests from the client.

"Within each quorum, each member is either in active or witness mode. Active mode members are writing to HDFS and performing data flushes and compactions. Witness mode members only participate in replicating the WALs, but can assume the role of the leader when the leader region server fails."
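The quorum arrangement the engineers describe can be pictured with a toy model in Python. This is a minimal sketch under stated assumptions, not HydraBase's implementation: the class names (`Mode`, `Member`, `RegionQuorum`), the majority rule, and the "first live member becomes leader" election policy are all hypothetical, since the post quoted here does not spell out the consensus protocol.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    ACTIVE = "active"    # writes to HDFS, performs flushes and compactions
    WITNESS = "witness"  # only takes part in replicating the WAL


@dataclass
class Member:
    name: str
    mode: Mode
    alive: bool = True
    wal: List[str] = field(default_factory=list)  # replicated write-ahead log


@dataclass
class RegionQuorum:
    """Toy model of the set of region servers hosting one region."""
    members: List[Member]
    leader: Optional[Member] = None

    def __post_init__(self) -> None:
        self.elect_leader()

    def live_members(self) -> List[Member]:
        return [m for m in self.members if m.alive]

    def has_majority(self) -> bool:
        # Assumption: the quorum needs a strict majority of members alive.
        return len(self.live_members()) > len(self.members) // 2

    def elect_leader(self) -> None:
        # Hypothetical policy: the first live member in list order leads.
        live = self.live_members()
        self.leader = live[0] if self.has_majority() else None

    def write(self, edit: str) -> bool:
        """Replicate an edit to every live member's WAL via the leader."""
        if self.leader is None or not self.has_majority():
            return False
        for member in self.live_members():
            member.wal.append(edit)
        return True

    def fail(self, name: str) -> None:
        """Mark a member as failed and pick a new leader if needed."""
        for member in self.members:
            if member.name == name:
                member.alive = False
        if self.leader is not None and not self.leader.alive:
            self.elect_leader()


# Usage: three members, possibly in different racks or data centres.
rs1 = Member("rs1", Mode.ACTIVE)
rs2 = Member("rs2", Mode.WITNESS)
rs3 = Member("rs3", Mode.ACTIVE)
quorum = RegionQuorum([rs1, rs2, rs3])

first_leader = quorum.leader.name      # rs1 leads initially
quorum.write("put row1")
quorum.fail("rs1")                     # the leader's region server crashes
second_leader = quorum.leader.name     # the witness rs2 takes over
accepted = quorum.write("put row2")    # still accepted: 2 of 3 are alive
```

Because the witness already holds a copy of the WAL, it can take over as leader without waiting for regions to migrate, which is the delay the current failover process incurs. The model leaves the new leader's mode unchanged, because the quoted post does not say whether a witness switches to active mode on taking over.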
