Among the announcements that came from Amazon's re:Invent conference last week, there was news of adding HBase to Amazon's Elastic MapReduce (EMR) offering based on S3 storage. Besides making EMR a more feature-complete Hadoop, it allows HBase to take advantage of the more economical storage that S3 offers. You can size your EMR cluster for compute instead of data requirements, as you can avoid the need for the customary 3x replication in HDFS.
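The sizing argument comes down to simple arithmetic, sketched below. The 3x replication factor is the customary HDFS figure mentioned above; the data volume and per-node disk capacity are hypothetical numbers chosen only for illustration, and the sketch ignores the extra headroom a real HDFS cluster would need for temporary and intermediate data.

```python
import math


def hdfs_raw_tb(logical_tb: float, replication: int = 3) -> float:
    """Raw disk (TB) needed to hold `logical_tb` of data under HDFS replication."""
    return logical_tb * replication


def nodes_for_storage(raw_tb: float, disk_tb_per_node: float) -> int:
    """Minimum node count needed just to hold `raw_tb` of raw data."""
    return math.ceil(raw_tb / disk_tb_per_node)


if __name__ == "__main__":
    logical_tb = 100.0        # hypothetical data set size
    disk_tb_per_node = 12.0   # hypothetical usable disk per node

    raw_tb = hdfs_raw_tb(logical_tb)
    storage_nodes = nodes_for_storage(raw_tb, disk_tb_per_node)

    print(f"HDFS raw capacity needed: {raw_tb:.0f} TB")
    print(f"Nodes required for storage alone: {storage_nodes}")
    # With the data held in S3, this storage-driven floor on cluster size
    # no longer applies; node count can follow compute demand instead.
```

In an HDFS-backed cluster the node count can never drop below the storage-driven floor, however light the workload; with S3 as the store, that floor disappears.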
The HBase port, co-developed by FINRA and Amazon, grew out of a FINRA strategy that rethought not only how to optimize workloads but also how to optimize IT processes, so that the organization could take full advantage of the cloud.
FINRA began thinking about migrating to the cloud when it was asked in 2013 to bid on the new SEC Consolidated Audit Trail project; while the project award has yet to be made, FINRA is currently one of three finalists. FINRA decided to base its proposal on cloud deployment because the project, which would create a single database for tracking all capital markets trading activity, would be receiving data from multiple sources.
That in turn planted the seed for FINRA to consider migrating its existing analytics infrastructure, which originally consisted of a mix of Netezza, Greenplum, SAS, and Cloudera Hadoop, to a data lake in the Amazon Web Services cloud.
FINRA realized that simply "lifting and shifting" infrastructure and workloads to the cloud would yield only limited cost and deployment speed benefits compared to running on premises. Taking full advantage of the cloud meant rethinking how applications and systems work, and more importantly, how they should be redesigned when running in an elastic environment separating storage from compute. And there was the people and process side of that, too.
The cloud provided FINRA a number of options. It could move ongoing workloads to clusters reserved for extended contract periods at a discount. It could also take advantage of the cloud's elasticity to run more ad hoc workloads, especially for exploratory analytics, which might not have made the cut on premises, where limited capacity, scarce IT resources, or backlogs were showstoppers. And if the workload was deemed highly discretionary, or cost was a real constraint, there was always Amazon's spot market.
That didn't necessarily take IT off the hook. Relieved of the need to provision "what-if" capacity, the systems team could shift roles from gatekeeper to facilitator, while infrastructure planning shifted toward selecting from a large menu -- both changes being harder to accomplish than they sound.
The flexibility and multiplicity of options in the cloud were a double-edged sword: they provided the opportunity to tune compute and storage for the data and the workload or application, but given the thousands of permutations available on AWS, it took architects with specialized cloud knowledge to master the optimal combinations.
Furthermore, commandeering cloud compute still required DBAs to automate database builds. It also demanded a shift toward DevOps, both in mindset and organization. FINRA consolidated systems engineering and operations into unified teams on the assumption that in the cloud environment, both would have to work side by side in concurrent, not sequential, engineering mode.
When it came to rethinking architecture and topology, one decision was fairly straightforward: the active part of the data lake would be stored on economical S3 object storage so it could be accessible to a wide variety of processes, from interactive SQL to complex batch, and the colder data would be shunted to the Glacier archive. But integrating data sources and choosing the right targets remains the same reengineering challenge that it is on premises -- you may or may not be looking at a data platform migration. Amazon's choices are anything but limited there.
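One common way to express that kind of hot/cold split is an S3 lifecycle rule that transitions objects to Glacier after a set age. The sketch below only builds the rule as a Python dictionary in the shape the S3 lifecycle API accepts; the prefix, the 90-day threshold, and the rule ID are hypothetical, and nothing here is drawn from FINRA's actual setup.

```python
import json


def glacier_lifecycle(prefix: str, days: int) -> dict:
    """Build an S3 lifecycle configuration that moves objects under
    `prefix` to the GLACIER storage class once they are `days` old."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return {
        "Rules": [
            {
                "ID": "archive-cold-data",        # hypothetical rule name
                "Filter": {"Prefix": prefix},     # which objects the rule covers
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical values: archive anything under "cold/" after 90 days.
    config = glacier_lifecycle("cold/", 90)
    print(json.dumps(config, indent=2))
```

A dictionary of this shape could then be applied to a bucket, for example through boto3's `put_bucket_lifecycle_configuration` call; that step is left out so the sketch runs without AWS credentials.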
The data platform rationalization step is what led FINRA to go where Amazon had not gone before: getting HBase to run on S3. The EMR platform (which runs on S3) works well enough for Hive, Spark, and Presto batch and interactive workloads, but it didn't directly support HBase. FINRA was increasingly relying on HBase to replace its former petabyte-scale Greenplum data warehouse. If it took a "lift and shift" approach to EMR, it would still have to run HBase on separate instances of HDFS storage in EC2. And that in turn would involve the operational complexity and cost of replicating data from S3, and then storing it on a more expensive target.
Getting HBase to run directly on S3 would avoid all those issues. As a strategic customer with a project that was strategic to both parties, FINRA got Amazon's support to do the port. While HBase runs slightly slower on S3 than on HDFS, the end results, a 400x improvement over running the same queries on Greenplum and a streamlining of the data, more than offset the minor performance difference.
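The article does not describe how the resulting feature is configured. As a rough sketch, EMR exposes HBase-on-S3 through cluster configuration classifications that set the storage mode and point the HBase root directory at an S3 location. The bucket and path below are hypothetical, and the property names reflect EMR's configuration conventions as commonly documented rather than anything stated in the source.

```python
import json


def hbase_on_s3_configurations(root_dir: str) -> list:
    """Build EMR configuration classifications that tell HBase to keep
    its root directory in S3 rather than in cluster-local HDFS."""
    if not root_dir.startswith("s3://"):
        raise ValueError("root_dir must be an s3:// URI")
    return [
        {
            "Classification": "hbase",
            "Properties": {"hbase.emr.storageMode": "s3"},
        },
        {
            "Classification": "hbase-site",
            "Properties": {"hbase.rootdir": root_dir},
        },
    ]


if __name__ == "__main__":
    # Hypothetical bucket and path, for illustration only.
    configs = hbase_on_s3_configurations("s3://example-bucket/hbase-root")
    print(json.dumps(configs, indent=2))
```

A list like this would typically be supplied as the cluster's configurations when it is created. Because the table data then lives in S3, there is no separate HDFS copy of it to replicate or pay for.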
The bottom line? With HBase running on S3 under EMR, rather than in a separate EC2 Hadoop instance, FINRA saved $1 million annually.