The Odd Couple: Hadoop and Data Security

Cloudera has launched its Sentry security module for Hive and Impala. But is a more comprehensive security solution for Hadoop possible?
Written by Andrew Brust, Contributor

This guest post comes courtesy of Tony Baer’s OnStrategies blog. Tony is a principal analyst, covering Big Data, at Ovum.
By Tony Baer

Until now, security wasn’t a term that would normally be associated with Hadoop. Hadoop was originally conceived for a trusted environment, where the user base was confined to a small elite set of advanced programmers or statisticians and the data was largely weblogs. But as Hadoop crosses over to the enterprise, the concern is that many enterprises will want to store pretty sensitive data.

So what stood for Hadoop security up till now was using Kerberos to authenticate users for firing up remote clusters if the local ones were already used; accessing specific MapReduce tasks; applying coarse-grained access control to HDFS files or directories; or providing authorization to run Hive metadata tasks.

Security is a headache for even the most established enterprise systems. But getting beyond concern about Trojan Horses and other incursions, there is a laundry list of capabilities that are expected with any enterprise data store. There are the "three As" of Authorization, Access, and Authentication; there is the concern over the sanctity of the data; and there may be requirements for restricting access based on workload type and capacity utilization concerns. Clearly there are many missing pieces for Hadoop waiting to fall into place.

For starters, Hadoop is not a monolithic system, but a collection of modules that correspond to open source projects or vendor proprietary extensions. Until now the only single mechanism has been Kerberos tokens for authenticating users to different Hadoop components, but only standalone mechanisms for admitting named users (e.g. to Hive metadata stores). You could restrict user access at the file system or directory level, but for the most part such measures are only useful for preventing accidental erasures. And when it comes to monitoring the system (e.g., JobTracker, TaskTracker), you can do so from insecure Web clients.

There are more missing pieces concerning data, as nothing was built into the Apache project. There was no standard way for encrypting data, and neither was there any way for regulating who can have what kinds of privileges with which sets of data. Obviously, that matters when you transition from low level Web log data to handling names, account numbers, account balances or other personal data. And by the way, even low level machine data such as Web logs or logs from devices that detect location grow sensitive when associated with people’s identities.

There’s a mix of activity on the open source and vendor proprietary sides for addressing the void. There are some projects at incubation stage within Apache, or awaiting Apache approval, for providing LDAP/Active Directory linked gateways (Knox), data lifecycle policies (Falcon), and APIs for processor-based encryption (Rhino). There’s also an NSA-related project for adding fine-grained data security (Accumulo) based on Google BigTable constructs. And Hive Server 2 will add the LDAP/AD integration that’s current missing.

For the near term, not surprisingly, vendors are working to fill the void. Besides employing file system or directory-level permissions, you could physically isolate specific clusters, then superimpose perimeter security. Or you could use virtualization to similar effect, although that was not what it was engineered for. Zettaset Secure Data Warehouse includes role-based access control to augment Kerberos authentication, and applies a form of virtualization with its proprietary file system swap-out for HDFS; VMware’s Project Serengeti is intended to open source the API to virtualization.

Activity monitoring, what happens to data over its lifecycle – is supposed to be the domain of Apache Falcon, which is still in incubation. But vendors are beating the Apache project to the punch: IBM has extended its InfoSphere Guardium data lineage tool to monitor who is interacting with which data in HDFS, HBase, Hive, and MapReduce. Cloudera has recently introduced the Navigator module to Cloudera Manager to monitor data interaction, while Revelytix has introduced Loom, a tool for tracking the lineage of data feeding into HDFS. These offerings are the first steps — but not the last word — for establishing audit trails. Clearly, such capability will be essential as Hadoop gets utilized for data and applications subject to any regulatory compliance and/or enforcing any types of privacy protections over data.

Similarly, there are also moves active for masking or encrypting Hadoop data. Although there is nothing built into the Hadoop platform that prevents data masking or encryption, the sheer volume of data involved has tended to keep the task selective at best. IBM has extended InfoSphere Optim to provide selective data masking for Hadoop; Dataguise and Protegrity offer similar capabilities for encryption at the HDFS file level. Intel however holds the trump cards here with the recent Project Rhino initiative for open sourcing APIs for implementing hardware-based encryption, stealing a page from the SSL appliance industry (it forms the basis of Intel's new, hardware-optimized Hadoop distro).

It's in this context that Cloudera has introduced Sentry, a new open source project for providing role-based authorization for creating Hive Server 2 and Impala 1.1 tables. It’s a logical step for implementing the types of data protection measures for table creation and modification that are taken for granted in the SQL world, where privileges are differentiated by role for specific actions (e.g., select views and tables, insert to tables, transform schema at server or node level).

There are several common threads to Cloudera’s announcement. It breaks important new ground for adding features that gradually raise Hadoop security to the level of the SQL database world, yet also reveals what’s missing.

For instance, the initial release lacks any visual navigation of HDFS files, Hive or Impala metadata tables, and data structures; it also lacks the type of time-based authorization that is common practice with Kerberos tickets. Most importantly, it lacks direct integration with Active Directory or LDAP (it can be performed manually), and it stores permissions locally. Sentry is only version 1; much of this wish list is already on Cloudera’s roadmap.

Correction: Sentry does leverage Kerberos for time windowing of access privileges and utilizes Linux utilities for tying in with directories such as Active Directory; we have not seen how well these OS functions are integrated into the Sentry engine. While Sentry leverages visual file navigation via the Apache Hue project GUI, it currently lacks a visual interface for configuring group/role permissions. Currently a standalone utility, we expect more integration with related capabilities, such as monitoring data activity, will be on Cloudera’s roadmap.

All these are good starts, but the core issue is that Sentry, like all Hadoop security features, remain point solutions lacking any capability for integrating to common systems of record for identity and access management – or end user directories. Sure, for many of these offerings, you can perform the task manually and wind up with piecemeal integration.

Admittedly, it’s not that single sign-on is ubiquitous in the SQL world. You can take the umbrella approach of buying all your end user and data security tools from your database provider, but there remains an active market for best in breed – that’s popular for enterprises seeking to address points of pain. But at least most best of breed SQL data security tools incorporate at least some provision for integration with directories or broader-based identity and access management systems.

Incumbents such as IBM, Oracle, Microsoft, and SAP/Sybase will be (and in some cases already are) likely to extend their protection umbrellas to Hadoop. That places onus on third parties – both veteran and startups – on the Hadoop side to get their acts together and align behind some security gateway framework allowing centralized administration of authentication, access, authorization, and data protection measures. Incubating projects like Apache Knox offer some promise, but again, the balkanized nature of open source projects threatens to make this a best of breed challenge.

For Hadoop to go enterprise, the burden of integration must be taken off the backs of enterprise customers. For Hadoop security, can the open source meritocracy stop talking piecemeal projects before incumbents deliver their own faits accomplis: captive open source silos to their own security umbrellas?

Editorial standards