When Microsoft started out dipping its toes into the Hadoop waters, it worked with Hortonworks to port Hadoop to Windows and run it in the Azure cloud. But running Hortonworks Data Platform (HDP) for Windows meant HDInsight (as Hadoop on Azure was eventually branded) was always a step behind the more mainstream Linux distributions, and constantly playing catch-up. When Microsoft decided to offer HDInsight clusters running on Linux, everything changed. Support from across the industry materialized and the newest Hadoop features were added to the service in much faster timeframes.
Still, HDInsight has been due for a polishing, and today Microsoft is announcing just that. A new version of HDInsight, based on HDP 2.5 is launching today and, along with it, some Microsoft-specific security and application integrations that make HDInsight a contender for leading cloud Hadoop offering.
Spark in its eye
So what's inside? Apache Spark 2.0, to start. This version of Spark includes technology from Project Tungsten, giving Spark the power of vectorized computations. Along with the new version of Spark itself, HDInsight will now include support for Apache Zeppelin notebooks, which let developers build scrapbook-like compositions of code and data visualizations that run on Spark.
Also read: Spark comes to Azure HDInsight
HDInsight had already offered similar capabilities using Jupyter, another open source notebook technology. But it's nice to see HDInsight include both notebook technologies, in parity with most other Hadoop offerings. Another nice Spark-related addition is that of a Spark-HBase connector, allowing Spark SQL to be used -- from notebooks or elsewhere -- to query data in Apache HBase.
Hive moves into express lane
Using HDP 2.5 under the hood also means that Microsoft can ship Apache Hive's new LLAP ("Live Long And Process") mode, stemming from the "Stinger.Next" initiative around Hive. As I reported a year and a half ago, the technology combines Hive running on Apache Tez with caching, vectorization, and other optimizations to deliver what both Microsoft and Hortonworks claim are sub-second response times.
Also read: SQL and Hadoop: It's complicated
Also read: Hadoop Summit news: ecosystem order, and fragmentation
Microsoft says that overall, this new Hive implementation can improve performance over twenty five-fold over the previous Hive on Tez implementation it was shipping.
This new version of Azure HDInsight also includes integration with Azure Active Directory (which, in turn, can integrate with on-premises Active Directory installations), and the transparent ability to encrypt data at rest. The latter capability, when combined with the use of Azure Data Lake Store, allows encryption keys to be customer-managed using the Azure Key Vault service.
Platforming on HDP 2.5 also means that HDInsight will now include Apache Ranger (incubating), the technology that derived from Hortonworks' 2014 acquisition of XA Secure.
Also read: Hadoop security: Hortonworks buys XA Secure -- and plans to turn it open source
Ranger provides a fine-grained, role-based access control layer over Hadoop and its various distro components. Support for Ranger, as well as Apache Sentry, is becoming de rigueur in the Hadoop world, so Ranger's addition to HDInsight is probably a good thing for Microsoft, and its customers.
It's a third-party, party
Finally, Microsoft is launching a new option for integration of third party ISV (independent software vendor) applications with HDInsight. Called the Azure HDInsight Application Platform, it allows ISV apps to be provisioned along with the HDInsight cluster or easily added to existing clusters, and to have access to the cluster and its resources as it could in an on-premises installation.
(Full disclosure: my employer, Datameer, was the first ISV whose application was onboarded to the HDInsight Application Platform, and Microsoft's press release on today's HDInsight announcements includes a quote from my boss, Stefan Groschupf, Datameer's CEO.)
But Microsoft is also announcing that Cask and StreamSets are joining the Azure HDInsight ISV program, too. Datameer will be in good company: Cask offers an excellent, unified API for developers that spans the entire Hadoop stack, and Spark. And StreamSets, which I wrote about just a couple of weeks ago, offers a management platform for data flow processing machine-generated streaming data.
Also read: Can Big Data operations be manageable? Two companies say yes.
When can I open it?
The roll-out of HDP 2.5 and Spark 2.0 goes to general availability today. If you also would like to take advantage of LLAP mode in Hive, you'll need to provision a special HDInsight cluster type that is available in preview.
If you're an Azure customer, you probably want to get your hands on this. I know I do, as I'll be delivering a presentation on HDInsight at a conference next week. And isn't it just like a cloud platform to make your presentation obsolete just a week before you're set to deliver it?