The Strata Data confab in New York is perhaps the annual Big Event for Big Data. Not only does Strata have NYC all to itself (DataWorks has San Jose, CA as its sole US location) but it happens in late September, just after the lazy haze of summer has worn off and right before the corporate budget process kicks in at a number of enterprise companies.
The show just ended, and it was a biggie. This year, Strata had the Javits Convention Center all to itself, rather than playing alongside another event. That made for crowds that were bigger but also less dense, sometimes giving off a false sense of lowered vitality. Perhaps the same could be said for Hadoop itself.
Industry to elephant: Go jump in a lake
I say that because we seem to be in a new "don't ask, don't tell" environment for Hadoop, the first industry trend I want to acknowledge, even if it's a trend around lack of acknowledgement. Consider that Strata itself has dropped the "+ Hadoop World" submoniker and that vendors, almost unanimously, have ceased making explicit reference to Hadoop, as well.
The term that has replaced the "H word" is "data lake." But make no mistake, that term is a mere euphemism for a collection of data stored in HDFS (the Hadoop Distributed File System) or in cloud object storage like AWS (Amazon Web Services) Simple Storage Service (S3), Azure Blob Storage and Azure Data Lake Store. And since all three of those can be made to emulate HDFS or function in an equivalent manner to it, they are not really the exception they may appear to be.
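To make the point concrete: from a Hadoop client's perspective, these stores are typically addressed through filesystem connectors that differ mainly in their URI scheme. The following is a minimal sketch, not anything shown at Strata. The schemes are the ones commonly used by Hadoop's filesystem connectors, while the host, bucket, container and account names are hypothetical placeholders.

```python
from urllib.parse import urlparse

# URI schemes commonly used by Hadoop filesystem connectors.
SCHEME_TO_STORE = {
    "hdfs": "HDFS",
    "s3a": "Amazon S3",
    "wasb": "Azure Blob Storage",
    "wasbs": "Azure Blob Storage",
    "adl": "Azure Data Lake Store",
}

def backing_store(path):
    """Return the storage system implied by a path's URI scheme."""
    scheme = urlparse(path).scheme
    return SCHEME_TO_STORE.get(scheme, "unknown")

# Hypothetical locations: the same logical "lake" directory on four back ends.
lake_paths = [
    "hdfs://namenode.example.com:8020/lake/events",
    "s3a://example-bucket/lake/events",
    "wasb://container@exampleaccount.blob.core.windows.net/lake/events",
    "adl://exampleaccount.azuredatalakestore.net/lake/events",
]

stores = [backing_store(p) for p in lake_paths]
for path, store in zip(lake_paths, stores):
    print(f"{store:22} {path}")
```

The directory layout after the scheme and authority is identical in every case, which is why the "data lake" label can paper over the choice of back end.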
In this, well, closeted environment, what have the Hadoop distribution vendors brought to market at the show? Cloudera which, along with O'Reilly, is the event's host, introduced its Shared Data Experience (SDX) product, to aid customers in managing multiple (Hadoop) clusters, whether they be persistent ones on-premises or ephemeral ones in the cloud. Essentially, Cloudera has factored HCatalog, Apache Sentry, and Cloudera Navigator out of the cluster infrastructure and into a single infrastructure set that can be shared across clusters. Cloudera has also added a solution that performs backup and disaster recovery from on-premises clusters to S3.
Hortonworks introduced its DataPlane Service (DPS) that, while not identical to Cloudera's SDX, is nonetheless also focused on federation and management of multi-cluster data environments, the importance of which constitutes trend #2 for the show.
MapR had an announcement of its own: a new version of MapR-DB, its HBase-compatible database. This new version adds native secondary indexes; OJAI (Open JSON Application Interface) 2.0 APIs; optimized integration of Apache Drill; native Spark and Hive connectivity; and a new global change data capture (CDC) facility.
Big vendors, big announcements
Strata isn't just about (organizations formerly known as) Hadoop vendors; it's about mega-vendors too. Two of them had their own conferences going on this week: SAP had its TechEd event in Las Vegas, and Microsoft had its Ignite event (which used to be called TechEd) in Orlando. Both vendors were also at Strata in full force, and both had data-related announcements.
SAP introduced its Data Hub product early in the week, which I covered. It also announced updates across its analytics portfolio running in the cloud and on hybrid deployments, including Lumira 2.0.
Microsoft announced the 2017 release of its SQL Server relational database, which I covered as well. It also announced a massively revamped Azure Machine Learning service, which now consists of Azure Machine Learning Experimentation Service and Azure Machine Learning Model Management Service. There's tooling too, including a cross-platform desktop application called Azure Machine Learning Workbench, which integrates with Jupyter notebooks and, via an extension, with Microsoft's cross-platform programming editor, Visual Studio Code.
One tidbit about SQL Server 2017 that wasn't in my full-length review: licensees of the product also get rights to run the standalone Microsoft Machine Learning Server (formerly known as R Server), and not just the embedded SQL Server Machine Learning Services. Think Microsoft is obsessed with the mainstreaming of machine learning and AI? They're not alone. In fact, don't look now but that's trend #3.
Speaking of Microsoft, a number of vendors at Strata announced new products and offerings for its Azure cloud. Cloudera itself announced the Beta of a new Altus Data Engineering offering on Azure; WANdisco Fusion now integrates with HDInsight; and Scality Connect for Microsoft Azure Blob Storage, which creates an Amazon S3 compatibility layer, was announced as well.
While I am tempted to call increased support for Azure a trend in and of itself, the fact remains that AWS is still the first cloud that rolls off people's tongues. AWS got some special love of its own at Strata, with DatafactZ and Splice Machine each having announced offerings on AWS.
Splice Machine is a relational database that sits on top of both Apache HBase and Apache Spark (using one engine or the other, in a query-dependent fashion). Splice Machine now offers an external tables facility, allowing data to be left in delimited, Avro, Parquet and ORC file formats on HDFS but still be treated as a SQL-queryable table in the database catalog. External tables are treated in a first class manner, with generated statistics that are used by Splice Machine's cost-based query optimizer. Splice's new AWS marketplace offering means you can spin up Splice Machine, get billed by Amazon for your usage and not have to worry about setting up a discrete Hadoop cluster of your own.
Trend #4 is the phenomenon of partnering among data companies. Not that partnering is new, of course, but the numbers are going up. Maybe these companies are tiring of each building their own version of everything.
NVIDIA announced partnerships with Kinetica and with H2O.ai, both of which center, of course, on NVIDIA's GPU products and platforms. Kinetica is also partnering with InterWorks around accelerating query performance in Tableau by using Kinetica's GPU-based database, running on NVIDIA's DGX Station desktop product, as an alternate data repository. There's a server story too, with Dell EMC, Hewlett Packard Enterprise, IBM and Supermicro announcing NVIDIA Volta-based server products.
Anaconda and Microsoft have partnered, making Anaconda the official Python distribution for Visual Studio, SQL Server and Azure Machine Learning. And since the Anaconda distro actually includes R as well, Anaconda will use Microsoft R Open for this included offering. This partnership would seem to be a de jure codification of a de facto partnership that already existed. But it's still nice to see it effectuated.
Data Catalog vendor Alation announced a partnership with data preparation contender Paxata. With this partnership in place, data can enter the catalog in Alation as soon as it's ingested in Paxata, eliminating latency from prep to publicized availability.
MapR and C3 IoT are partnering too. And a group of companies, including the aforementioned Anaconda and H2O.ai, has aligned into the GPU Open Analytics Initiative (GOAI), which is right now focused on development of a common in-memory Data Frame format for all GPU-based applications. In fact, that Data Frame format is the same one developed by the Apache Arrow team for representation of in-memory columnar data. In both cases, the common format is meant to avoid data movement and format conversion such that one compliant application can immediately pick up data created or processed in another.
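The idea of "no data movement, no format conversion" can be illustrated with a toy example. What follows is a rough sketch using only Python's standard library. It is emphatically not the Arrow or GOAI format, and the column names and helper functions are hypothetical. It only shows the underlying principle: when two pieces of code agree on a typed, contiguous, column-oriented layout, the second one can read the first one's data in place.

```python
from array import array

# Toy "data frame": each column is one contiguous, typed buffer (columnar layout).
# This is NOT the Arrow format, only an illustration of the sharing idea.
columns = {
    "sensor_id": array("q", [1, 2, 3, 4]),        # 64-bit signed integers
    "reading": array("d", [0.5, 1.5, 2.5, 3.5]),  # 64-bit floats
}

def producer_view(name):
    """Hand out a zero-copy view of a column's underlying buffer."""
    return memoryview(columns[name])

def consumer_mean(view):
    """A second 'application' reads the shared buffer without converting it."""
    return sum(view) / len(view)

readings = producer_view("reading")
mean_before = consumer_mean(readings)

# The producer updates its column in place. The consumer's view sees the change
# because both sides refer to the same memory, not to separate copies.
columns["reading"][0] = 4.5
mean_after = consumer_mean(readings)

print(mean_before, mean_after)
```

A real shared format also has to pin down details this sketch ignores, such as null handling, nested types and, in GOAI's case, GPU memory. The payoff is the same, though: agree on the layout once and the serialize-and-copy step between tools goes away.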
New releases galore
In addition to new products announced by SAP, Microsoft and the distro vendors, a number of the smaller vendors had new releases of their own:
- Actian announced that version 5 of its Vector in Hadoop (VectorH) product will GA in October. VectorH is a vector processing database that uses HDFS as its storage medium. And, much like Splice Machine, VectorH 5 will add an external table facility that allows any data readable by Spark, including Hive tables and flat files (the latter on a read-only basis), to be treated as queryable tables.
- Informatica announced release 10.2 of its Intelligent Data Platform, which includes EU General Data Protection Regulation (GDPR) compliance-related features. It's also compatible with Apache Atlas and Cloudera Altus (mind those near anagrams).
- Attunity announced version 6 of its Modern Data Integration Platform, which includes Attunity Replicate 6.0 and new versions of Attunity Compose for Hive and Attunity Enterprise Manager (AEM).
- Dataguise announced DgSecure 6.2, with multiple-language GDPR support (and, yes, GDPR is trend #5), as well as support for Google BigQuery, Spark-enabled Hadoop clusters (say what now?) and Apache Tez.
- Zaloni announced its Data Master product. Flip that name around and you get what the product is about: Master Data Management, and specifically the data-matching aspect of it, allowing customers to create the "gold copy" of records for master data like customers, storage locations, product categories, part numbers and so forth.
- Iguazio, which made its debut at last year's Strata, announced the General Availability of its Unified Data Platform.
- Even Mr. Pitney and Mr. Bowes are getting in on the action! Pitney Bowes, given its traditional focus on postage, is understandably specialized in address validation and cleansing. As such, it announced the addition of a new Big Data module to its Spectrum solution, which now includes provision of such services natively within Apache Hadoop and Spark.
Come out Hadoop, wherever you are
Well, at least one company still has the guts to call out Hadoop. I'm OK if it's the guys known for postage meters who are embracing the open source Big Data project. They're used to getting real work done with behind-the-scenes technology that enables their business.
The fact is that the less you hear about Hadoop, the more popular it's likely becoming. As a back-end workhorse for huge volumes of data, it's the way to go. And the less we hear about it, the better -- because Hadoop is the thing people shouldn't have to think about. They should just trust it and then forget it's there. One day, perhaps soon, that will be trend #1.
This post was updated at 2:35PM ET to clarify that the GOAI and Apache Arrow Data Frame formats are in fact the same, and not just analogous.