Three Big Data themes at Strata/Hadoop World NY

Big Data's fall season is upon us, and Strata+Hadoop World NYC is its coming-out party. There will be a multitude of announcements, but more than likely a manageably sized set of key themes. Here are a few to consider.

The data industry is emerging from its slower summer pace and getting ready to hit the ground running, with new products, releases and initiatives for fall. Strata+Hadoop World in New York City, which takes place next week, serves as the launch party for the industry's new fall season.

The amount of news to process from the event will be large, but a great many of the announcements will fall into a few buckets. Thinking about those categories now will help you process the news next week, so let's take a look at some likely themes.

Spark vs. Hadoop
Spark has been a big theme at many industry events this year and last, and there's not much point in highlighting it as a theme again. This year, the hot question will be whether Spark's future is one where it will be predominantly deployed running on Hadoop, or where it will run in a more standalone mode.

As I wrote recently, Cloudera sees Spark and Hadoop as inseparable partners, and it's hard at work making that partnership stronger, technologically. Databricks, the company whose founders created Spark, sees the technology's independent identity as the more important one. In fact, Databricks yesterday released results of a survey it conducted that the company says show the number of standalone Spark clusters has exceeded the number of Hadoop-based clusters running Spark.

Of course, Databricks Cloud, the company's hosted Spark offering, itself runs independently of Hadoop, so validating that architecture is in the company's interest. And for users just starting out with Spark, the standalone configuration is likely easier to get working, so it's not surprising that, in the technology's early market stages, the number of observed standalone clusters is relatively high.

Streaming Data
While Big Data may be past its hype cycle, IoT (Internet of Things) most certainly is not. And unless you're interested in the rigors of building sensors, or writing the code to read them, IoT is really all about streaming Big Data processing. Even if IoT isn't a concern, doing analytics in real time is. The so-called Lambda Architecture, which seeks to blend streaming/real-time and batch data processing into a single query and analytics environment, is gaining momentum.

That's all well and good, but the landscape of open source streaming data platforms (including Storm, Spark Streaming and Flink), as well as the message processing substrate beneath them (including Kafka, RabbitMQ and a host of proprietary on-premises and cloud-based solutions), is bewildering.

Even when the platform landscape shakes out, there is still the matter of making these solutions easier to use. We're a long way from that, and that seems the next likely region of white space the industry can fill.

Data Governance
The hills are alive with the sound of...governance? It's true. Data lineage, fine-grained security, data quality management, metadata management, and the ability to audit administration of these features as well as general query and data manipulation activity, have become a priority for vendors, and for buyers.

Chalk this one up to the gold rush for Enterprise sales and adoption, then look back and realize how ludicrous it is that this wasn't already a priority, and you'll start to get the urgency here. Big companies have big regulatory obligations around their data, and governance features need to be there to help assure compliance.

This theme is so important that some vendors may focus on governance, even if they don't have major new capabilities to announce. Why? Because the vendors want their customers to know they're thinking about it. Empathy is important.

What else?
I hate going to Javits Center. It's all the way on the west end of Midtown...what was once a very gritty part of Manhattan. It was always hard to get to, too, especially at peak times, when taxis were scarce. But if you hopped on the M34 bus, which made its last stop at Javits, you could make it work. Thankfully, though, New York City Transit finally opened a new subway station, on the #7 line, right at Javits. Wait long enough, and things get simpler.

Similarly, the Big Data world is slowly shifting its own focus from the modest goal of making certain analyses possible to the more audacious goal of making them easy (or, at least, easier). That overall ethos underlies all three of the themes above. Spark's micro-batch architecture makes processing data easier than does MapReduce's batch architecture. Bringing streaming data into the mainstream makes real-time analytics easier. Governance makes compliance easier.

Ultimately, data analysis makes business easier, even fun. Unfortunately, that's not a word I'd ever apply to going to Javits.