IBM Machine Learning brings Spark to the mainframe

Yes, Big Iron can do Big Data and Machine Learning, even while it keeps chugging away at its appointed transactional tasks. In fact, putting the two together makes all kinds of sense.
Written by Andrew Brust, Contributor

New York City's historical buildings, businesses and sensibilities - some of them decades old - have been under siege for the past few years. One of the latest victims is the historic Waldorf Astoria hotel, which is closing in less than a week for renovations, possibly not to reopen for three years. Reportedly, 300-500 hotel rooms will remain, but the vast majority of the property will be converted to luxury condos. Wouldn't it be better to onboard the new condo "workload," without stripping away so much of the Waldorf's legacy hotel functionality?

There's a data and analytics angle here, I promise. Just last week, in the waning days before the Waldorf's closure, IBM held an event there and made an announcement that, ironically, proves new and old workloads can indeed coexist. My ZDNet colleague Tony Baer (a fellow bemoaner of lost NYC treasures) was there, and he gave me the deets. With his permission, I got to write this up and provide my own analysis.

Just as lots of people have enjoyed staying at the Waldorf Astoria as hotel guests in modern times, many businesses still run their mission-critical transactional workloads on mainframe computers. The risk and business operational disruption that would result from moving these systems is too great for most companies that have them. But as new workloads become equally germane, what's a mainframe vendor to do? IBM's solution: announce support for machine learning on Z-series mainframes.

Spark goes big iron
This move makes sense, especially for a company like IBM, which still derives significant revenue from sales and maintenance of mainframe computers. But it makes sense more generally as well: if so much transactional processing is still happening on mainframes, then building predictive models on their data is imperative for any digitalized, or digitalizing, business. And while exporting the data to more modern systems, in order to do feature engineering as well as model building, testing and scoring, may seem logical, think again: data movement is costly, time-consuming, and may transgress the data security policies that are in place.

Also read: IBM launches Apache Spark cloud service

IBM's solution, then, is a hybrid approach. First, build a Linux cluster to handle data ingestion from external sources, transformation, pipelining, and the serving of Jupyter notebooks. Second, add IBM Machine Learning, a mainframe-based, fit-for-purpose, federated machine learning platform that keeps the data in place. It uses the mainframe's zIIP (System z Integrated Information Processor), which was designed to run BI and analytics workloads on the mainframe without incurring MIPS charges.

All execution is dispatched to the mainframe, bringing the processing to the data rather than the other way around. To do this, IBM has essentially ported Apache Spark 1.6 to its Z-series platform, including Spark MLlib, Spark SQL, Spark Streaming and GraphX. IBM will also include a curated set of machine learning libraries that it has developed and, in the future, other models and frameworks from the open source community, such as TensorFlow.
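To make the idea concrete, here is a minimal sketch of what model building against in-place mainframe data might look like in one of those Scala notebooks, using the Spark 1.6-era APIs the port covers. The table name, column names, and DB2-for-z JDBC URL are all hypothetical illustrations, not details from IBM's announcement:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Spark 1.6-era entry points (this predates SparkSession)
val conf = new SparkConf().setAppName("MainframeFraudModel")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Hypothetical: a DataFrame over transactional data that never
// leaves the mainframe. The connection details are illustrative.
val transactions = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:db2://zhost:446/TXNDB")
  .option("dbtable", "CARD_TRANSACTIONS")
  .load()

// Assemble numeric columns into the single vector column MLlib expects
val assembler = new VectorAssembler()
  .setInputCols(Array("amount", "merchantRisk", "txnsPerHour"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("isFraud")
  .setFeaturesCol("features")

// Fitting the pipeline dispatches the work to Spark on the mainframe,
// so the training data stays where the transactions live
val model = new Pipeline().setStages(Array(assembler, lr)).fit(transactions)
```

The point of the sketch is the last line: training happens where the data already sits, which is exactly the data-movement cost the hybrid architecture is designed to avoid.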

Petabytes, schmetabytes
Tony did express to me some concerns that data volumes on the mainframe are more gigabyte-scale than terabyte- or petabyte-scale, and that this could lead to insufficient training data with which to generate really accurate models. But in my usual zeal to contradict him, I'm not so concerned about that. After all, ML technology has been around for decades, albeit under the "data mining" moniker, and was originally designed for smaller data volumes.

True, today's models are sometimes being fed by high-volume, real-time behavioral or event-driven data, often delivered by IoT devices. These models tend to have very high accuracy as a result, and today's streaming data technology can make it all work. But the whole inspiration for mainframe ML is to build models on transactional data. And transactions -- by their very nature -- are discrete events of lower volume, fed by underlying behavioral data. Customers need the models built on the transaction-level data, so IBM might as well make mainframe ML a reality. Besides which, the modeling, testing and scoring will be computationally less taxing with more granular data, and since these computations are taking place on the shared mainframe, shorter, less complex jobs are likely a blessing.

Fit and finish; working to strengths
There are still a few pieces to come, most notably data transformation functionality, crucial to dealing with the mainframe's rarefied data layouts, and Jupyter notebook support for languages other than Scala - for example, R and Python. The data transformation capabilities will be provided by Rocket Software, which seems a better answer than hiring a small team of IBM Global Services consultants to hand-code the work. As for further notebook language support, it's a good bet that at least one language beyond Scala will come soon.

Yes, old and new can coexist. And for authentic relevance, it's best that they do. For vendors like IBM, whose market presence spans multiple generations of technology, these sorts of mashups seem spot on. If Microsoft can put R in SQL Server, then IBM can, and should, put Spark on the mainframe. It's a matter of playing to strengths.

Now if we could just convince the new owners of the Waldorf to think that way...
