IBM on Monday said its machine learning system, dubbed SystemML, has been accepted as an open source project by the Apache Incubator.
The Apache Incubator is the entry path for projects that want to become part of The Apache Software Foundation. The general idea behind the incubator is to ensure that code donations adhere to Apache's legal guidelines and that communities follow its guiding principles.
In June, IBM said it would donate SystemML as an open source project.
What's notable about IBM's SystemML milestone is that open sourcing machine learning systems is becoming a trend.
For enterprises, the upshot is that there will be a bevy of open source machine learning code bases to consider. Google TensorFlow and Facebook Torch are tools to train neural networks. SystemML is aimed at broadening the ecosystem to business use.
Why are tech giants going open source with their machine learning tools? The machine learning platform that gets the most data will learn faster and then become more powerful. That cycle will just result in more data to ingest. IBM is looking to work the enterprise angle on machine learning. Microsoft may be another entry on the enterprise side, but may not go the Apache route.
In addition, there are precedents for how open sourcing big analytics ideas can pay off. MapReduce and Hadoop started as open source projects and would be cousins of whatever Apache machine learning system wins out.
IBM's SystemML, which is now Apache SystemML, is used to create industry-specific machine learning algorithms for enterprise data analysis. IBM created SystemML so it could write one codebase that could apply to multiple industries and platforms. If SystemML can scale, IBM's Apache move could provide a gateway to its other analytics wares.
The Apache SystemML project has included more than 320 patches covering everything from APIs to data ingestion and documentation, more than 90 contributions to Apache Spark, and 15 additional organizations adding to the SystemML engine.
Here's the full definition of the Apache SystemML project:
SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations, to distributed computations on Apache Hadoop and Apache Spark. ML algorithms are expressed in an R-like or Python-like syntax that includes linear algebra primitives, statistical functions, and ML-specific constructs. This high-level language significantly increases the productivity of data scientists as it provides (1) full flexibility in expressing custom analytics, and (2) data independence from the underlying input formats and physical data representations. Automatic optimization according to data characteristics such as distribution on the disk file system and sparsity, as well as processing characteristics in the distributed environment such as number of nodes, CPU, and memory per node, ensures both efficiency and scalability.
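To make the idea of a "hybrid runtime plan" a bit more concrete, here is a toy Python sketch of the kind of decision the definition describes: choosing between single-node, in-memory execution and distributed execution based on data size and sparsity. This is not SystemML's optimizer or API. The names `MatrixStats`, `estimate_bytes` and `choose_plan`, the byte-per-cell figures, and the reading of sparsity as the fraction of non-zero cells are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MatrixStats:
    """Hypothetical summary of an input matrix (not a SystemML type)."""
    rows: int
    cols: int
    sparsity: float  # assumed here to mean the fraction of non-zero cells, 0..1


def estimate_bytes(m: MatrixStats) -> int:
    """Rough in-memory footprint, using assumed per-cell costs.

    Dense storage: 8 bytes per cell.
    Sparse storage: about 12 bytes per non-zero cell (value plus index).
    The smaller of the two estimates is returned.
    """
    cells = m.rows * m.cols
    dense = cells * 8
    sparse = int(cells * m.sparsity * 12)
    return min(dense, sparse)


def choose_plan(m: MatrixStats, memory_budget_bytes: int) -> str:
    """Pick single-node execution if the data fits the budget, else distributed."""
    if estimate_bytes(m) <= memory_budget_bytes:
        return "single-node"
    return "distributed"


ONE_GIB = 1024 ** 3

small_dense = MatrixStats(rows=1_000, cols=100, sparsity=1.0)
big_dense = MatrixStats(rows=10_000_000, cols=1_000, sparsity=1.0)
big_sparse = MatrixStats(rows=10_000_000, cols=1_000, sparsity=0.001)

for name, stats in [("small_dense", small_dense),
                    ("big_dense", big_dense),
                    ("big_sparse", big_sparse)]:
    print(name, "->", choose_plan(stats, ONE_GIB))
```

In this sketch, the same 10 million by 1,000 matrix lands on a different plan depending only on how sparse it is. That is the general point of optimizing "according to data characteristics": the data scientist writes the algorithm once and the system decides where it runs.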