Like agile software development, data science works best when models can be tested and iterated rapidly. The latest release from data science collaboration tool Dataiku adds integrations with GitHub and HipChat that will not only bring developers and data scientists together, but hopefully, some of their DevOps discipline as well.
Given the hype over data science, and now machine learning, it shouldn't be surprising that collaboration platforms are one of the fastest emerging segments for big data tooling. The guiding notion is that data scientists are relatively few and far between, and that their talents should be used as productively as possible. In our 2017 Ovum Trends to Watch forecast for big data, we identified that making data science a team sport would be a front burner issue.
There is great variety in the types of tooling that have emerged; at one extreme, there are ambitious offerings, such as IBM's Data Science Experience, that are all-encompassing. You can access notebooks, manage data science projects over their lifecycle, schedule compute runs, and track data lineage. There are other tools, such as Domino Data Lab, that take more of a life cycle management emphasis for data science projects.
And, among dozens of others in the field, there is Dataiku, which targets data scientist productivity.
Dataiku is releasing the 4.0 version of its data science workbench this week. A key theme for the new release is enabling teamwork and rapid iteration of model and algorithm development. Specifically it adds integrations with GitHub for software code commits and channels such as HipChat, which has grown popular with developers; and Slack, which has become a popular chat platform for business professionals.
Tapping into these channels is a realization that it's fine for data scientists to work with notebooks. But if they are to effectively collaborate with developers and data engineers who are responsible for deploying the models to production, or with line of business people who must understand what value the model holds for them, they need to reach them on the channels where they live.
It's not just a matter of getting everyone on the same page; it's also about velocity. Tapping into GitHub provides the link the software development lifecycle. For organizations steeped in agile practice, support of rapid deployment best practices such as DevOps makes rapid delivery real. Developers and operations collaborate on an ongoing basis to ensure that software that is iteratively changing gets deployed rapidly without destabilizing the underlying systems.
It makes sense to extend this notion to data science because the definition of "science" means continual experimentation to until the best hypothesis is identified. It makes even more sense when the goal of data science involves developing machine learning models, because learning is all about evolving in response to changing scenarios. In both cases, the underlying need is to ensure that change does not destabilize the underlying system.
That's the idea behind DataOps, which, as Tamr CEO Andy Palmer defines it, is "a data management method that emphasizes communication, collaboration, integration, automation and measurement of cooperation between data engineers, data scientists and other data professionals."
We're not going to pretend that an integration between a data science tool and code development hub will make DataOps happen; at the end of the day, it's all about people and process before it gets to technology. But it makes sense for tools like Dataiku to provide the integrations that can make those process links happen.
How to secure your data when you have non-traditional perimeters: