Despite the hype, ML really is in its infancy, according to a just-published survey from O'Reilly. The good news is that those furthest along with ML are increasingly cognizant of moral and regulatory hazards such as bias and privacy. The survey results also hint at drastic changes in store as adoption grows more broad-based.
A few months back, we gave our take on an O'Reilly survey regarding interest in deep learning. That survey reported that interest was more than latent, but there's little question that the bulk of the action today is in the (relatively) better understood confines of machine learning (ML). So this go-round, O'Reilly jumped into the shallower side of the pond, surveying the people who subscribe to its publications and attend its big data-related Strata and AI conferences about ML.
Before diving in, let's put some perspective on this cohort: it's likely a group that, on average, is ahead of the curve by virtue of its attendance at these big data events or consumption of O'Reilly learning services, which are skewing increasingly toward the AI domain. Nonetheless, it provides a useful counterpoint to O'Reilly's earlier work exploring interest in deep learning.
The least surprising part of the survey is how respondents categorized their organizations' experience with ML: roughly half are beginners in the exploration phase who are just starting to investigate ML. The remainder -- early adopters with roughly two years of ML experience and "sophisticated" organizations with at least five years -- accounted for 36% and 15%, respectively.
Our take is that if you blew out the survey to a totally blind sample taken from the general population, those numbers would drop considerably. Nonetheless, we'd surmise that these organizations, by virtue of their budgeting for IT/data or analytics-related learning (in the form of literature, courseware, and conferences), are among those who will be spending the lion's share on IT -- and on AI and ML in particular.
In the interest of full disclosure, these results are of more than passing interest to us because of the primary research we're conducting for the day job: an Ovum study, jointly sponsored with Dataiku, on the people and process side of AI. We'll be presenting the results at the Strata conference next month.
And one of the findings that piqued our interest was the prevalence of job titles. Data scientist was by far the most common, named by 57% of respondents, followed by business analyst and data analyst at 51%, with data engineer trailing at 39%. The results clearly show the skew of the sample; if you surveyed enterprises in general, we doubt that over half would have data scientists on staff. Moreover, based on job listings from Indeed.com, demand for data engineers is currently more than quadruple that for data scientists.
But there was an interesting outlier related to a finding in our own research: the emergence of a new job title or role, machine learning engineer, which was reported by 23% of respondents. In our admittedly small and unscientific sample of chief data scientists and chief data officers, we never heard that term. Reflecting the newness of this role, listings for ML engineers currently run at half those for data scientists.
When we asked O'Reilly survey coauthor Ben Lorica, he described a role responsible for translating and deploying the ML models that R and Python programmers develop, probably on their laptops. In other words, someone who knows the layout of data sources and the topology of compute resources -- a blend of data engineer and DataOps practitioner with high-level knowledge of, say, the difference between a Random Forest and a clustering algorithm. And when we looked back at our research, several of the data science heads whom we interviewed spoke of the need for a practitioner with knowledge of data and systems architecture. Just remember where you heard it first.
Regarding development of ML models, the results were at first glance not surprising: the more experienced the organization, the more likely it was to write its own models. Only 12% of organizations just getting their feet wet were doing so, compared to nearly three quarters of the most advanced "sophisticated" group. But given the popularity of open source libraries and frameworks like TensorFlow, Keras, Scikit-learn, and the CRAN portfolio, chances are they weren't starting completely from scratch.
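To illustrate why those libraries mean "writing your own model" rarely means starting from scratch, here's a minimal sketch using scikit-learn, one of the frameworks named above. The synthetic dataset and hyperparameters are illustrative assumptions, not anything from the survey:

```python
# Minimal sketch: a working ML model in a few lines, courtesy of scikit-learn.
# The synthetic dataset and hyperparameters here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Generate a toy binary-classification dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# The library supplies the algorithm; the practitioner only picks parameters.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Holdout accuracy: {accuracy:.2f}")
```

The heavy lifting -- the Random Forest algorithm itself, the data splitting, the evaluation -- all comes off the shelf, which is exactly why "internally developed" models lean so heavily on these ecosystems.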
However, we suspect that with the new availability of cloud services promising to democratize AI -- from Amazon SageMaker to Azure ML Studio and Google Cloud AutoML -- the proportion of internally developed models might actually drop over time as non-data scientists get in on the fun. Nonetheless, these are still early times for cloud ML services in general: only 3% of respondents reported using them.
What was encouraging was the surprisingly high proportion of respondents -- almost half -- among the more experienced "sophisticated" group reporting that their organizations are at least aware of, and starting to vet, their models and data samples for bias, fairness, and privacy. Of course, with the onset of GDPR, their organizations are probably not giving them a choice. And this group, being the most elite, should have more awareness; the open question is whether that awareness will spread as adoption reaches the more general enterprise population.
But for every potential problem there is opportunity. Cloud and data science tool providers, take note: the need to improve the quality and fairness of modeling, data sampling, and privacy protection is an opportunity for the ecosystem to harness ML itself to help automate the flagging of warning signs.
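As one hedged sketch of what such automated flagging could look like, here is a simple demographic-parity check. Everything in it -- the data, the group labels, and the 0.8 threshold (borrowed from the "four-fifths rule" used in US hiring guidance) -- is a hypothetical illustration, not a description of any vendor's product:

```python
# Sketch of one simple automated bias check: demographic parity.
# Flags cases where a model's positive-outcome rate for one group falls
# too far below another's. The data, groups, and 0.8 threshold here are
# illustrative assumptions only.
def positive_rate(predictions, groups, group):
    picks = [p for p, g in zip(predictions, groups) if g == group]
    return sum(picks) / len(picks)

def flag_disparate_impact(predictions, groups, threshold=0.8):
    """Return (group_a, group_b, ratio) triples where group_a's
    positive rate is below threshold times group_b's."""
    rates = {g: positive_rate(predictions, groups, g) for g in set(groups)}
    flags = []
    for a in rates:
        for b in rates:
            if rates[b] > 0 and rates[a] / rates[b] < threshold:
                flags.append((a, b, rates[a] / rates[b]))
    return flags

# Hypothetical model outputs (1 = favorable outcome) for two groups.
preds  = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
print(flag_disparate_impact(preds, groups))
```

Real tooling would go much further -- sampling audits, counterfactual tests, privacy checks -- but even a crude monitor like this shows that the warning signs are machine-checkable, which is precisely the opening for the ecosystem.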