The great data science hope: Machine learning can cure your terrible data hygiene
Will there ever be a technology that can fix decades of poor data hygiene? Probably not, but that isn't going to stop technology vendors from trying. The good news: Machine learning may come closest to saving your data management hide.
Data hygiene isn't easy. You can't hire enough interns to even come close to rectifying past mistakes. The reality is enterprises haven't been creating data dictionaries, metadata and clean information for years. Sure, this data hygiene effort may have improved a bit, but let's get real: Humans aren't up for the job and never have been. ZDNet's Andrew Brust put it succinctly: Humans aren't meticulous enough. And without clean data, a data scientist can't create algorithms or a model for analytics.
Luckily, technology vendors have a magic elixir to sell you...again. The latest concept is to create an abstraction layer that can manage your data, bring analytics to the masses and use machine learning to make predictions and create business value. And the grand setup for this analytics nirvana is to use machine learning to do all the work that enterprises have neglected.
I know you've heard this before. The last magic box was the data lake where you'd throw in all of your information--structured and unstructured--and then use a Hadoop cluster and a few other technologies to make sense of it all. Before big data, the data warehouse was going to give you insights and solve all your problems along with business intelligence and enterprise resource planning. But without data hygiene in the first place, enterprises replicated a familiar but failed strategy: Poop in. Poop out. And you wouldn't want to make your in-demand data scientists deal with poo.
Seth Dobrin, chief data officer for IBM, said "the idea that you could use a data lake and Hadoop (MapReduce) instance where you can dump all this crap in is a mistake." Not too surprisingly, IBM has its Watson Data Platform and a series of tools that use machine learning to clean data, append metadata and make connections between data stores. IBM's data platform sounds like a mix of middleware and operating system, but you get the idea. IBM's data platform will also recommend models and algorithms.
Other vendors in the space include Alation and Io-Tahoe as well as Cloudera and Hortonworks. While the approaches vary, the general idea is to use machine learning to make data more usable. Ovum's Tony Baer, also a ZDNet contributor, is betting that this data abstraction layer will be a key 2018 trend for big data, data science and machine learning.
Know this: Every technology vendor you have will have some spin on this data abstraction layer to pitch AI and analytics. Also know this: You'll listen since your data hygiene has been terrible and you need a bailout.
Salesforce at its Dreamforce powwow preached the democratization of artificial intelligence and analytics. Salesforce's Einstein platform will provide a bevy of insights. Data hygiene presumably won't be a problem since the enterprises that go with Einstein have most of their data with Salesforce.
And Salesforce isn't alone. One argument for the cloud is that data can be standardized and live on one platform and data model. Substitute Oracle, SAP and Workday for Salesforce and the concept is basically the same. Microsoft has its Common Data Platform. In the end, the subtext is the same: Dear enterprise, put all of your data with us.
I noted how the Internet of things and cloud muddy the data ownership waters a few weeks ago. Now it's worth pondering what vendors will own your queries. IBM is betting that its open strategy will win the day and be that abstraction layer to multiple data stores (with cleansing on the fly). Toss Tableau in the mix to own your queries. We'll see. The only certainty is that data hygiene will be an ongoing issue that scales.
ZDNet's Monday Morning Opener
The Monday Morning Opener is our opening salvo for the week in tech. Since we run a global site, this editorial publishes on Monday at 8:00am AEST in Sydney, Australia, which is 6:00pm Eastern Time on Sunday in the US. It is written by a member of ZDNet's global editorial board, which is comprised of our lead editors across Asia, Australia, Europe, and the US.