Getting data.world is not very easy.
That's because data.world seems to be working on the intersection of a number of things. What is it that it does exactly, how, and why?
As of today, data.world is officially releasing the enterprise version of its platform, and ZDNet had an in-depth discussion with data.world's team to address those questions.
From magic spreadsheets to massive graph database as a service
To understand where data.world is coming from, we took a step back to discuss data-related issues the team has had to deal with regularly. CEO Brett Hurt and CPO Jon Loyens, who co-founded data.world, drew on their experiences in previous roles at companies such as Bazaarvoice and HomeAway. Loyens, for example, described his magic spreadsheet, which should ring a bell:
"We had a data lake, and a data warehouse, and SAS, and we also had to integrate data from external sources to get our quarterly targets and forecasts. And it all went into this epically huge spreadsheet that would tell us the magic numbers we'd have to hit.
"When I had to pitch that to a team of engineers, oftentimes some would want to question my data and assumptions. And all I could tell them was, 'Well, here's the magic spreadsheet... good luck.'"
Loyens added that, at the same time, ex-colleague and data.world CTO Bryon Jacob was struggling with data management. At some point, this motivated them to join forces and form the team now led by Hurt. The team has been working since 2015 and has raised a total of US$33 million.
Data.world describes its mission as democratizing access to data and helping tap into more of your team's collective brainpower to achieve anything with data faster. That should also ring some bells, as it sounds like something a data science notebook or a self-service BI tool or a Hadoop vendor or a data lake platform provider could all be pitching.
So, what is data.world again, and how is it different from those? Loyens said there is a lot of emphasis on infrastructure, and a lot of emphasis on analytics, but how to get from one to the other is not clear. Their take on this was to build a massive graph database as a service, add layers on top of it, and focus on collaboration and social aspects.
Tim Berners-Lee inside
This sounds pretty generic, except maybe for the graph part. But for Hurt, this was the biggest strategic unlock for their business model and how they work with communities:
"The secret sauce in what we do is we're built on top of the Semantic Web and Linked Data. This is how the network effect of what we do kicks in. We are able to connect people to datasets they may not have even thought of, and it makes the world smaller," Hurt said.
Data.world is vocal about its use of this technology, but it also keeps a pragmatic stance. While it points to how well Linked Data technology lends itself to data integration and breaking down silos, it also acknowledges the two most common criticisms of Linked Data: accessibility and scale.
Part of data.world's mission is to make data discoverable, and while Linked Data may be a good match for this, it's not really considered accessible by data scientists or analysts.
"We've heard about Linked Data -- great promise, but it's hard to use, and hard to annotate." This is something data.world heard from users over and over, and its way of dealing with that was to abstract as much as possible from the specifics of using Linked Data to ingest and publish datasets.
When datasets are ingested or published, they are introspected and annotated by data.world using Linked Data standards and vocabularies (most prominently RDF, SKOS, and CSVW). Loyens said they make it easy for people to work with data in the tabular formats they are familiar with, and have built things such as a SQL-to-SPARQL bridge to democratize access.
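To make the idea of querying tabular data as a graph more concrete, here is a minimal sketch in Python. It is not data.world's implementation, which the article does not describe. The sample table, the `http://example.org/` namespace, and the helper names `rows_to_triples` and `select` are all made up for illustration. The sketch turns each table cell into a subject-predicate-object triple, then answers a SQL-shaped question by matching triple patterns, which is roughly the kind of translation a SQL-to-SPARQL bridge has to perform.

```python
import csv
import io

# Made-up sample table for illustration only; not a real dataset.
CSV_TEXT = """region,quarter,target
north,Q1,100
south,Q1,80
"""

BASE = "http://example.org/"  # placeholder namespace (an assumption)


def rows_to_triples(csv_text, table="targets"):
    """Turn every cell of a CSV table into a (subject, predicate, object) triple."""
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        subject = f"{BASE}{table}/row-{i}"  # one subject per row
        for column, value in row.items():
            # one predicate per column, shared by all rows of the table
            triples.append((subject, f"{BASE}{table}#{column}", value))
    return triples


def select(triples, table, column, where):
    """Answer a query shaped like SELECT column FROM table WHERE key = value
    by matching two triple patterns that share the same subject."""
    (key, value), = where.items()  # this sketch supports exactly one condition
    key_pred = f"{BASE}{table}#{key}"
    col_pred = f"{BASE}{table}#{column}"
    subjects = {s for s, p, o in triples if p == key_pred and o == value}
    return [o for s, p, o in triples if s in subjects and p == col_pred]


triples = rows_to_triples(CSV_TEXT)
result = select(triples, "targets", "target", {"region": "south"})
print(result)
```

The user keeps thinking in rows and columns, while the stored form is a set of triples that could be linked to other datasets through shared identifiers.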
At the same time, data.world provides access to the underlying formats and technology for those who want it. Hurt referred to how this aligns with the vision of Linked Data that Tim Berners-Lee (TBL) has been promoting, and added that he met TBL, who "loved what we do, and now has our sticker on his laptop wherever he goes."
Is this another data lake?
Celebrity endorsement is always good, but it won't get you too far if you have scale issues. Loyens said their answer was to adopt Apache Jena and, more specifically, to pick up a part of it that had been an abandoned academic project. Having hardened it, they intend to re-release it as open source soon, Loyens added.
Although the graph database space is booming, Loyens made it clear they do not intend to address this market. He believes a core part of the value data.world adds is the managed service, and that it would be hard to replicate that as a standalone offering.
Data.world may be built on a graph database, but it's not in the market as one. Similarly, it may sound like a data science notebook, but it really isn't one. Loyens said while notebooks are code-centric, data.world is data-centric. There is value in both, he added, and data.world integrates with notebooks.
Data.world seems to be addressing a wider audience than data scientists too, including analysts and line-of-business users. The vision is to enable a diverse group of people to interact around data and analytics to provide value. We have seen similar efforts, but it does not look like they are really catching on. IBM's DataWorks, for example, is long gone.
Data.world by contrast boasts clients such as Associated Press (AP). Data.world said it has helped AP with some of their biggest stories lately, by enabling AP and partners to collaborate on data analysis.
In terms of integrations, data.world integrates with Python, R, Microsoft Power BI, IBM SPSS, and MicroStrategy, among others. The team emphasized that the integrations were done on data.world's API, with no involvement on their side. The idea is to let users do the analysis on whatever tool they choose, and use data.world for the orchestration and collaboration part.
This approach is also reminiscent of a data lake. When asked whether they would encourage users to replace their data lake with data.world, Loyens said this is really up to users. Data.world can ingest metadata only, or data as well, operating side by side with data lakes or taking up their role.
To get the bigger picture about data.world, a few more things should be noted: Besides getting TBL and the semantic web crowd on board, data.world is being vocal about more than technology. Data.world is also leading the so-called data manifesto effort being unveiled today.
Hurt describes this as the equivalent of the Agile manifesto for data. He emphasizes his belief that a manifesto like this is needed to drive this new domain, mentioning for example the issue of data bias.
The data manifesto is built on a set of principles and values, and Hurt takes pride in having some of the heavyweights in data science co-author or sign the manifesto. This includes DJ Patil, whom Hurt met during Patil's stint as the White House Data Scientist.
Patil co-coined the term data scientist, and he will also serve on data.world's advisory board. Data.world is set up as a public benefit corporation, and Hurt said he sees this as the next step towards corporate responsibility. Hurt, a serial entrepreneur and investor, is also involved in evangelizing data-driven culture and the future of capitalism, among other things.
Data.world seems like an odd and interesting approach. Its mix of idealism and pragmatism is quite unique, and its team really seems to be standing behind it. Judging from what they've accomplished so far, it may take them even further.