Compute to data: using blockchain to decentralize data science and AI with the Ocean Protocol
The conflict between access to data and data sovereignty is key to understanding how AI works, and moving it forward. The Ocean Protocol Foundation wants to help resolve that conflict, by introducing a way of letting AI work with data without giving up control.
AI and its machine learning algorithms need data to work. By now, that's a known fact. It's not that algorithms don't matter; it's just that getting more and better data typically improves results more than tweaking algorithms does. This is often called the unreasonable effectiveness of data.
More data, and more compute capacity to train algorithms that use the data, is what has been fueling the rise of AI. Anyone who wants to train an algorithm for an AI application to address any problem in any domain must be able to get lots of relevant data in order to be successful.
That data can be public data, private data generated and owned by the organization developing the application, or private data acquired from third parties. Public data is not an issue. Private data owned by the organization itself must be collected and processed in accordance with data protection laws such as GDPR and CCPA.
But what about private data owned by third parties? Normally, application developers don't have access to it, and for good reasons. Why would you trust anyone with your private data? Even if the party you hand it over to promises to take good care of the data, once the data is out of your hands, that party can do as it pleases with it.
This is the problem the non-profit Ocean Protocol Foundation (OPF) wants to solve. ZDNet connected with founder Trent McConaghy to discuss OPF's mission and its latest milestone: Compute-to-Data.
Compute-to-Data: if the data will not come to compute, then compute must go to the data
McConaghy has been working on the Ocean Protocol since 2017. He has a background in AI and blockchain, having worked on projects such as ascribe and BigchainDB. He described how he realized that blockchain could help solve the issue of data escapes and privacy for data used to train AI algorithms.
The OPF has been working on setting up the infrastructure to make data more accessible via data marketplaces. As McConaghy pointed out, there have been many attempts at data marketplaces in the past, but they've always been custodial, which means the data marketplace is a middleman users have to trust. Recent case in point - Surgisphere.
But what if you could have marketplaces act as the connector without them actually holding the data, without having to trust the marketplace? This is what OPF is out to achieve - decentralized data marketplaces.
This is a tall order, and McConaghy is quick to admit that it will take years to get there. Last week, however, the OPF came one step closer by unveiling what it calls Compute-to-Data. Compute-to-Data provides a means to exchange data while preserving privacy: the data stays on-premises with the data provider, and data consumers run compute jobs on it to train AI models.
Rather than having the data sent to where the algorithm runs, the algorithm runs where the data is. The idea is very similar to federated learning. The difference, McConaghy says, is that federated learning only decentralizes the last mile of the process, while Compute-to-Data goes all the way.
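The idea of moving the algorithm to the data can be sketched in a few lines of Python. This is only an illustration of the pattern, not Ocean Protocol's API: `DataProvider`, `run_job`, and `fit_line` are hypothetical names, and the four data points are made up. The sketch also ignores what a real deployment would need, such as checking that a job does not simply return the raw rows.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[float, float]  # (feature x, target y)


class DataProvider:
    """Hypothetical data host: keeps raw rows private, exposes only job results."""

    def __init__(self, rows: Sequence[Row]) -> None:
        self._rows: List[Row] = list(rows)  # raw data never leaves the provider

    def run_job(self, algorithm: Callable[[Sequence[Row]], Dict[str, float]]) -> Dict[str, float]:
        # The consumer's algorithm executes next to the data;
        # only its return value (here, model parameters) is sent back.
        return algorithm(self._rows)


def fit_line(rows: Sequence[Row]) -> Dict[str, float]:
    """Consumer-supplied training job: least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return {"slope": slope, "intercept": y_bar - slope * x_bar}


# Made-up private data that follows y = 2x + 1.
provider = DataProvider([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)])
model = provider.run_job(fit_line)
print(model)
```

The consumer ends up holding only the two fitted parameters, while the rows stay inside the provider object.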
TensorFlow Federated (TFF) and OpenMined are the most prominent federated learning projects. TFF does orchestration in a centralized fashion, while OpenMined is decentralized. In TFF-style federated learning, a centralized entity (e.g. Google) must perform the orchestration of compute jobs across silos. Personally identifiable information can leak to this entity.
OpenMined addresses this via decentralized orchestration. But its software infrastructure could use improvement to manage computation at each silo in a more secure fashion; this is where Compute-to-Data can help, says McConaghy. That's all well and good, but what about performance?
If algorithms run where the data is, then how fast they run depends on the resources available at the host. So the time needed to train algorithms that way may be longer compared to the centralized scenario, factoring in the overhead of communications and crypto. In a typical scenario, compute needs move from client side to data host side, said McConaghy:
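To make the centralized orchestration pattern concrete, here is a minimal sketch of federated averaging, a common federated learning technique, in plain Python. It does not use TFF's or OpenMined's APIs; `local_update`, `federated_round`, the learning rate, and the toy data are all assumptions chosen for illustration. Each silo takes a training step on its own data, and the orchestrator only sees the resulting weights.

```python
from typing import List, Sequence


def local_update(weight: float, data: Sequence[float], lr: float = 0.5) -> float:
    """One gradient step at a silo, fitting `weight` to the mean of the local data."""
    gradient = sum(weight - x for x in data) / len(data)
    return weight - lr * gradient


def federated_round(global_weight: float, silos: List[Sequence[float]]) -> float:
    """Central orchestrator: send the weight out, collect updates, average by silo size."""
    updates = [local_update(global_weight, data) for data in silos]
    total = sum(len(data) for data in silos)
    return sum(w * len(d) for w, d in zip(updates, silos)) / total


# Toy silos: the raw values are only ever read inside local_update.
silos = [[1.0, 2.0, 3.0], [10.0, 12.0]]
weight = 0.0
for _ in range(20):
    weight = federated_round(weight, silos)
print(round(weight, 3))
```

The orchestrator converges on the mean of all the values without ever receiving them, but it still coordinates every round and sees every update, which is the role the article describes as a potential leak point.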
"Compute needs don't get higher or lower, they simply get moved. Ocean Compute-to-Data supports Kubernetes, which allows massive scale-up of compute if needed. There's no degradation of compute efficiency if it's on the host data side. There's a bonus: the bandwidth cost is lower, since only the final model has to be sent over the wire, rather than the whole dataset.
"There's another flow where Ocean Compute-to-Data is used to compute anonymized data, for example using Differential Privacy or Decoupled Hashing. Then that anonymized data would be passed to the client side for model building there. In this case most of the compute is client-side, and bandwidth usage is higher because the (anonymized) dataset is sent over the wire. Ocean Compute-to-Data is flexible enough to accommodate all these scenarios."
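The bandwidth difference between the two flows McConaghy describes can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the function names, the 10,000 made-up values, the use of a single mean as a stand-in for a trained model, and the noise scale. Adding Laplace noise to each row is shown only as a simple stand-in for anonymization; calibrating it to a formal differential privacy guarantee is not covered.

```python
import json
import random
from statistics import mean
from typing import List


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample, drawn as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def flow_model_only(rows: List[float]) -> bytes:
    """Flow 1: compute on the host, ship only the fitted parameter."""
    return json.dumps({"mean": mean(rows)}).encode()


def flow_anonymized(rows: List[float], scale: float, rng: random.Random) -> bytes:
    """Flow 2: add noise on the host, ship the noised rows to the client."""
    return json.dumps([x + laplace_noise(scale, rng) for x in rows]).encode()


rng = random.Random(0)  # fixed seed so the sketch is repeatable
private_rows = [rng.uniform(0, 100) for _ in range(10_000)]  # made-up data

model_payload = flow_model_only(private_rows)
data_payload = flow_anonymized(private_rows, scale=1.0, rng=rng)
print(len(model_payload), "bytes vs", len(data_payload), "bytes")
```

In the first flow the payload size does not grow with the dataset, while in the second it grows with the number of rows, which matches the trade-off described in the quote.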
From Don't be Evil to Can't be Evil
The OPF raised funding and built a well-versed team between 2017 and 2020. In order to realize the vision of decentralized data marketplaces, the OPF works in two ways. First, by eating its own dog food, working to develop its community-driven marketplace. Second, by helping others build their own marketplaces. McConaghy mentioned examples such as MOBI and dexFreight.
The Mobility Open Blockchain Initiative (MOBI) is a nonprofit organization working with companies, governments, and NGOs. The goal is to make mobility services more efficient, affordable, greener, safer, and less congested by promoting standards and accelerating adoption of blockchain and related technologies. The OPF helps make data and services available to solve challenges to coordinate vehicles, identify obstacles and route autonomous cars.
McConaghy emphasized that the OPF typically does not work directly with users. Its role is to develop the core technology, and empower others to use it. Asked what he sees as the advantages of developing a decentralized marketplace, McConaghy said that it enables organizations to turn data from a potential liability to an asset, without compromising user privacy.
He went on to cite examples such as 23andMe, or Facebook, in which the parties entrusted with the data broke their promises and used the data for nefarious purposes: "Don't be evil mottos can be compromised if companies are incentivized to mine or sell data. What we want to do is Can't be evil".
For end users, however, it may take a while to reap the benefits of the approach. In the path that McConaghy envisions, end users will initially be able to play with existing marketplaces. Step 2 would be to set up data unions, trusts, or co-ops that act on behalf of the users and give them royalties for their data.
McConaghy said that Ethereum-powered DAOs (Decentralized Autonomous Organizations) could power such organizations, likening them to subreddits with smart contract-based governance. Step 3, consumer-level applications for domains such as social networking, will take a while to appear, McConaghy concedes.
Disclosure: The author worked on a project with the OPF in 2018, and holds an amount of OCEAN tokens as part of that engagement.