As the advent of machine learning continues to disrupt a swathe of industries, one of the things that is becoming increasingly clear is that machine learning needs lots of high-quality data to work well.
According to the findings of a recently released survey, 99% of respondents reported having had an ML project completely canceled due to insufficient training data, and 100% of respondents reported experiencing project delays as a result of insufficient training data.
Using synthetic data is one approach to get around the issues associated with obtaining and using high-quality data from the real world. Today Rendered.ai announced the availability of its Platform as a Service offering for synthetic data engineers and computer vision scientists.
Rendered.ai touts its platform as the first of its kind, and as a complete stack for synthetic data, including a developer environment, a content management system, scenario building, compute orchestration, post-processing tools, and more.
We caught up with Rendered.ai Founder and CEO Nathan Kundtz to learn more about the use cases the platform can serve, and how it works under the hood.
Quality data for AI models is hard to come by, and expensive
Kundtz, a physicist by training, has a Ph.D. from Duke University. He also has previous startup experience, having founded and successfully handed over Kymeta. Kymeta is a developer of hybrid satellite-cellular networks, and Kundtz kept hearing about the challenges people in the satellite industry were having with data.
He put his thoughts on how to possibly address those challenges in a whitepaper, which he shared with a few people. Some of these people decided to work with him, trying to build tools that could help people in the satellite industry, particularly in remote sensing. That led to starting Rendered.ai in 2019.
Kundtz referred to remote sensing as involving imagery of "cities being built, patterns of life, crops, forestry, etc from space". That squarely falls under the category of unstructured, visual data. But that's not all Rendered.ai can produce.
Visual data can refer to the type of imagery that comes from cameras, but it can also refer to things such as X-rays. Rendered.ai also does radar and many other different sensing modalities that can ultimately be translated using computer vision tools. The platform can also be used for non-visual data, such as tabular data, audio data, or video data.
Kundtz highlighted a use case in which Orbital Insight worked with Rendered.ai as part of a National Geospatial-Intelligence Agency Small Business Innovation Research grant. Orbital Insight demonstrated improved outcomes for object-detection performance through the use of synthetic data.
Rendered.ai helped Orbital Insight modify synthetic images so that the trained AI model can generalize to real images. It also helped combine a large set of synthetic images with a small set of real examples, so that a model could be trained jointly and efficiently on both.
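To make the idea of joint training more concrete, here is a minimal sketch of one common way to mix a large synthetic set with a small real set: reserve a fixed share of every training batch for real examples. This is an illustration only, not Orbital Insight's or Rendered.ai's actual pipeline; the function name `mixed_batches`, the 25% real share, and the placeholder datasets are all assumptions.

```python
import random

def mixed_batches(synthetic, real, batch_size, real_fraction, num_batches, seed=0):
    """Yield batches that combine synthetic and real examples.

    Each batch holds a fixed number of real examples (drawn with
    replacement, because the real set is small) and is filled up
    with synthetic examples drawn without replacement.
    """
    rng = random.Random(seed)
    n_real = max(1, round(batch_size * real_fraction))
    n_synth = batch_size - n_real
    for _ in range(num_batches):
        batch = rng.choices(real, k=n_real) + rng.sample(synthetic, n_synth)
        rng.shuffle(batch)  # avoid a fixed real-then-synthetic ordering
        yield batch

# Placeholder data: in practice these would be annotated images.
synthetic = [("synthetic", i) for i in range(1000)]
real = [("real", i) for i in range(20)]

batches = list(
    mixed_batches(synthetic, real, batch_size=16, real_fraction=0.25, num_batches=5)
)
```

Sampling the real examples with replacement lets a handful of real images appear in every batch, while the synthetic set supplies the volume and variety.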
As Kundtz noted, to make images relevant for computer vision, it takes more than the images themselves. Images need to be annotated, to properly label depicted items that need to be identified by AI models.
Annotating a 200-kilometer swath in RGB photogrammetry can cost upwards of $65,000, Kundtz said. And that does not necessarily include all the objects that the people sponsoring the annotation would like to train AI models to identify. The idea behind synthetic data is to generate data that is realistic enough, but at the same time is guaranteed to include everything that the AI model needs to learn, and comes pre-annotated, therefore lowering cost.
Approximating the real world
Rendered.ai applies what it calls a physics-based approach. What this means in practice, as Kundtz explained, is that they apply physics-based simulations to approximate real-world behavior well enough to generate useful data. There are other ways to generate synthetic data, but Kundtz believes none of them works as well.
GANs (Generative Adversarial Networks) are a common method used to generate synthetic data. Essentially, we provide a lot of images and then teach an algorithm to make more like what we already have, as Kundtz put it. The trouble with GANs, he went on to add, is that you're not introducing any new information. You produce more of what you already have.
Another method to produce synthetic data is using video game engines. There's a lot of physics in that, and Rendered.ai uses them too, Kundtz conceded, but it's rather narrow in scope. He believes that this approach doesn't lend itself to the wide range of use cases that people need synthetic data for. Plus, game engines are not at the point where they're indistinguishable from reality, and sometimes that can have an important effect on algorithms.
What Rendered.ai has done, Kundtz said, is to make its platform extensible to a wide variety of different simulation types, and then build partnerships with the companies that have deep expertise in those areas. Not just working with video game engine codes, but embedding deep physics knowledge.
In any case, it's not about simulating the real world, but rather simulating the mesh that you can create of the real world. By definition, the simulation is not going to capture 100% of the fidelity of the real world. This means that you need to do two things, Kundtz noted.
The first is to overcome gaps with respect to reality, to avoid introducing artifacts that can confuse AI models. The second is to apply post-processing effects, to help overcome the so-called uncanny valley and improve realism.
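As a rough illustration of the second point, the sketch below degrades a perfectly clean synthetic image with a blur and some sensor-style noise, two simple effects that make rendered imagery look less artificially crisp. Kundtz did not specify which post-processing effects Rendered.ai applies, so the choice of a 3x3 box blur and Gaussian noise, as well as all function names and parameter values, are assumptions made for the example.

```python
import random

def box_blur(image):
    """3x3 mean blur; border pixels average over the neighbours that exist."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                image[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = sum(window) / len(window)
    return out

def add_sensor_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clamp each pixel to [0, 1]."""
    return [
        [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
        for row in image
    ]

def post_process(image, sigma=0.02, seed=0):
    """Blur first, then add noise, roughly mimicking optics followed by a sensor."""
    rng = random.Random(seed)
    return add_sensor_noise(box_blur(image), sigma, rng)

# A perfectly sharp synthetic edge: left half dark, right half bright.
clean = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
degraded = post_process(clean)
```

A model trained only on the razor-sharp `clean` edge may latch onto that sharpness; the `degraded` version is closer to what a real camera would record.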
Rendered.ai's platform has two main components: a developer framework, and a compute orchestration environment. "Anything you can script with Python, you can put into that developer framework", as Kundtz put it. There is also a visual layer, a no-code environment as Rendered.ai calls it, which enables people to build workflows without typing everything manually.
But the heart of the approach lies in what Rendered.ai calls "the graph". This is a visual way of defining different types of objects, their properties, and interdependencies:
"The graph does not just define a piece of data, one image or one table, but a stochastic approach to generating them. So you can use that graph to continually generate additional data within some domain", Kundtz said.
In this context, Rendered.ai defines the roles of the synthetic data engineer and the computer vision engineer. The synthetic data engineer is the person who's writing scripts that define what is going to be possible from different graphs. The computer vision engineer ingests graphs and determines what are the things they want to see in a particular dataset.
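The notion of a graph that defines "a stochastic approach" to generating data, rather than a single image or table, can be sketched in a few lines of Python. The example below is a toy, not Rendered.ai's actual graph format or API: the node names, the `GRAPH` structure, and the `sample_scene` and `generate_dataset` functions are all hypothetical. Each node lists the nodes it depends on and a sampler that draws a value at random; sampling the whole graph yields one scene description, with its annotations included.

```python
import math
import random

# Hypothetical graph: node name -> (parent node names, sampler).
# A sampler receives the random generator and a dict of its parents' values.
GRAPH = {
    "sun_elevation_deg": ([], lambda rng, p: rng.uniform(10.0, 80.0)),
    "vehicle_count": ([], lambda rng, p: rng.randint(0, 5)),
    "vehicles": (
        ["vehicle_count"],
        # Positions double as annotations: the labels come for free.
        lambda rng, p: [
            {"label": "vehicle", "x": rng.uniform(0, 1), "y": rng.uniform(0, 1)}
            for _ in range(p["vehicle_count"])
        ],
    ),
    "shadow_length": (
        ["sun_elevation_deg"],
        # Shadows of a unit-height object get longer as the sun gets lower.
        lambda rng, p: 1.0 / math.tan(math.radians(p["sun_elevation_deg"])),
    ),
}

def sample_scene(graph, rng):
    """Draw one scene by resolving every node after its parents."""
    values = {}

    def resolve(name):
        if name not in values:
            parents, sampler = graph[name]
            parent_values = {parent: resolve(parent) for parent in parents}
            values[name] = sampler(rng, parent_values)
        return values[name]

    for name in graph:
        resolve(name)
    return values

def generate_dataset(graph, n, seed=0):
    """Sample the same graph n times to get n different, pre-annotated scenes."""
    rng = random.Random(seed)
    return [sample_scene(graph, rng) for _ in range(n)]

dataset = generate_dataset(GRAPH, n=100)
```

In the article's terms, the synthetic data engineer would be the one writing samplers like these, while the computer vision engineer would adjust ranges such as the vehicle count or sun elevation to get the dataset they need. Widening a range is also one way to produce edge cases on purpose.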
Collaborative platform, compute included
Kundtz also elaborated on the process and the tools used to introduce a certain amount of randomness where necessary. This can be useful to ensure that the data reflects the real world, and also to generate edge cases and test different scenarios.
Rendered.ai claims part of the innovation its platform introduces is precisely the definition of those different roles in the process, along with the collaboration infrastructure to support them. Most simulation tools and 3D modeling and game tools are built around a single user, but synthetic data is fundamentally multidisciplinary, Kundtz said.
The onboarding process for Rendered.ai typically starts from existing code, which is then modified to fit each client's needs. Kundtz acknowledged that it's early days for synthetic data, so educating clients and helping them experiment is part and parcel of Rendered.ai's mission.
What helps in that respect is the fact that a Developer or Professional plan, at $500/month and $5,000/month respectively, comes bundled with computing on AWS. Although some restrictions on instances do exist, the idea is to empower users to run the experiments they need without worrying too much about their AWS bill. There is also a free tier available to test the platform.
Rendered.ai, which received $6 million in seed funding in 2021, has already released an open-source application and related content to help onboard users to its platform. Kundtz mentioned they will be releasing additional open-source applications and content for more domains, in an effort to onboard more users.
"We can do a lot to help people in this industry. And I think this is one of the most important problems facing AI, if not the most important problem. So I'm excited to be able to help out", he concluded.
Note: The article was updated on Feb 4 2022 to correct the date of Rendered.ai's funding round and the names of its subscription levels, which were previously erroneously reported.