Whenever machine learning is applied to a given field of industry, defining the objective is one of the most salient questions. Just what are you trying to find out?
In the biological sciences, that can be a very tricky question, as six-year-old startup Recursion Pharmaceuticals has learned from experience. The amount of data expands rapidly and knowing how to frame an objective that yields insights from the data is something of an art.
"It's still a big data problem," says Mason Victors, chief technologist of Recursion, which is based in Salt Lake City, Utah. "We have chosen a very large swath of biology on which to develop our platform over many years," he said in an interview with ZDNet. And that brings challenges.
The company gathers 65 terabytes of data per week, which it stores in Google's cloud computing facility. Recursion has amassed roughly 2.5 petabytes of information in a little over four years.
Recursion is trying to do two things that are complementary but also hugely ambitious. Nominally, Recursion's mission is to find cures for diseases in a way that cuts down the long, costly pipeline of drug development.
The grander, "two-decade" vision for the company, as co-founder and chief executive Chris Gibson explained in an interview with ZDNet, is to be able "to predict how any molecule, large or small, will affect any state" of the cell. It is what he and Victors refer to as a map of all human cellular biology, as many details as possible about the "morphology" of cells, their shape and structure.
Recursion has gotten some substantial funding for that very large data science project. It recently scored $121 million in venture money in a Series C round led by British investment fund Baillie Gifford, for a total of $200 million in investment to date.
Searching for treatments while also managing the ambitious project of creating a map of all human cells is a balancing act, where the objective function can be simple, but the data management can be extremely complicated.
It begins with a procedure called "cell painting" that covers the cells in as many fluorescent dyes as possible, to bring out aspects of the structure of the cell. Cell painting was developed by Anne Carpenter of the Broad Institute of MIT and Harvard in Cambridge, Mass., who runs the Carpenter Lab there. The software she created, "CellProfiler," is available for download for free.
Painting the cell goes beyond the typical "screening" of cells, which aims to pick out a handful of features. Instead, the process of creating a "profile" of a cell quantifies hundreds or thousands of characteristics about the structure of a cell that can then be introduced as input to a machine learning model to in turn find features of interest that change with perturbations. The perturbations could include something like altering a cell's RNA to see how it changes the structure of the cell.
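The idea can be sketched in a few lines of Python. This is a toy illustration with made-up numbers, not Recursion's pipeline: each cell is represented as a vector of profile features, and the features that shift most under a simulated perturbation are the ones a model would flag as interesting.

```python
import numpy as np

# Hypothetical profiles: each row is one cell, each column one
# CellProfiler-style feature (size, texture, intensity, and so on).
rng = np.random.default_rng(42)
control = rng.normal(0.0, 1.0, size=(300, 5))

# Simulate a perturbation that shifts only feature 2.
perturbed = control + np.array([0.0, 0.0, 2.0, 0.0, 0.0])

# Rank features by how strongly the perturbation shifts them,
# in units of the control population's spread.
effect = np.abs(perturbed.mean(axis=0) - control.mean(axis=0)) / control.std(axis=0)
top_feature = int(np.argmax(effect))  # feature 2 stands out
```

In practice the feature vectors number in the hundreds or thousands per cell, and the "perturbation" is a real intervention such as an RNA edit rather than an added offset.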
Gibson first discovered Carpenter's approach when he was pursuing a PhD at the University of Utah. "It's a fancy way of taking pictures of cells," says Gibson, but it was also something of a revelation to him at the time. He recalls using the Western blot technique to explore a condition called "cerebral cavernous malformation," or CCM, where blood vessels in the brain become deformed, which can lead to the equivalent of a miniature stroke. The Western blot approach was tedious, examining one protein at a time.
"We had become familiar with Carpenter's approach, where she was able to feed things into a machine classifier," he recalls, and automate the examination of many molecules all at once. Gibson and his mentor, Dean Li, then professor of medicine and biology at the university, tried out the approach. Cell painting was able to confirm some hunches for Gibson in the traces of CCM, but also, "it was seeing something I wasn't seeing," he said when applying machine learning to the information-rich images. Gibson joined with Li to found Recursion on the premise that rich pictures of cells could yield original insights that regular screening couldn't. They were joined by a third co-founder, bioinformatician Blake Borgeson.
Carpenter serves as a scientific and technical advisor for the company. Other advisors include famed deep learning researcher Yoshua Bengio, head of Montreal's prestigious MILA institute for machine learning, and one of the three recipients of this year's ACM Turing Award for lifetime computer science accomplishment, along with Yann LeCun of Facebook and Geoffrey Hinton of the University of Toronto.
From the cell paintings, machine learning is applied to tease out some basic relationships that may be significant. "What matters is what is the task you train the network on, how do you find the things you care about," says CTO Victors, who holds a master's in mathematics from Brigham Young University, and who has served as a data scientist at previous startups.
A straightforward question can be, Do these cells look the same? "You feed triplets of examples of cells to a network, and two of them should be similar, and a third should be different," he explains. The triplets are the result of encoding the cell painting's features as "embeddings," or what Victors calls placing them in "latent representation space." Some very simple approaches in statistics can be used, such as measuring "angular distance" between the features of the different cells.
"We have found a lot of traction in modeling things geometrically," he says. "Angular distance is really a useful metric as opposed to Euclidean distance."
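The triplet comparison can be made concrete with a small sketch. The embeddings below are invented for illustration; angular distance is derived from the cosine of the angle between two feature vectors, which is what makes it insensitive to vector magnitude in a way Euclidean distance is not.

```python
import numpy as np

def angular_distance(u, v):
    """Angle between two embedding vectors, scaled to [0, 1]."""
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against floating-point values just outside [-1, 1].
    return np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi

# Toy triplet: a and b are wells given the same perturbation,
# c was given a different one.
a = np.array([0.9, 0.1, 0.2])
b = np.array([0.8, 0.2, 0.1])
c = np.array([0.1, 0.9, 0.3])

same_pair = angular_distance(a, b)
diff_pair = angular_distance(a, c)  # larger: c points in another direction
```

A network trained on triplets learns an embedding in which the first distance is small and the second is large, so "do these cells look the same?" reduces to a simple comparison in the latent space.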
But just measuring features isn't enough, which is why the company maintains a "wet lab," where perturbations can be tried out in vitro to see how cells respond to a given compound. The dance of teasing meaning out of giant data is a big thing that sets the company apart from a raft of startups in the AI of biology and medicine, says Victors.
"Other groups in drug discovery are handcuffed to existing static data sets they have no control over," observes Victors, whereas Recursion is generating new data constantly. Because of that, he insists, the company can not only train but also validate machine learning models with greater care.
"It comes down to the ability to generate data at an incredibly massive scale and also generate it in a tight feedback loop," he says. "It often involves a very tight collaboration between the data scientists, the machine learning experts, and the life science experts as well, to figure out how we actually model the biology itself, and what the impact of that is going to be on the analyses we adopt."
"From a business standpoint, it lets us rapidly go after potential drug candidates in a really effective way," says Victors. "We can run an experiment to generate data to see whether we think this compound is potentially effective, and then if we do, go for a deeper study with increasing doses, and more replicates, to verify that across other disease reagents to see if we see similar efficacy there."
"We don't have to outsource all that," he notes of the in-vitro testing and screening, "and so we can eliminate the longer wait times and the cost it would bring."
It's not just having a wet lab, says Victors, but also "all the engineering infrastructure that has to be built to handle the amount of streaming data," the big data challenge, in other words. "It's about how you process that data, transfer it up to the cloud, store it there, it's about having scalable distributed systems, and then returning the data in a suitable format for one-off or ad-hoc analyses -- all of that is also a big challenge because of the overall scope and ambition of what we're trying to accomplish."
Having control over the data is important because the company can be mindful of how the data distribution changes over time. "As we refine the biological tools we use to be more specific and selective, this can lead to a different distribution than in the past," observes Victors. Knowing the "vintage" of data, if you will, the company can adjust its analysis to take into account how that drift may affect machine learning. Because much of AI is affected by small statistical variations in the data, being cognizant of things such as distribution shifts may play a role in getting useful analysis out of the model.
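One simple way to watch for that kind of drift, offered here as a generic sketch rather than anything Recursion has described, is to compare feature distributions between an older "vintage" of data and a newer one:

```python
import numpy as np

def drift_score(old_profiles, new_profiles):
    """Crude per-feature drift check: standardized difference of means
    between an older vintage of data and a newer one."""
    mu_old = old_profiles.mean(axis=0)
    mu_new = new_profiles.mean(axis=0)
    pooled_std = np.sqrt((old_profiles.var(axis=0) + new_profiles.var(axis=0)) / 2)
    return np.abs(mu_new - mu_old) / (pooled_std + 1e-8)

rng = np.random.default_rng(0)
old = rng.normal(0.0, 1.0, size=(500, 3))
new = rng.normal([0.0, 1.5, 0.0], 1.0, size=(500, 3))  # feature 1 has drifted

flagged = drift_score(old, new) > 0.5  # only feature 1 crosses the threshold
```

Features flagged this way can then be renormalized, or models retrained on the newer vintage, before the shift quietly degrades downstream analysis.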
One outcome of the big data efforts is a new, publicly available data set that Recursion released in May, called RxRx1. It consists of over 100,000 images, some 300 gigabytes in all, "representing diverse biological contexts." Recursion hopes the data set will spur outside researchers to develop new machine learning techniques. It was announced at the International Conference on Learning Representations that month.
Most of what Recursion needs to do in machine learning today, such as the angular distance of triplets, doesn't require deep learning forms of AI. Instead, it can be done with very basic tools. "The deep learning approach is not the majority of the work we do here," says Victors. "We find complementary signal there, but the standard approaches get you 90% of the way there."
There are problems with deep learning, he notes. A "variational auto-encoder," a popular form of unsupervised deep learning, can be problematic because it's not selective enough.
"Any time you generate biological data, you have batch effects," notes Victors. "These are nuisance factors that are just due to the experimental process itself -- say, the temperature was different this time, the humidity was different, or the cells were treated longer than the prior time."
A variational auto-encoder "would also be learning how to represent those batch effects in the representation, which you don't want," he notes.
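A common corrective for the simplest kind of batch effect, a constant per-batch offset, is to center each batch's embeddings on its own mean. This is a standard normalization trick from the literature, shown here as an illustration; it may bear little resemblance to what Recursion actually does.

```python
import numpy as np

def center_per_batch(embeddings, batch_ids):
    """Subtract each batch's mean embedding, removing a constant
    per-batch offset (a crude model of a batch effect)."""
    corrected = embeddings.astype(float).copy()
    for b in np.unique(batch_ids):
        mask = batch_ids == b
        corrected[mask] -= corrected[mask].mean(axis=0)
    return corrected

rng = np.random.default_rng(1)
signal = rng.normal(size=(200, 4))          # "true" biological variation
batch_ids = np.repeat([0, 1], 100)          # two experimental runs
offsets = np.array([[0.0] * 4, [3.0] * 4])  # run 1 carries a nuisance shift
raw = signal + offsets[batch_ids]

clean = center_per_batch(raw, batch_ids)
```

Real batch effects are rarely a clean constant shift, which is why a representation that has "unlearned" them, rather than a post-hoc correction, remains an open research problem.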
The process of perturbing cells with a given molecule and seeing what happens sounds a bit like what's known as "reinforcement learning" in the machine learning field. As Victors describes it, there is a "state-action" model, the same concept as in reinforcement learning. "We use our images to represent a snapshot of cellular state, and then we can act on those cellular states by introducing perturbations, and learn the meaning of actions."
But, he hastens to add, it's "quite different from reinforcement learning in many ways -- it's more than learning the state-action relationship, we have to make sure the data going into those functions is paired appropriately."
Over the long term, there is a role for deep learning in creating that unified map of cellular biology, he opines.
"One area we expect deep learning to be really effective at is in creating a universal latent representation space, a space where all your data resides, where you have unlearned the things you don't want to know, and learned only what you want to know, experiments across time and across different conditions, to have distance and similarity mean something in that space -- that is still an area of active research for this purpose."
All of it comes back to the clinical utility of discoveries, says CEO Gibson. "Reagents are not perfect, they are messy, and we have to have a pretty strict threshold" of statistical confidence in what the computer finds, he notes. "I'm worried that there is some over-fitting happening a lot in the industry," he says of machine learning in biology. "There is a lot of machine learning being applied to data sets that are static, public data sets out there." Gibson expresses confidence that Recursion is not falling into that trap, in part because it has applied its tools retrospectively to known data and come up with relationships between drugs and diseases that match what was already established, showing the process works.
The real test, as he says, is in people, something that requires money and partnerships. Using its capital, Recursion is in Phase I clinical trials for the treatment of CCM, the problem Gibson was studying when he had the epiphany about cell painting and big data. The company also is preparing a Phase II clinical trial of a treatment for the neurodegenerative disease known as neurofibromatosis type II. (Info on Recursion's pipeline can be found on the company website.)
Those kinds of diseases are less resource-intensive in terms of trial costs. Bigger projects require deeper pockets, and Gibson says the really big payoff in clinical results for the company in the next two years will probably come from a study being conducted with a big partner in the area of oncology. "We think it has a chance to leapfrog the other two."
The choice to both partner and to go it solo on some investigations is a flexibility that reflects the value in the platform, Gibson believes. Knowing the idiosyncrasies of data, and knowing how to ask questions of it, has value that can be mined in more ways than one.