Nvidia, Harvard researchers use AI to find active areas in cell DNA

Using a new deep learning toolkit called AtacWorks, researchers are studying how diseases and genomic variation influence specific types of cells in the human body.
Written by Stephanie Condon, Senior Writer

Researchers from Nvidia and Harvard are publishing a paper this week on a new way they've applied deep learning to epigenomics -- the study of modifications on the genetic material of a cell.

Using a neural network originally developed for computer vision, the researchers have built a deep learning toolkit that can help scientists study rare cell types -- and possibly identify mutations that make people more vulnerable to diseases.

The new deep learning toolkit, called AtacWorks, "allows us to study how diseases and genomic variation influence very specific types of cells of the human body," Nvidia researcher Avantika Lal, lead author on the paper, told reporters last week. "And this will enable previously impossible biological discovery, and we hope would also contribute to the discovery of new drug targets." 

AtacWorks, featured in Nature Communications, works with ATAC-seq -- a popular method for finding the parts of the human genome that are accessible in cells. 

Just about every cell in your body carries a copy of your genome sequence -- a sequence of your DNA about 3 billion bases long. However, only certain parts of that sequence are accessible in any given cell. Each cell type -- whether it's liver, blood or skin -- can access only the regions of DNA it needs for its function.

"That allows us to understand what makes every type of cell different from each other, or how every type of cell is affected in disease, or in other biological changes," Lal said.

ATAC-seq finds those accessible parts by producing a signal for every base in the genome. Peaks in the signal denote accessible regions of DNA. The method typically requires tens of thousands of cells of a given type to get a clean signal, which makes it challenging to study rare cell types, such as the stem cells that produce blood cells and platelets.
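To make the idea of "peaks in the signal" concrete, here is a minimal sketch in Python. It is not the peak-calling method used in ATAC-seq pipelines or in AtacWorks. It assumes a per-base signal stored as a plain list and treats any sufficiently long run of values at or above a threshold as an accessible region. The function name `call_peaks`, the threshold and the toy coverage values are all hypothetical.

```python
from typing import List, Tuple


def call_peaks(signal: List[float], threshold: float,
               min_width: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals where signal >= threshold.

    Runs shorter than min_width bases are discarded.
    """
    peaks = []
    start = None  # start index of the run currently being tracked
    for i, value in enumerate(signal):
        if value >= threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            if i - start >= min_width:
                peaks.append((start, i))
            start = None  # the run has ended
    # Close a run that extends to the end of the signal.
    if start is not None and len(signal) - start >= min_width:
        peaks.append((start, len(signal)))
    return peaks


# Toy per-base coverage values (made up for illustration).
coverage = [0, 1, 0, 5, 7, 6, 1, 0, 0, 4, 8, 9, 5, 0]
print(call_peaks(coverage, threshold=4, min_width=2))  # [(3, 6), (9, 13)]
```

With only a handful of cells, the real signal is sparse and noisy, so a simple threshold like this one would miss true peaks or report false ones. That is the gap a denoising model is meant to close.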

However, by applying AtacWorks to ATAC-seq data, the researchers found they could rely on just tens of cells, rather than tens of thousands. In the research described in their new paper, the Nvidia and Harvard scientists applied AtacWorks to a dataset of stem cells that produce red and white blood cells. They used a sample set of just 50 cells to identify distinct regions of DNA associated with cells that develop into white blood cells, as well as separate sequences that correlate with red blood cells.

AtacWorks is a PyTorch-based convolutional neural network that was trained on labeled pairs of matching ATAC-seq datasets -- one high quality and one "noisy." The model learned to predict an accurate, high-quality version of a dataset and identify peaks in the signal. 
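The training setup described above can be sketched in PyTorch roughly as follows. This is an illustrative toy, not the AtacWorks architecture or its training code: the class name `TinyDenoiser`, the layer sizes, the loss combination and the synthetic data are all assumptions made for the example. It shows a small 1D convolutional network that takes a noisy per-base signal and predicts both a cleaned-up signal and a per-base peak score, trained against a matching high-quality track.

```python
import torch
from torch import nn


class TinyDenoiser(nn.Module):
    """Toy 1D CNN with two outputs: denoised signal and peak logits."""

    def __init__(self, channels: int = 8, kernel_size: int = 9):
        super().__init__()
        pad = kernel_size // 2  # keeps the output the same length as the input
        self.body = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size, padding=pad),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.ReLU(),
        )
        self.signal_head = nn.Conv1d(channels, 1, kernel_size=1)  # regression
        self.peak_head = nn.Conv1d(channels, 1, kernel_size=1)    # classification

    def forward(self, noisy):
        features = self.body(noisy)
        return self.signal_head(features), self.peak_head(features)


torch.manual_seed(0)

# Synthetic "high-quality" track: one accessible region per example.
clean = torch.zeros(4, 1, 200)
clean[:, :, 80:120] = 5.0
peak_labels = (clean > 0).float()

# Synthetic "noisy" track: sparse counts, as if from very few cells.
noisy = torch.poisson(clean * 0.1 + 0.05)

model = TinyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
mse, bce = nn.MSELoss(), nn.BCEWithLogitsLoss()

losses = []
for step in range(50):
    optimizer.zero_grad()
    pred_signal, peak_logits = model(noisy)
    # Learn to reproduce the clean signal and to label peak positions.
    loss = mse(pred_signal, clean) + bce(peak_logits, peak_labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The two heads mirror the two things the article says the model learned to do: predict a high-quality version of the data and identify peaks in it.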

Running on Nvidia Tensor Core GPUs, the model took under 30 minutes for inference on a whole genome, a process that normally takes 15 hours on a system with 32 CPU cores.
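As a quick back-of-the-envelope check, those two figures imply a speedup of at least 30x. The snippet below just does that arithmetic. It treats 30 minutes as an upper bound on the GPU time, since the article says "under 30 minutes."

```python
cpu_hours = 15.0     # whole-genome inference on a 32-core CPU system
gpu_minutes = 30.0   # upper bound; the reported time is "under 30 minutes"

speedup = cpu_hours * 60 / gpu_minutes
print(f"at least {speedup:.0f}x faster")  # at least 30x faster
```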

Lal noted that the researchers were able to train the model on one type of cell and then apply it to a different type.

"That's a really wonderful thing because it means that we can train models using whatever data we have available and then apply it to entirely new biological samples," she said. 

The model could help deliver insights into a range of diseases, including cardiovascular disease, diabetes, Alzheimer's disease and other neurological disorders. It's available on the NGC Software Hub, Nvidia's hub of GPU-optimized software, where any researcher can access it.

"We are hoping that once our paper comes out, other scientists working with different diseases would also pick up this technique and be interested in using it," Lal said. "And we are excited to see what new research and new developments that can enable."
