New file format helping researchers reduce DNA analysis time

Processing data produced from DNA nanopore sequencing now takes half a day, instead of two weeks.
Written by Aimee Chanthadavong, Contributor

The University of New South Wales and the Garvan Institute of Medical Research have developed a new computer file format to speed up nanopore sequencing analysis and improve specialised treatments for patients with cancer and other diseases.

Published in Nature Biotechnology, the research said the newly developed SLOW5 format can process complex DNA nanopore sequencing "more than 30 times faster" than the previous file format called -- ironically -- FAST5.

Nanopore sequencing is used to identify a range of diseases and help medical professionals analyse DNA samples in detail so they can provide tailored treatments for cancer patients.

The data produced from this process was routinely recorded in the FAST5 file format, which produced large files of around 1.3 terabytes, equivalent to roughly 650 hours of high-definition video. Because of their size, it would take computers two weeks to process the FAST5 files, the researchers said.

However, lead author and Garvan Institute genomics computing systems engineer Hasindu Gamaarachchi said that with SLOW5, processing the data for a human genome takes half a day.

He explained that, unlike FAST5, the SLOW5 format enables parallel computing, whereby several processors simultaneously execute multiple smaller analyses broken down from a larger, complex, complete dataset.

"You can think of this like trying to dig a very big hole with 10 people, but there is only one shovel they have to share round. That's how it used to be with FAST5," he said.

"But with SLOW5 everyone gets their own shovel, and they can all dig at the same time and do the job much faster.

"The FAST5 format is slow because the data cannot be accessed in parallel. It is based around the Hierarchical Data Format which was designed in the 1990s to work on machines which at the time only had one processor, rather than the modern ones which include multiple processors.

"The Hierarchical Data Format is also generic, whereas the SLOW5 is purpose-built. So in terms of the digging analogy, it's like we are also providing a shovel that is specially designed for the type of soil. And because the new SLOW5 can be accessed in parallel by multiple processors at the same time, the processing time has reduced by a factor of 30."
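
The "everyone gets their own shovel" idea can be illustrated with a minimal Python sketch. It is not the SLOW5 software or its API: the `analyse_read` and `analyse_all` functions and the sample data are hypothetical, and a thread pool from Python's standard library stands in for the multiple processors described above (for genuinely CPU-heavy work, a process pool would be the closer match). The point is only that when each record can be read on its own, the work can be handed out to several workers at once.

```python
from concurrent.futures import ThreadPoolExecutor


def analyse_read(signal):
    """Hypothetical per-read analysis: here, just the mean signal level."""
    return sum(signal) / len(signal)


def analyse_all(reads, workers=4):
    """Analyse every read independently, sharing the work across a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each read to a free worker and returns results in input order.
        return list(pool.map(analyse_read, reads))


# Made-up signal values standing in for three sequencing reads.
reads = [[1, 2, 3], [10, 20, 30], [5, 5, 5, 5]]
results = analyse_all(reads)
print(results)
```

Because no read depends on any other, the result is the same whether one worker or many do the digging; only the elapsed time changes.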
