The prospect of storing vast amounts of data on DNA has come closer to reality thanks to a new technique for retrieving data.
Microsoft is keen on synthetic DNA as a future long-term archival medium that could solve the world's need for more data storage. Previous research has shown that just a few grams of DNA can store an exabyte of data and keep it intact for up to 2,000 years.
The drawback is that writing data to DNA is expensive and extremely slow. Writing involves converting 0s and 1s into sequences of the four DNA bases: adenine, thymine, cytosine, and guanine. Getting data back involves sequencing the DNA and decoding it into 0s and 1s again. Finding and retrieving specific files stored on DNA is also a challenge.
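To make the conversion between bits and bases concrete, here is a minimal sketch that packs two bits into each base. The mapping table and the `encode` and `decode` helpers are assumptions chosen for illustration. The article does not describe the researchers' actual encoding, and a real scheme would also need error correction.

```python
# Hypothetical mapping: two bits per base. Illustration only, not the
# encoding used by Microsoft Research or the University of Washington.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of bases, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Turn a string of bases back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
restored = decode(strand)
```

At two bits per base, each byte becomes four bases, so the two-byte input above yields an eight-base strand.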
As scientists from Microsoft Research and the University of Washington explain, without random access, meaning the ability to selectively retrieve files from DNA storage, you'd need to sequence and decode an entire dataset to find and retrieve the few files you want. Random access would reduce the amount of sequencing that needs to be done.
To achieve random access on DNA, they created a library of 'primers', short sequences that are attached to each DNA strand. Used with polymerase chain reaction (PCR), the primers act as targets for selecting only the desired snippets of DNA.
"Before synthesizing the DNA containing data from a file, the researchers appended both ends of each DNA sequence with PCR primer targets from the primer library," the University of Washington explains.
"They then used these primers later to select the desired strands through random access, and used a new algorithm designed to more efficiently decode and restore the data to its original, digital state."
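The selection step in the quote can be pictured in software as tagging each strand with a file-specific primer pair and then filtering on that pair. The sketch below is only an analogy under stated assumptions. The primer sequences, file names, and the `tag` and `select` helpers are all hypothetical, and in the lab PCR amplifies the matching strands chemically rather than filtering a list.

```python
# Hypothetical primer library: one (front, back) pair per file.
# The sequences and file names are made up for illustration.
PRIMERS = {
    "video.mp4": ("ACGTAC", "TTGACA"),
    "notes.txt": ("GGATCC", "CATGCA"),
}


def tag(file_name: str, payloads: list[str]) -> list[str]:
    """Append the file's primer targets to both ends of each payload strand."""
    front, back = PRIMERS[file_name]
    return [front + payload + back for payload in payloads]


def select(pool: list[str], file_name: str) -> list[str]:
    """Keep only strands carrying the file's primers and strip the primers off."""
    front, back = PRIMERS[file_name]
    return [
        strand[len(front):-len(back)]
        for strand in pool
        if strand.startswith(front) and strand.endswith(back)
    ]


# All files share one mixed pool, as strands would in a test tube.
pool = tag("video.mp4", ["AAAA", "CCCC"]) + tag("notes.txt", ["GGGG"])
selected = select(pool, "notes.txt")
```

Only the strands tagged for the requested file come back, so the rest of the pool never has to be sequenced or decoded.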
Microsoft senior researcher Sergey Yekhanin said the new decoding algorithms are more tolerant of errors in writing and reading DNA sequences, which cuts the sequencing and processing needed to recover information.
While this is not the first time random access on DNA has been achieved, it is the first time it has been done at this scale, according to the researchers.
The researchers encoded a record 200MB of data in synthetic DNA, consisting of 35 files ranging in size from 29kB to 44MB. The files contained high-definition video, audio, images, and text.
Since releasing the paper describing the technique, they've also encoded and retrieved files from 400MB of data on DNA.
The researchers believe the approach they have used for random access will scale to physically isolated pools of DNA containing several terabytes each.