The IT behind the Cancer Genome Atlas

Written by Larry Dignan, Contributor

This month the National Cancer Institute's Cancer Genome Atlas will start receiving tissue samples which will be used to map the genetic data embedded in cancer cells. The side effect of this effort? An avalanche of data.

The Cancer Genome Atlas, which was announced in 2005, is designed initially to map the genomic changes of lung, brain and ovarian cancers. Partners include academic, commercial and governmental groups. Building on the Human Genome Project, which turned 10 years old in January, the Cancer Genome Atlas will up the ante in genetic research. That means the data that will need to be crunched will increase exponentially. Technically still in pilot, the Cancer Genome Atlas, which has come under fire in some quarters, should start yielding results in the next year or two.

These genetic research efforts wouldn't be possible without advances in computer science and information technology architecture. The Cancer Genome Atlas relies on open source technology, grid computing and extensive data modeling. The National Cancer Institute also had to set standards on the semantics of genetic information, create processes to manage the information, and make sense of multiple variables.

I spoke with Dr. Peter Covitz, chief operating officer of bioinformatics at the National Cancer Institute. Here's the Q&A:

Why is mapping the cancer genome harder than mapping the human genome?

While the Cancer Genome Atlas is really the most natural follow-on to the Human Genome Project, there's a significant expansion of complexity. The complexity is driven by the complexity of cancer and genes. There are many more technical issues on the biology and human side, and cancer is very much a moving target. The Human Genome Project had to measure the letters of DNA and their sequence. That's just one of the tools in the Cancer Genome Atlas. There are other types of tools and data to combine and measure. The sequence of DNA is not the only way to measure cancer. We have to measure the sequence of the codes, the expression of the genes and track how genes are turned on and off. We have lab tools that can do that measurement today. In fact, the lab technology available is a derivative of what was used in the Human Genome Project.

This will generate reams of data and lots of experiments. At first we'll have 20,000 tumor samples and they all have to be compared to normal tissue. There will be more data with diversity. And no one method of analysis will give you a complete picture.

So you have an avalanche of data to sort. How will you manage it?

That's where the informatics will come in. There is a substantial data management component to this. We built a data architecture to track data sets across samples, tumor types vs. normal, and their gene expression. It's a data analysis challenge. Each of these measurement devices has its own data type.

The challenge is getting an agreement on standards for formats, files and data. That's the easier problem. The harder problem is data annotation and semantics. (When Covitz refers to semantics he's referring to the language used to describe the data. The goal is to find common ways to describe concepts that may appear in the tissue samples.)

Do you have to build those data definitions from scratch?

We can leverage what we learned from caBIG [cancer Biomedical Informatics Grid] and apply it. CaBIG was a program I managed to focus on data standards and semantic issues. It moves well beyond a simple dictionary and is a more sophisticated approach to defining data. The system treats the semantic problem as one of concepts, not just terms. You can use one or more words to describe a concept.
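(To make the concepts-versus-terms idea concrete, here is a minimal Java sketch. It is not caBIG code: the class name `ConceptRegistry`, its methods and the concept identifiers are all hypothetical, and it assumes nothing more than a lookup table in which several terms point at one shared concept.)

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: many terms resolve to one shared concept identifier.
class ConceptRegistry {
    // Maps a normalized term to a concept identifier (identifiers are made up).
    private final Map<String, String> termToConcept = new HashMap<>();

    // Register one or more terms as descriptions of the same concept.
    void register(String conceptId, String... terms) {
        for (String term : terms) {
            termToConcept.put(normalize(term), conceptId);
        }
    }

    // Look up the concept behind a term, ignoring case and extra whitespace.
    Optional<String> conceptFor(String term) {
        return Optional.ofNullable(termToConcept.get(normalize(term)));
    }

    // Two terms are equivalent when both are known and resolve to the same concept.
    boolean sameConcept(String a, String b) {
        Optional<String> conceptA = conceptFor(a);
        return conceptA.isPresent() && conceptA.equals(conceptFor(b));
    }

    private static String normalize(String term) {
        return term.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }
}
```

(Registering "glioblastoma", "glioblastoma multiforme" and "GBM" under one identifier would let records from different centers be matched even when each center uses a different label. A production vocabulary service would carry far more than this, such as definitions and relationships between concepts.)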

CaBIG is three years old now and is maturing into a true platform for data management infrastructure.

That (caBIG's data structure) will help us in a lot of ways. We will get data about tissue banks, patients, and their samples and will have to manage that information and then correlate it. All the information will come from different centers and computer systems, but will be eventually integrated. The key is making sure the information can be integrated.

We've already got standards nailed down for the data structure. We've completed a dry run with bogus samples and ran them through the pipeline to collect data and gene expression. (Note: Gene expression is the process by which a gene's coded information is converted into operating structures in the cell.) This data was collected and sent to a data coordinating center. We just completed the dry run and it looks like it works. Real data generation starts in April.

Overall, though, you need to think of federated architecture because you can't think of one schema for everything. You need to optimize for each grouping of data that's most complementary. You also need to take into account more than one database and system.
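(A federated design of the kind Covitz describes can be sketched in a few lines of Java. Everything here is an illustrative assumption rather than the project's actual design: the interface `SampleSource`, the two source classes and `FederatedQuery` are hypothetical names. The point is only that each source keeps its own schema and answers through a shared interface.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Common interface: each source answers in shared terms, whatever its own schema.
interface SampleSource {
    List<String> sampleIdsFor(String tumorType);
}

// Hypothetical source whose native layout is sample id -> tumor type.
class TissueBankSource implements SampleSource {
    private final Map<String, String> tumorTypeBySample;

    TissueBankSource(Map<String, String> tumorTypeBySample) {
        // TreeMap keeps sample ids sorted so results come back in a stable order.
        this.tumorTypeBySample = new TreeMap<>(tumorTypeBySample);
    }

    public List<String> sampleIdsFor(String tumorType) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> entry : tumorTypeBySample.entrySet()) {
            if (entry.getValue().equals(tumorType)) ids.add(entry.getKey());
        }
        return ids;
    }
}

// Hypothetical source whose native layout is rows of {tumorType, sampleId}.
class ExpressionSource implements SampleSource {
    private final List<String[]> rows;

    ExpressionSource(List<String[]> rows) {
        this.rows = rows;
    }

    public List<String> sampleIdsFor(String tumorType) {
        List<String> ids = new ArrayList<>();
        for (String[] row : rows) {
            if (row[0].equals(tumorType)) ids.add(row[1]);
        }
        return ids;
    }
}

// The federation layer fans a query out to every source and merges the answers.
// It owns no schema of its own.
class FederatedQuery {
    private final List<SampleSource> sources;

    FederatedQuery(List<SampleSource> sources) {
        this.sources = sources;
    }

    List<String> sampleIdsFor(String tumorType) {
        List<String> merged = new ArrayList<>();
        for (SampleSource source : sources) {
            merged.addAll(source.sampleIdsFor(tumorType));
        }
        return merged;
    }
}
```

(Adding another center then means writing one more `SampleSource` implementation, not redesigning a single global schema.)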

What software will be most important to combine the information?

We're taking a middleware approach. Each piece of data has a corresponding middleware component. Middleware is the programming interface and the way data will be represented. It's the only strategy (that will work for this project). We're also using grid technology as an approach to high performance computing. It has to be a vast virtual system connecting all the parts of the Cancer Genome Atlas program.

The data coordinating center will be responsible for taking feeds from other centers in the program, coordinating them and posting them for public distribution.

Overall, we're very open source oriented. We have a stack that's mostly based on open source. JBoss is the application server container that we run the caBIG middleware inside of. Most software development is done in Java, but our standards work with any language. We've also taken a model-driven approach. We used UML (Unified Modeling Language) to model the middleware and generate software code. Any object oriented programmer can get the gist of what we've done.

To connect to the grid, though, we've turned Java objects into XML formats. We didn't want to be schlepping Java objects. We wanted to be more platform neutral, so we convert the Java objects into the neutral XML format.
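(The interview does not say which tooling performs that conversion. As a rough illustration only, the Java sketch below turns a small object into XML with the DOM and transformer APIs that ship with the JDK. The class `ExpressionMeasurement`, its fields and the element names are hypothetical and do not reflect the project's real schema.)

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical domain object; field names are illustrative only.
class ExpressionMeasurement {
    final String sampleId;
    final String gene;
    final double level;

    ExpressionMeasurement(String sampleId, String gene, double level) {
        this.sampleId = sampleId;
        this.gene = gene;
        this.level = level;
    }
}

class XmlExporter {
    // Build a DOM tree from the object, then serialize it to an XML string.
    static String toXml(ExpressionMeasurement m) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();

            Element root = doc.createElement("expressionMeasurement");
            root.setAttribute("sampleId", m.sampleId);
            doc.appendChild(root);

            Element gene = doc.createElement("gene");
            gene.setTextContent(m.gene); // special characters are escaped on output
            root.appendChild(gene);

            Element level = doc.createElement("level");
            level.setTextContent(Double.toString(m.level));
            root.appendChild(level);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("could not serialize measurement", e);
        }
    }
}
```

(Any consumer that can parse XML, in whatever language, can read the result, which is the platform neutrality Covitz is after.)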

Is there anything on the software side that you could buy off the shelf?

We had to build a lot ourselves because of the data complexity. If something is lacking from business IT, it is attention to the diversity of data in biomedicine.

Business IT optimizes everything for a simplistic set of data models. Semantics are not on anyone's mind. The inattention to the special needs of biomedicine revolves around complexity.

It sure would be terrific if the business IT sector took what we've done and saw it as a starting point to get into biomedicine. Vendors need to think of it as an economic center instead of something they can just get to later. It's a pretty rich environment for applying IT. This sector is only going to grow. 

What's your time frame to yield results and do you have enough funding?

We hope to have the first of the tumor types worked through in the next year or year and a half. In two years, we'll have most of the output from the pilot and wrap it up three years from now.

The samples will start hitting in April from hospitals, then pathologists will process them to extract molecules for RNA samples. They will also be sent to other centers in the program. From there, the data will start flowing, but it will be several months before it appears in the data coordinating center.

The Cancer Genome Atlas is considered to be a pilot. We're looking at just three cancer types--brain, lung and ovarian cancer--in order to iron out the issues. If the data is useful, hopefully we'll scale it up, but we will require more funding. We received $100 million in funding over three years. Of the costs, lab technology represents the bulk of it, but that's because we could leverage previous efforts. CaBIG had $100 million in funding over five years. In our budget, a few percent (3 or 4 percent) is dedicated to IT.

How will samples be tracked?

The tubes and trays are bar coded, but we will still have an interesting problem. We have two separate samples from one individual--the tumor and normal tissue. So one of the first quality checks is to ensure that the normal tissue and tumor belong to the same person. The way to analyze that is at the DNA level. We call it molecular bar-coding. If the regular bar code on the tube is wrong we can check the DNA. This tracking data will be for internal use.
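(Covitz does not describe how the DNA check is computed. One plausible reading, offered here purely as an assumption, is to compare genotype calls at a panel of marker sites in the two samples and require a high rate of agreement. In the Java sketch below, the class `MolecularBarcode`, the marker names and the threshold are all hypothetical.)

```java
import java.util.Map;

// Hypothetical check: do the tumor and normal samples share the same genotype?
class MolecularBarcode {
    // Fraction of marker sites typed in both samples where the calls agree.
    static double concordance(Map<String, String> tumor, Map<String, String> normal) {
        int compared = 0;
        int matching = 0;
        for (Map.Entry<String, String> site : normal.entrySet()) {
            String tumorCall = tumor.get(site.getKey());
            if (tumorCall == null) continue; // site not genotyped in the tumor
            compared++;
            if (tumorCall.equals(site.getValue())) matching++;
        }
        if (compared == 0) {
            throw new IllegalArgumentException("no shared marker sites");
        }
        return (double) matching / compared;
    }

    // The threshold is an illustrative assumption. Tumor DNA carries acquired
    // changes, so a perfect match with normal tissue is not required.
    static boolean samePerson(Map<String, String> tumor,
                              Map<String, String> normal,
                              double minConcordance) {
        return concordance(tumor, normal) >= minConcordance;
    }
}
```

(A pair that falls below the chosen threshold would be flagged for review before any tumor-versus-normal comparison is trusted, regardless of what the printed bar code says.)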

What's your hardware footprint? You obviously will need some serious computing power to crunch the data.

We have a mixed environment with Linux systems based on Intel. Some Sun. Some HP. Computing is less of a focus for us because we have sites that we can partner with for high performance computing. We're partnering to get access to more mature grid projects that are focused on computational computing. The idea is that we'd make their hardware a node on our grid and we could send an analysis job to their machines.

We try to remind ourselves that we're the National Cancer Institute not the National High Performance Computing Institute. We're not scaled up for computational horsepower. There are participants in our network with separate lines of funding for high performance computing. They have suggested that for a few dollars more we can use their grid. The cost of on-demand computing continues to drop.

What privacy protections do you have to shuffle this data around?

We have a fairly strict and elaborate regime to protect privacy. We're adhering to the rules and regulations that govern genetic data and HIPAA (Health Insurance Portability and Accountability Act of 1996). We also have additional controls and authorization processes to mitigate risk.

We also have institutional review boards to limit how data is used. Under HIPAA, information is only shared for legitimate research and that goes through a data access committee. Even then we still will not share patient IDs. There will be some detailed annotations shared but only with individuals who pass muster with the data access committee.
