Biotech data mining

In the last ten years, biotech companies have been busy accumulating mountains of data. Because it's becoming more and more difficult to find useful information in all this data, the European Union has started the BioGrid project. Many of the tools developed by BioGrid are available for public use.
Written by Roland Piquepaille

In the last ten years, biotech companies have been busy accumulating mountains of data. And it's becoming more and more difficult to find useful information in it, for example about interactions between genes and proteins. It's one of the reasons why the European Union has started the BioGrid project. In "Mining biotech's data mother lode," IST Results describes this project. Among the current results, the researchers involved have delivered tools that analyze over-expressing genes and predict the protein interactions that are likely occurring, as well as a better search engine for PubMed. And many of the tools developed by BioGrid are available for public use, so you can try them yourself.

Here is the introduction from IST Results.

An EU-sponsored project has developed a suite of tools that will enable biotech companies to mine through vast quantities of data created by modern life-science labs to find the nuggets of genetic gold that lie within. The BioGrid project brought together six partners from the UK, Germany, Cyprus and The Netherlands to address one of the key problems facing the life sciences today.
"How to integrate the huge volume of disparate data -- on gene expression, protein interactions and the vast output of literature both inside and outside laboratories -- to find out what is important," says Dr Michael Schroeder, Professor of the Bioinformatics group at Dresden Technical University and coordinator of this IST-funded project.

Below is an illustration showing the crystal structure of a protein in the active state bound to guanosine triphosphate, GTP (Credit: BioGrid project).

Crystal structure of a GTP-bound protein

This picture has been extracted from a paper named "How to query the GeneOntology" (PDF format, 10 pages, 323 KB) published in 2005 in the Proceedings of KRBIO'05, a symposium on knowledge representation in bioinformatics.

Here is a description of how the software works.

One element of the software suite analyses over-expressing genes discovered during microarray experiments to establish what proteins become encoded. This uses standard techniques.
A second analysis tool in the suite predicts what possible protein-protein interactions are taking place. This is novel. When a gene encodes a protein, the protein folds up into a unique shape, forming a 3D structure. This structure can only interact, or fit, with some proteins, but not others, like pieces of a jigsaw puzzle.
BioGrid's protein interaction software includes a database of the 20,000 known protein structures and uses that database to identify which ones could potentially interact, among the thousands of proteins created by the over-expressing genes.
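
As a rough illustration of the lookup described above, here is a minimal Python sketch. It is not BioGrid's actual software: the protein names, the fold labels, and the table of compatible fold pairs are all hypothetical toy data, and a simple set lookup stands in for real 3D structure comparison.

```python
from itertools import combinations

# Hypothetical toy "structure database": pairs of structural folds that are
# assumed to fit together. A real system would derive this from 3D structures.
KNOWN_INTERACTING_FOLDS = {
    frozenset({"fold_A", "fold_B"}),
    frozenset({"fold_C", "fold_D"}),
}

# Hypothetical mapping from each protein to its structural fold.
PROTEIN_FOLD = {
    "P1": "fold_A",
    "P2": "fold_B",
    "P3": "fold_C",
    "P4": "fold_D",
    "P5": "fold_E",
}


def candidate_interactions(proteins, protein_fold, known_pairs):
    """Return protein pairs whose folds appear together in known_pairs."""
    candidates = []
    # Sort for a deterministic order, then test every unordered pair once.
    for a, b in combinations(sorted(set(proteins)), 2):
        fold_a = protein_fold.get(a)
        fold_b = protein_fold.get(b)
        if fold_a is None or fold_b is None:
            continue  # no known structure for one of the proteins, so skip
        if frozenset({fold_a, fold_b}) in known_pairs:
            candidates.append((a, b))
    return candidates


# Proteins produced by the over-expressing genes (toy input).
expressed = ["P1", "P2", "P3", "P4", "P5"]
pairs = candidate_interactions(expressed, PROTEIN_FOLD, KNOWN_INTERACTING_FOLDS)
print(pairs)
```

The point of the sketch is only the filtering step: out of all possible pairs among the expressed proteins, keep those whose shapes are known to fit, like the jigsaw pieces in the description.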

The researchers have also used the Gene Ontology (GO), a controlled vocabulary describing genes and the biological processes they take part in, to mine through the 15,000,000 entries of PubMed.
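
To give an idea of what an ontology-based search involves, here is a minimal Python sketch that tags abstracts with vocabulary terms and groups them by term. It is not GoPubMed's algorithm: the three-term vocabulary, the sample abstracts, and the word-overlap score are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Toy vocabulary standing in for the Gene Ontology (illustrative only).
VOCABULARY = [
    "enzyme inhibitor activity",
    "protein binding",
    "signal transduction",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_terms(abstract, vocabulary):
    """Score each term by the percentage of its words found in the abstract.

    This word-overlap score is a simple stand-in for a real accuracy measure.
    """
    words = tokenize(abstract)
    scored = []
    for term in vocabulary:
        term_words = tokenize(term)
        matched = len(term_words & words)
        if matched:
            scored.append((term, round(100 * matched / len(term_words))))
    # Highest score first, ties broken alphabetically.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def group_by_term(abstracts, vocabulary, min_score=100):
    """Map each term to the ids of abstracts matching it at min_score or above."""
    index = defaultdict(list)
    for doc_id, text in abstracts.items():
        for term, score in score_terms(text, vocabulary):
            if score >= min_score:
                index[term].append(doc_id)
    return dict(index)


# Hypothetical abstracts, not real PubMed entries.
abstracts = {
    "a1": "Levamisole shows enzyme inhibitor activity in vitro.",
    "a2": "The protein binding assay revealed signal transduction defects.",
}
index = group_by_term(abstracts, VOCABULARY)
print(index)
```

Grouping the abstracts by term is what lets a reader browse search results through the vocabulary instead of scrolling through one long list of papers.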

Below is a screenshot showing the user interface of GoPubMed. "This screenshot displays the results for the query 'levamisole inhibitor' limited to 100 papers. On the left, part of the GO relevant to the query is shown and on the right the abstracts for a selected GO term. The search terms are highlighted online in orange and the GO terms in green. Right of each abstract is a list with all the GO terms for that abstract ordered by an accuracy percentage" (Credit: BioGrid project).

The user interface of GoPubMed

This picture has been extracted from a paper named "GoPubMed: Exploring PubMed with the GeneOntology" (PDF format, 4 pages, 594 KB) published in 2005 by 'Nucleic Acids Research'.

And the results of this project are available to everyone, even if there are some limitations for non-paying users. For example, the ontology-based search is available at GoPubMed.org, while the protein interaction database is at Scoppi.org. You can also check the Gene Ontology for more details.

Sources: IST Results, December 20, 2005; and various web sites
