Understanding more about diabetes is key to treating and preventing the disease. Germany's DZD (German Center for Diabetes Research) is using graph database technology from Neo4j to learn more about the illness. ZDNet spoke to DZD's head of data and knowledge management, Dr Alexander Jarasch, to find out more.
ZDNet: Can you tell me about your organisation?
Jarasch: We are a non-profit organisation doing research on diabetes. We are not developing drugs but are interested in research into the prevention of diabetes and the treatment of diabetic patients.
We combine different entities or different disciplines, which means that in hospitals we are connecting basic research and adding models to it. And in this research we are using graph technology.
Here in the UK diabetes is massively on the increase, is it a similar story in Germany?
It is similar, as it is in the US. Roughly 10 percent of the population are diabetic. In children it is mostly Type 1 diabetes, and it was always assumed that Type 2 diabetes mainly affected older people. But it turns out that two-thirds of people with diabetes are of working age. Diabetes is a critical disease for people of working age: those with diabetes are less productive, they get sick much earlier, and they have lots of complications like strokes or heart attacks. Obviously, this is an area where we are doing a lot of research.
So where does Neo4j's software fit into this?
In biology or medicine, data is connected. You know that entities are connected -- they are dependent on each other. That is the reason why we chose graph technology and Neo4j: all the entities are connected.
And we have our data in various relational databases but we wanted to build a new layer on top of these different datasets or databases in order to gain a much wider knowledge of diabetes and to see the metabolic disease from many different perspectives simultaneously.
What particular aspects of diabetes are you looking at?
We are connecting patients' data, so we have prevention and lifestyle intervention data from the clinics. We have bio-samples -- samples of blood, urine, liver samples, kidney samples and things like that -- from which we can measure specific parameters, and we have animal models like mice or pre-diabetic pigs. Then we have all the basic research -- genomics, lipidomics and so on. So it's all kinds of research around specific molecule classes.
You obviously collect a lot of data covering a lot of areas. How do you bring this down into a form that you can use?
We started with Neo4j about one year ago and tried to connect these different disciplines -- hospitals, basic research, animal models and so on. We tried to connect them on a very simple data model that is valid for all the different disciplines and locations, and to learn from these different techniques together in a new way, because nowadays one discipline is no longer sufficient to answer a biomedical question.
What area do you find the most interesting?
Personally, I think the biomarker findings are the most interesting. You can use graph technology and very modern machine-learning techniques to provide better prevention or treatments for diabetic patients. That's what's driving me.
The second point I find really interesting is that with graph technology you can not only connect data; our medical doctors and researchers, who are not computer scientists, also find it much easier to look at the data models, because after some short practice they are able to decipher the data. With graph technology it's very easy to visualise the data. We use Neo4j's visualisation browser to visualise the data and spot the areas that are connected to each other. Users like it because it's so intuitive.
Are there other aspects that you find particularly useful?
Absolutely. Particularly when we have these different data points from all the different disciplines. We can ask the graph: "Are any of these nodes connected to these different nodes?"
That's very interesting, because some of these relationships and connections we may never have seen before. In some cases it's like linking hundreds and hundreds of Excel sheets or relational databases together.
On the other hand, there are areas where we do not see any connections of the data, but with graph technology it is so easy to connect them and find the connections or relations between them.
And this is what makes it easy: so many graph algorithms are ready to use, and you can run them easily with the [Neo4j] APOC library or different Cypher queries.
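To give a flavour of the kind of question Jarasch describes -- "are any of these nodes connected to these other nodes?" -- here is a minimal sketch in plain Python of a connectivity check over a toy graph. The node names are invented for illustration; in Neo4j itself this would be expressed as a Cypher variable-length path query (something like `MATCH p = (g:Gene {name:'ABC'})-[*..4]-(d:Disease {name:'diabetes'}) RETURN p`) or an APOC path-expansion procedure, not a hand-written traversal.

```python
from collections import deque

def connected(graph, start, goal):
    """Breadth-first search: is there any path from start to goal?"""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False

# Toy graph with hypothetical node names: a gene linked to a
# disease via a protein, plus an unrelated bio-sample record.
toy_graph = {
    "GENE_ABC": ["PROTEIN_X"],
    "PROTEIN_X": ["TYPE_2_DIABETES"],
    "LIVER_SAMPLE_17": ["PATIENT_42"],
}

print(connected(toy_graph, "GENE_ABC", "TYPE_2_DIABETES"))      # True
print(connected(toy_graph, "LIVER_SAMPLE_17", "TYPE_2_DIABETES"))  # False
```

The point of a graph database is that this kind of reachability question is a one-line query rather than a join across "hundreds and hundreds" of tables.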
What other areas can you use it for?
Besides our organisation there are five other German organisations studying cancer, Alzheimer's, infectious diseases, lung diseases and so on. And there are already connections between these different diseases. And this is where you find connections between Alzheimer's and diabetes. There are also connections between cancer and diabetes.
We want to also study where a disease comes from. Where do these complications with diabetes come from and are they related to Alzheimer's and other complications?
For this we developed a small prototype using natural language processing, and this is also a very interesting topic because we have a public database of peer-reviewed articles. There are over 30 million scientific items from different publications there. Of course, nobody in the world can read all this -- you can't stay up to date every day or every week. So we try to analyse those texts according to our specific biomedical questions.
We want to analyse these texts automatically with natural language processing, and we want to learn specific words or keywords like "gene". Is there a gene called ABC co-appearing in lots of texts linked with diabetes? Or are there many genes that are co-mentioned with cancer and diabetes?
When we learn from these texts automatically, we can feed them into our graph for genes and proteins and then we can write a pipeline to study questions like: "Is this gene related to some data that we have in our organisation?"
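The co-occurrence idea described above can be sketched in a few lines of Python. The abstracts and gene names below are invented placeholders, and a real pipeline over millions of publication records would use proper named-entity recognition rather than a regular expression; this only illustrates the counting step that feeds genes and their co-mention counts into the graph.

```python
import re
from collections import Counter

# Hypothetical abstracts standing in for the ~30 million real records.
abstracts = [
    "Gene ABC is upregulated in type 2 diabetes patients.",
    "We found no link between gene XYZ and diabetes.",
    "Gene ABC expression correlates with insulin resistance and diabetes.",
    "Gene XYZ mutations are common in several cancers.",
]

def co_mentions(texts, disease, gene_pattern=r"gene\s+([A-Z]+)"):
    """Count how often each gene name appears in a text that
    also mentions the given disease term."""
    counts = Counter()
    for text in texts:
        if disease.lower() in text.lower():
            for gene in re.findall(gene_pattern, text, flags=re.IGNORECASE):
                counts[gene.upper()] += 1
    return counts

print(co_mentions(abstracts, "diabetes"))  # Counter({'ABC': 2, 'XYZ': 1})
print(co_mentions(abstracts, "cancer"))    # Counter({'XYZ': 1})
```

Each (gene, disease, count) triple could then become a weighted relationship in the graph, ready to be joined against the organisation's own experimental data.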
We also connect public data with data held inside our organisation. Often this information is protected under the GDPR, so we have patient data that we can't make public, but within our organisation we can link this data via graph technology.
This is a lot of information. Do you have any idea of the size of the files you are dealing with?
That's a tricky question. The smallest datasets are tiny, but our largest files, such as high-resolution microscopy images, run between 40 and 200 GB per dataset, and we have hundreds or thousands of them.