Simply changing ICD-10 codes into a series of related numbers can make clinical data anonymous, so researchers can use it in population studies.
ICD-10 codes are standardized codes for medical conditions used both in billing for services and in clinical research.
The Vanderbilt technique appears to solve an important political problem. Scientists want to use clinical data in population studies to find the specific causes and best cures for disease, cross-referenced to genetic information. But patients rightly fear that their privacy could be compromised.
The technique works for both sides. A minimum number of matching records, called k, is set, below which privacy might be at risk. Until that number is reached, records are given multiple, related codes before they're reported -- a patient with Type I diabetes is listed as having both Type I and Type II, and vice versa.
The data is not scrambled. This is not encryption. Instead, any data that might lead to someone being identified is generalized inside a computer, so the identification can't be made.
Researchers, of course, would know this is being done, and in the example would not draw conclusions about differences between Type I and Type II diabetes, only conclusions relating to diabetes generally. Once k is exceeded, data could be reported with more specificity.
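For readers who think in code, here is a rough sketch of the idea in Python. It is not the Vanderbilt group's actual algorithm. The function name `generalize`, the `RELATED_GROUPS` table and the plain-text diagnosis labels are all made up for illustration, and the sketch assumes a diagnosis is safe to report on its own once at least k records carry it. It also only handles diagnoses that belong to a group.

```python
from collections import Counter

# Hypothetical group of related diagnoses; plain labels, not real ICD-10 codes.
RELATED_GROUPS = [("diabetes type I", "diabetes type II")]


def generalize(records, k):
    """Return one tuple of reported diagnoses per input record."""
    counts = Counter(records)
    group_of = {d: group for group in RELATED_GROUPS for d in group}
    reported = []
    for diagnosis in records:
        group = group_of.get(diagnosis)
        # If any diagnosis in the group appears, but fewer than k times,
        # report the whole group for every record in it.
        if group and any(0 < counts[member] < k for member in group):
            reported.append(group)
        else:
            reported.append((diagnosis,))
    return reported


records = ["diabetes type I"] * 2 + ["diabetes type II"] * 6
print(generalize(records, k=5)[0])  # both diabetes labels
print(generalize(records, k=2)[0])  # only the specific label
```

Note that the Type II patients get the combined label too when Type I is rare. If only the rare records were relabeled, the relabeling itself would give them away.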
Vanderbilt is part of the Electronic Medical Records and Genomics (eMERGE) Network, a nationwide alliance of research institutions looking to combine genetic and clinical data. (The illustration is from the group's home page. The group is based at Vanderbilt.)
One of their big efforts is the GWAS project, which aims to ethically combine genetic and clinical databases for common disorders that may have a genetic basis, like cataracts and dementia.
That project, and many others like it, could now get the go-ahead. Go Commodores.
UPDATE: You may ask, as I did, well, what's the value of k? That's in the eye of the beholder, says Prof. Malin.
Statistical agencies, such as the U.S. Census Bureau, have tended to lean towards parameterizations that would suggest we use k=5.
The value of k, in other words, is whatever value you think you need in order to guarantee privacy.
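To see how much the choice of k matters, here is a small Python sketch that checks whether a list of reported values would pass for a given k. The function name `satisfies_k` and the sample data are hypothetical, and the default of 5 is simply the figure mentioned above, not a recommendation.

```python
from collections import Counter


def satisfies_k(reported, k=5):
    """True if every reported value is shared by at least k records."""
    if not reported:
        return True  # nothing reported, nothing to identify
    return min(Counter(reported).values()) >= k


# Made-up sample: 8 records with one label, 3 with another.
reported = ["diabetes"] * 8 + ["cataract"] * 3
print(satisfies_k(reported))       # checked against k=5
print(satisfies_k(reported, k=3))  # checked against a looser k=3
```

The same data fails at k=5 and passes at k=3, which is the point: the threshold is a judgment call.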