Big data a SURE thing for healthcare

Difficulty sharing and analysing Australia's healthcare data has kept researchers from applying big-data techniques to it, but that's about to change.

Australia has one of the most comprehensive collections of population data on healthcare, but until now, the ability to analyse it has been limited. The data is distributed among various sources, and the systems that hold these datasets are typically unsuitable for complex analysis.

However, the Sax Institute has today launched a project called the Secure Unified Research Environment (SURE) that aims to overcome both limitations by providing a central datacentre where researchers can form connections between data sources and access the necessary computing power to perform big-data analysis. Health researchers will be able to securely access health information provided by hospitals, cancer registries, clinical trials, general practices and research studies.

The project was funded by the Department of Industry, Innovation, Science, Research and Tertiary Education, as well as the NSW Government.

While the environment was only officially launched today, it has already been used to analyse how consistently Australians visit the same general practitioner, known as "consistency of care", which has been linked to the long-term quality of healthcare.

Previously, consistency of care would have been analysed by looking at data from the Medicare Benefits Schedule to determine who does or doesn't use the same practitioner, but this alone doesn't provide much insight. By combining the Medicare data with the Sax Institute's own "45 & Up" study, as well as data from the Pharmaceutical Benefits Scheme, researchers were able to discover more.

For instance, the analysis was able to confirm the intuitive beliefs that the older Australians get, the more likely they are to have higher consistency of care, and that remote areas of Australia have lower consistency of care. However, it also revealed that wealthier and more highly educated Australians have lower consistency of care — a finding that would not have been possible to determine by looking at a single dataset alone.

Privacy in SURE

Any time large sets of data are linked and mined for information, privacy concerns rise to the forefront. Recognising this, SURE separates personally identifiable information from the content of health records, such as medication information or hospital-admission information.

Organisations called data-linkage units receive encrypted record ID numbers and personal identifying information from sources, but no healthcare records. The linkage units then create their own unique, randomly generated IDs for each person. When a research project requires the data, these randomly generated IDs are used as a basis to create project-specific IDs.

Once the data-linkage units and the data source agree on what information a project is allowed, the data-linkage unit sends the encrypted record ID numbers as well as the new set of project-specific IDs to the source. The data source is then able to decrypt the record ID numbers and match them up to any healthcare records, free of personally identifiable information. The information is then made accessible on SURE's systems for analysis.

The end result is that any researcher logging in can only identify an individual by a SURE-generated ID, unrelated to that person's original record ID. As the ID is specific to the researcher's project, it also means that researchers cannot collude with other researchers working on different projects to gain more personally identifiable information than originally permitted.
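The ID scheme described above can be illustrated with a short sketch. The article does not say how SURE or the data-linkage units actually derive project-specific IDs, so everything here is an assumption made for illustration: the `LinkageUnit` class, the use of a keyed hash (HMAC-SHA256) with a secret per project, and the sample person and project names are all hypothetical.

```python
import hashlib
import hmac
import secrets


class LinkageUnit:
    """Hypothetical data-linkage unit: it sees identifiers, never health records."""

    def __init__(self):
        self._person_ids = {}    # identifying details -> random person ID
        self._project_keys = {}  # project name -> secret key held by the unit

    def person_id(self, identifying_details):
        # One randomly generated ID per person, carrying no information about them.
        if identifying_details not in self._person_ids:
            self._person_ids[identifying_details] = secrets.token_hex(16)
        return self._person_ids[identifying_details]

    def project_id(self, identifying_details, project):
        # Assumption: derive the project-specific ID with a keyed hash, so the
        # same person gets a stable ID within a project but unrelated IDs
        # across projects.
        key = self._project_keys.setdefault(project, secrets.token_bytes(32))
        base_id = self.person_id(identifying_details)
        return hmac.new(key, base_id.encode(), hashlib.sha256).hexdigest()[:16]


unit = LinkageUnit()
person = ("Jane Citizen", "1950-01-01")  # made-up identifying details

id_a = unit.project_id(person, "project-a")
id_b = unit.project_id(person, "project-b")

print(id_a == unit.project_id(person, "project-a"))  # stable within one project
print(id_a != id_b)                                   # differs between projects
```

In this sketch, two researchers on different projects who compared their IDs for the same person would find nothing in common, which is the property the article describes.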

Security in SURE

To use SURE, researchers first need to have their project approved by the data owners, and also by a human research ethics committee. After this, researchers' access is limited to secure computers. These are scanned for malware prior to the log-on process, and a security certificate is installed on the permitted computer to limit access.

To log in, the researcher needs a password that meets minimum-length requirements, as well as a physical hardware token, which must be plugged in to the nominated computer via USB. If any of the three factors (the certificate, the USB token or the password) is missing, tampered with or incorrect, access to the system is denied.
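The all-or-nothing nature of the three factors can be sketched in a few lines. This is not SURE's log-on code, which the article does not describe; the `LoginAttempt` type, its field names and the `access_granted` function are hypothetical, and a real system would compare against a stored password hash rather than the password itself.

```python
import hmac
from dataclasses import dataclass


@dataclass
class LoginAttempt:
    certificate_ok: bool  # certificate present on the approved computer and intact
    token_ok: bool        # USB hardware token plugged in and recognised
    password: str         # password typed by the researcher


def access_granted(attempt: LoginAttempt, stored_password: str) -> bool:
    # Constant-time comparison, so timing does not leak how much of the
    # password matched.
    password_ok = hmac.compare_digest(
        attempt.password.encode(), stored_password.encode()
    )
    # Every factor must pass; failing any one of them denies access.
    return attempt.certificate_ok and attempt.token_ok and password_ok


stored = "correct horse battery staple"  # made-up example password

print(access_granted(LoginAttempt(True, True, stored), stored))   # all three pass
print(access_granted(LoginAttempt(True, False, stored), stored))  # token missing
```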

Although steps are taken to make the researcher's computer more secure than the average PC, analysis of data is still performed remotely. Researchers use their computer as a terminal to tap into virtual workstations on SURE's infrastructure, which run a highly specified version of 64-bit Windows 7. These virtual workstations have a number of proprietary and open-source analysis tools pre-installed, and others can be installed as needed after being assessed for security. This allows researchers to conduct their analysis using the greater computing power of SURE's facilities.

As analysis may take time to complete, researchers can log off and leave their analysis running in the background, logging in from another secured location at a later date.

The datacentre has a tier-3 ranking, and numerous other physical security features have been implemented, such as the removal of access to printers and removable media within the facility.

Even then, all files entering and leaving SURE pass through a privately developed audit tool called the Curation Gateway, which lets administrators flag any suspicious activity and take immediate action.

Once research has been completed, any remaining data files are encrypted and then archived only for as long as needed. After this time, they are destroyed.