A dataset containing historic longitudinal medical billing records of one-tenth of all Australians, approximately 2.9 million people, has been found to be re-identifiable by a team from the University of Melbourne, because details such as childbirths and professional sportspeople undergoing surgery to fix injuries are often made public and can be matched against the records.
The team, consisting of Dr Chris Culnane, Dr Benjamin Rubinstein, and Dr Vanessa Teague, warned that they expect similar results with other data held by the government, such as Census data, tax records, mental health records, penal data, and Centrelink data.
"We found that patients can be re-identified, without decryption, through a process of linking the unencrypted parts of the record with known information about the individual such as medical procedures and year of birth," Dr Culnane said.
"This shows the surprising ease with which de-identification can fail, highlighting the risky balance between data sharing and privacy."
Although the released dataset has a two-week perturbation of dates of medical events, the team said increasing the perturbation would not make much difference, since in the case of re-identifying older mothers or unique procedures, for instance, there are very few data points to obscure.
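The linking process the team describes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the records are invented, and the field names (`patient_id`, `birth_year`, `procedure`, `date`) and the `candidates` function are hypothetical, not the MBS/PBS schema or the team's actual method. It only shows the general idea that an encrypted ID offers no protection once the unencrypted fields single out one person, even when dates are shifted by up to two weeks.

```python
from datetime import date, timedelta

# Invented toy records. patient_id stands in for an encrypted identifier;
# the other fields are the unencrypted parts an attacker can match on.
RECORDS = [
    {"patient_id": "a91f", "birth_year": 1975, "procedure": "childbirth", "date": date(2012, 3, 9)},
    {"patient_id": "a91f", "birth_year": 1975, "procedure": "knee surgery", "date": date(2014, 7, 1)},
    {"patient_id": "c27d", "birth_year": 1975, "procedure": "childbirth", "date": date(2012, 8, 20)},
    {"patient_id": "e03b", "birth_year": 1988, "procedure": "childbirth", "date": date(2012, 3, 5)},
]


def candidates(records, birth_year, known_events, window_days=14):
    """Return the patient IDs consistent with publicly known facts.

    known_events is a list of (procedure, true_date) pairs. A record matches
    an event if its (possibly perturbed) date is within window_days of it.
    """
    window = timedelta(days=window_days)

    # Group records by patient, keeping only those with the known birth year.
    by_patient = {}
    for r in records:
        if r["birth_year"] == birth_year:
            by_patient.setdefault(r["patient_id"], []).append(r)

    # A patient is a candidate only if every known event has a matching record.
    matches = set()
    for pid, rows in by_patient.items():
        if all(
            any(r["procedure"] == proc and abs(r["date"] - when) <= window for r in rows)
            for proc, when in known_events
        ):
            matches.add(pid)
    return matches


# Publicly known: a person born in 1975 gave birth around 1 March 2012.
unique = candidates(RECORDS, 1975, [("childbirth", date(2012, 3, 1))])

# Once a single ID remains, the rest of that person's history is exposed.
history = [r for r in RECORDS if r["patient_id"] in unique]
```

In this toy data the two-week window leaves a single candidate, and the whole billing history attached to that ID follows. Widening the window only helps if other people with the same birth year and procedure fall inside it, which is the team's point about rare events.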
"Overall, including the re-identifications from childbirths, sports injuries, and single surgeries, we devised 43 queries and found seven unique matches," the team said.
With additional data such as credit card history, it is increasingly easy to create fingerprints of people, and this could already be happening within private medical insurers without people knowing, the team said.
"A private health insurer (for example) could efficiently track the medical records of past customers through the decades of data, or derive extra information they didn't know about from current customers," they said. "This would be a clear breach of privacy that would possibly never be reported, even though the data could lead to detrimental decisions for the individual in the future."
The team warned that the problem with releasing datasets with personal information is that it could be used far off into the future with additional information from other sources to re-identify people.
"Data about people should be much more carefully considered," the team wrote. "It is very unlikely that even the most well-informed and well-intentioned set of guidelines on de-identification can guarantee privacy protections appropriate for sensitive data such as the MBS/PBS 10 percent sample while retaining the usefulness of the data."
"Taking advantage of the benefits of big data without seriously compromising privacy is one of the most difficult engineering challenges of our time."
The team said the Department of Health was notified of the problems with the dataset in December 2016.
In September 2016, the same dataset was found by the University of Melbourne team to not be encrypting supplier codes properly. The dataset was subsequently pulled down by the Department of Health.
"Leaving out some of the algorithmic details didn't keep the data secure -- if we can reverse-engineer the details in a few days, then there is a risk that others could do so too," the team said at the time.
"Security through obscurity doesn't work -- keeping the algorithm secret wouldn't have made the encryption secure, it just would have taken longer for security researchers to identify the problem.
"It is much better for such problems to be found and addressed than to remain unnoticed."
As a result of the issues found, in October last year, the Australian government proposed changes to the Privacy Act that would criminalise the intentional re-identification and disclosure of de-identified Commonwealth datasets, reverse the onus of proof, and be applied retrospectively from September 29, 2016.
Under the changes, anyone who intentionally re-identifies a de-identified dataset from a federal agency could face two years' imprisonment, unless they work in a university or other state government body, or have a contract with the federal government that allows such work to be conducted.
Writing in their paper published on Monday, the team said the proposed legislation will have a chilling effect on research, and puts at risk efforts to make sure open data is properly protected.
"Whilst open data is not a safe approach for releasing this type of data, open government is the right paradigm for deciding what is," the team said.
"One thing is certain: Open publication of de-identified data is not a secure solution for sensitive unit-record level data."
The Department of Health told ZDNet it was not aware of anyone being identified through the dataset.
"This matter dates back to 2016 and since then the Australian government has taken further steps to protect and manage data," a spokesperson for the department said. "The project was halted and remains halted, and the dataset was removed immediately.
"The department is working with the University of Melbourne and has already acted to improve its processes."
The Office of the Australian Information Commissioner (OAIC) said it had opened an investigation in September 2016, and as it was ongoing, would not comment on the subsequent issues.
"The commissioner will make a public statement at the conclusion of the investigation," OAIC said.
Australia's Privacy Commissioner has said the de-identification of data is an area requiring regulation, and that agreed industry standards could be useful to fill the public with confidence.