NSW running Data61 de-identification tool across COVID data prior to public release

The CSIRO claims the Personal Information Factor tool lowers the risk of de-identified data being re-identified.

The New South Wales government has been using a tool to help de-identify data related to COVID-19 prior to the release of that data to the public, the CSIRO said on Thursday.

The tool, dubbed Personal Information Factor (PIF), has been created by Data61, the NSW government, the Australian Computer Society, Cyber Security Cooperative Research Centre (CSCRC), and "several other groups".

"The privacy tool assesses the risks to an individual's data within any dataset; allowing targeted and effective protection mechanisms to be put in place," the CSIRO claimed.

"The software uses a sophisticated data analytics algorithm to identify the risks that sensitive, de-identified and personal information within a dataset can be re-identified and matched to its owner."

NSW chief data scientist Dr Ian Oppermann said the tool was being used on datasets about people who had been infected with COVID-19 before those datasets were made publicly available.

"Given the very strong community interest in growing COVID-19 cases, we needed to release critical and timely information at a fine-grained level detailing when and where COVID-19 cases were identified," Oppermann said.

"This also included information such as the likely cause of infection and, earlier in the pandemic, the age range of people confirmed to be infected.

"We wanted the data to be as detailed and granular as possible, but we also needed to protect the privacy and identity of the individuals associated with those datasets."

Data61 said PIF assigns a risk score to a dataset and makes recommendations to make de-identification "more secure and safe".
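Data61 has not described how PIF calculates its score. As a rough illustration only of what a dataset-level risk score can look like, the sketch below uses a common textbook measure rather than PIF's algorithm: records are grouped by their quasi-identifiers (attributes an outsider might already know), and each record's risk is taken as one divided by the number of records sharing its combination. The function name, field names, and sample rows are all hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Return (per-record risks, dataset score).

    Assumption: risk is 1 / (size of the group of records that share the
    same quasi-identifier values). This is a textbook stand-in, NOT PIF.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    per_record = [1 / class_sizes[k] for k in keys]
    # The dataset is scored by its most exposed record.
    return per_record, max(per_record)


# Hypothetical rows, loosely modelled on the fields the article mentions.
records = [
    {"age_range": "30-39", "postcode": "2000", "source": "overseas"},
    {"age_range": "30-39", "postcode": "2000", "source": "overseas"},
    {"age_range": "70-79", "postcode": "2480", "source": "local"},
]

per_record, score = reidentification_risk(records, ["age_range", "postcode"])
print(per_record, score)
```

In this toy data the third row is the only one with its age range and postcode, so it carries the maximum risk of 1.0, and a recommendation in this simplified model would be to coarsen or suppress those fields for that row.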

The tool is also being used on other datasets such as domestic violence data and public transport usage, Data61 said.

PIF will be made available by June 22.

In a recent submission to a review of the Privacy Act, security researcher Vanessa Teague said de-identification does not work.

"A person's detailed individual record cannot be adequately de-identified or anonymised, and should not be sold, shared, or published without the person's explicit, genuine, informed consent," Teague said.

"Identifiable personal information should be protected exactly like all other personal information, even if an attempt to de-identify it was made."

At the end of 2017, a team of academics, including Teague, were able to re-identify some of the data from a set containing historic longitudinal medical billing records on one-tenth of all Australians.

"We found that patients can be re-identified, without decryption, through a process of linking the unencrypted parts of the record with known information about the individual such as medical procedures and year of birth," Dr Chris Culnane said at the time.

"This shows the surprising ease with which de-identification can fail, highlighting the risky balance between data sharing and privacy."

In September 2016, the University of Melbourne team found that the same dataset did not properly encrypt supplier codes. The Department of Health subsequently pulled the dataset down.

"Leaving out some of the algorithmic details didn't keep the data secure ­-- if we can reverse-engineer the details in a few days, then there is a risk that others could do so too," the team said at the time.

"Security through obscurity doesn't work -- keeping the algorithm secret wouldn't have made the encryption secure, it just would have taken longer for security researchers to identify the problem.

"It is much better for such problems to be found and addressed than to remain unnoticed."

In response, the Australian government sought to criminalise the intentional re-identification and disclosure of de-identified Commonwealth datasets and reverse the onus of proof, with the aim of applying the changes retrospectively from 29 September 2016.

The changes lapsed at the 2019 election.