The immense amount of data governments collect every day is enormously useful and valuable, but it is also mostly siloed and inaccessible even to those that own and manage it.
If the government owners of data can't use it effectively, what hope have citizens who may want to use it to inform their own life choices?
Statistics New Zealand tackled those dual problems by developing what it calls an Integrated Data Infrastructure (IDI) for use by prequalified researchers on approved projects.
Kelvin Watson, deputy chief executive organisation capability and services at Statistics NZ, says the current government is very focused on data-driven decision making. In 2013 it commissioned a piece of work called "Analysis for outcomes" and the IDI was one result.
Statistics was well positioned. Apart from its role as the lead agency for the Census and other studies, it had already been integrating government data for over a decade, effectively running an integration service for the public sector.
Now, in secure data labs dotted around the country, researchers are using anonymised data from across government to deliver data-driven policy and to build tools to help citizens make better decisions.
And while data in the IDI needs to be constantly refreshed to maintain its currency, Statistics is already trialling a tool that could enable analysis in real time.
Watson says all individual identifiers are stripped out after the various IDI data sets are matched. The only people who can access the data have to prove bona fide research requirements. There is stringent vetting both of researchers and of projects, he says.
The data labs in which research takes place are not connected to the internet and, at the end of the research, all outputs are checked to ensure they adhere to confidentiality rules.
"Individuals can never be identified and the information is never used for case management decisions for individuals," Watson says.
Data comes into the IDI from the likes of the Inland Revenue Department, Ministry of Social Development and the Ministry of Education in various formats. It is then processed and mapped into tables for use.
The strength of the IDI is that the data is already linked, either through a name or a common number for individuals or by a probabilistic measure (such as records having the same name and date of birth), before it is anonymised.
"Data integration is where a lot of the power is," Watson says. "We don't have researchers having to work that out. It's already done."
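The link-then-anonymise order described above can be illustrated with a minimal sketch. It is not Statistics NZ's method: the field names (`common_id`, `name`, `dob`), the sample records and the helper functions are all hypothetical, and the fallback here is an exact match on name and date of birth, which is a simplification of true probabilistic linking.

```python
import uuid

# Hypothetical fields treated as identifying; they are removed after linking.
IDENTIFYING_FIELDS = {"common_id", "name", "dob"}

def same_person(a, b):
    """Match on a common number when both records have one, else on name + date of birth."""
    if a.get("common_id") and b.get("common_id"):
        return a["common_id"] == b["common_id"]
    return (a["name"].strip().lower(), a["dob"]) == (b["name"].strip().lower(), b["dob"])

def strip_identifiers(record):
    """Return a copy of the record without identifying fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}

def link_and_anonymise(left, right):
    """Link matching records first, then replace identifiers with a random key."""
    linked = []
    for a in left:
        for b in right:
            if same_person(a, b):
                linked.append({"person": uuid.uuid4().hex,
                               **strip_identifiers(a), **strip_identifiers(b)})
    return linked

# Invented sample data standing in for two contributing agencies.
tax = [
    {"common_id": "A1", "name": "Ana Smith", "dob": "1990-01-01", "income": 52000},
    {"common_id": None, "name": "Ben Lee", "dob": "1985-05-05", "income": 61000},
]
education = [
    {"common_id": "A1", "name": "Ana Smith", "dob": "1990-01-01", "qualification": "BSc"},
    {"common_id": None, "name": "ben lee ", "dob": "1985-05-05", "qualification": "Diploma"},
]

linked = link_and_anonymise(tax, education)
```

Each row in `linked` carries income and qualification under a random key, with no name, date of birth or common number left in it, which is the form in which a researcher would see the data.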
Even so, working with the IDI is not a case of coming in and ticking a few boxes, although Statistics does make a set of standard research tools available.
"You still need to have quite a few smarts around how you manage your data," Watson says.
"The whole point of the IDI is about the life pathways," says Watson. "What happens to someone in education and what does it mean for them in employment? What does low education achievement mean in terms of potentially going into a life of crime or whatever?"
The social policy aspects are the easiest to describe but Watson says the spectrum of research is broad.
One piece of work looked at study options and what individuals studied, then tracked that through to what they are doing in employment. From that came a careers tool to help young people make their own informed study decisions.
But already Statistics is trying to go one better, trialling a tool from Melbourne-based Space Time Research that could enable data analysis in real time.
"It's potentially the next generation but it's still a twinkle in the eye at the moment," Watson says.
Where IDI data dates and needs to be refreshed, the Space Time tool uses a federated data model.
That could allow a researcher to access a front end and say they want to look at this cohort from the Ministry of Social Development connected to that cohort from Education and do some analysis.
The tool would then draw up-to-date data down, already linked and confidentialised.
"It means data would be only ever held at the home agency," Watson says.
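A toy sketch may help picture the federated idea. It is not the Space Time Research product or its API: the `AgencyStore` class, the `federated_count` function and the shared `link_token` field are all assumptions made for illustration, as are the sample rows. The point is only that each agency answers from its own current records, and the researcher receives aggregate counts rather than rows.

```python
from collections import Counter

class AgencyStore:
    """Hypothetical stand-in for a home agency that keeps its own records."""

    def __init__(self, name, rows):
        self.name = name
        self._rows = rows  # stays with the agency; never copied to a central store

    def cohort(self, predicate):
        # Evaluated at query time, so the result reflects the agency's current data.
        return {row["link_token"]: row for row in self._rows if predicate(row)}

def federated_count(agency_a, pred_a, agency_b, pred_b, group_by):
    """Join two cohorts on an assumed pseudonymous token and return counts only."""
    a = agency_a.cohort(pred_a)
    b = agency_b.cohort(pred_b)
    shared = a.keys() & b.keys()
    return Counter(b[token][group_by] for token in shared)

# Invented sample data for two agencies.
msd = AgencyStore("Ministry of Social Development", [
    {"link_token": "t1", "benefit": True},
    {"link_token": "t2", "benefit": True},
    {"link_token": "t3", "benefit": False},
])
education = AgencyStore("Education", [
    {"link_token": "t1", "qualification": "none"},
    {"link_token": "t2", "qualification": "degree"},
    {"link_token": "t3", "qualification": "degree"},
])

counts = federated_count(msd, lambda r: r["benefit"],
                         education, lambda r: True, "qualification")
```

Because nothing is cached centrally, a change made at the home agency shows up the next time the same query is run, with no refresh cycle in between.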
All up, it's pretty advanced stuff.
"When we talk to national statistics offices around the world they look at the IDI and say they need to get there too," says Watson.
- The IDI is primarily an extended star schema using Microsoft SQL Server, entirely hosted internally.
- The data is refreshed quarterly to include any new information from contributing agencies, and each refresh involves a new linking process.
- Researchers are offered SQL, SAS, R and Stata as standard tools for analysis. If tools such as ArcGIS or Python are required, Statistics will investigate the feasibility of implementing them.
- The Space Time Research tool proof of concept was carried out at Land Information NZ (LINZ) earlier this year to test the viability of the federated model. That proved to be successful. Work is now underway to identify more detailed requirements.
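For readers unfamiliar with the term, a star schema keeps events in a central fact table that points out to small dimension tables. The sketch below shows the general shape only. It uses an in-memory SQLite database rather than Microsoft SQL Server, and every table name, column name and row is invented for illustration; the IDI's actual tables are not described in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per anonymised person and per contributing source.
CREATE TABLE dim_person (person_key INTEGER PRIMARY KEY, birth_year INTEGER);
CREATE TABLE dim_source (source_key INTEGER PRIMARY KEY, agency TEXT);

-- Fact table: one row per recorded event, pointing at the dimensions.
CREATE TABLE fact_event (
    person_key INTEGER REFERENCES dim_person(person_key),
    source_key INTEGER REFERENCES dim_source(source_key),
    event_year INTEGER,
    event_type TEXT
);

INSERT INTO dim_person VALUES (1, 1990), (2, 1985);
INSERT INTO dim_source VALUES (1, 'Education'), (2, 'Inland Revenue');
INSERT INTO fact_event VALUES
    (1, 1, 2011, 'qualification'),
    (1, 2, 2013, 'employment'),
    (2, 2, 2010, 'employment');
""")

# Count events per contributing agency by joining facts to a dimension.
rows = conn.execute("""
    SELECT s.agency, COUNT(*)
    FROM fact_event AS f
    JOIN dim_source AS s ON s.source_key = f.source_key
    GROUP BY s.agency
    ORDER BY s.agency
""").fetchall()
```

A life-pathway question, such as what someone studied and where they were later employed, becomes a query over `fact_event` rows that share a `person_key`.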