Govt, hurry up with releasing data

A programmer scraped data from the My School website to make some really cool heat maps showing regions of smart schools — no thanks to the government, which didn't supply the data in any useful kind of format.
Written by Ben Grubb, Contributor

Joel Pobar, a former Microsoft employee, showed on his blog how he combined the scraped data with Google Maps to produce heat maps of which regions offered the best education.

He marked schools across the state in green or red on Google Maps, depending on how good each school's average was. It made for an interesting map and, on first impressions, revealed some striking patterns, such as city schools having better averages than rural schools.

NSW with the My School data applied
(Credit: Joel Pobar)

But these images weren't easy to produce. Pobar had a lot of trouble getting the raw data (data straight from its original source). He ended up "scraping" the data from the My School website, something he likely didn't have permission to do.

Data scraping, as defined by Wikipedia, is a technique in which a computer program extracts data from human-readable output (such as a web page).

In Pobar's case, he needed to extract data from the My School site and export it into a format that his code could understand. It is, however, something any programmer would try to avoid, as you can end up with all sorts of nasties if the data isn't extracted correctly.
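To make the idea concrete, here is a minimal sketch of scraping in Python, using only the standard library. It is not Pobar's code: the HTML snippet, the school names, the figures and the names `TableScraper` and `scrape_to_csv` are all invented for illustration, and the real My School pages will be laid out differently.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup and made-up figures, NOT the real My School page.
SAMPLE_HTML = """
<table>
  <tr><th>School</th><th>Average</th></tr>
  <tr><td>Example Public School</td><td>512</td></tr>
  <tr><td>Sample High School</td><td>478</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_to_csv(html):
    """Turn the table cells found in `html` into comma-separated text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


if __name__ == "__main__":
    print(scrape_to_csv(SAMPLE_HTML))
```

The "nasties" come from the fact that the scraper depends entirely on the page's markup: if the site changes its layout, or a cell is formatted in an unexpected way, the program can silently produce wrong or incomplete rows.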

The scraping process took him around four hours: four hours of his life he could have had back if the government had provided the data for developers to use. "Why didn't the government just offer up the raw data and let the programmers of Australia mash it up ... or at least give me a feed of the raw data to save me some time," Pobar said on his blog.

You see, government website data is not, by default, licensed under a Creative Commons licence (oh how nice that would be!). Although we pay taxes to the government, we don't own the information it produces. That data is Crown data: data we need permission to reproduce. So if Pobar wished to publish his work, he would need to seek permission to do so. If he wanted to earn money from the work, well, that's another kettle of fish.

My School's copyright statement says:

Copyright in the content and design of this website, including publications and logos, is owned by or licensed to the Australian Curriculum, Assessment and Reporting Authority (ACARA).

Subject to uses permitted under the Copyright Act 1968, you may only download, display, print and reproduce this material in unaltered form only for your personal, non-commercial educational use or non-commercial educational use within your organisation. However, unless otherwise indicated, this permission does not extend to reproduction, communication to the public, publication or other use of the work (in whole or in part) on an external website, intranet site or equivalent media.

This is an issue the Government 2.0 Taskforce attempted to fix late last year by creating a contest designed to entice programmers to use government data.

In creating the competition, it also launched a new website called data.australia.gov.au, which lists a whole bunch of raw data sets available for people to use.

This is a great step forward, but we need more of it. At the time I wrote this, the date of the last data release on the site was 11 December. That's last year! Also, some of the data wasn't raw; it was in Excel spreadsheets, which weren't comma-separated or easily usable for mashups. That definitely needs improvement.

Of course, I understand that not all data can be released.

I was at a mashup event last year where an Australian Bureau of Statistics (ABS) employee faced down angry developers calling for the release of data. He said that if the bureau were to release most of its raw data, it could allow people to figure out sensitive information about other people or companies.

"Some people are that brilliant that they can work out how much companies earn, what their profit margins are and all of that — and that's something that we have to kind of avoid," said Anthony Zuza, quality assurance manager at the ABS.

I guess finding the right balance is going to be tough, and it is something that is slowing the government down in releasing information in the right format. Hopefully we can get over this, so that developers can start doing more cool things.
