We’re still on our data journey, but something’s brewing


Data is the lifeblood of the modern economy. It influences, enables and personalizes how we work, play and engage in social activities. It is crucial for the functioning of the economy and society.

Banks and financial services companies can increasingly be viewed as data and digital service companies with some brick-and-mortar operations. Without trying too hard, this analogy can be applied to all sectors of the economy, including government.

Value comes from creating, using, protecting and sharing data. The “use” of data is a very broad term that includes analysis, storage, aggregation, dissemination and disposal.

Amazingly, however, systematic data sharing is still a struggle. Two of the main reasons for hesitation in sharing data are concerns:

  • that data is generally not “useful” beyond the original reason for which it was collected;
  • about the lack of control when data and products generated from that data are used and reused.

The real life cycle of data often has many twists and turns, and the “hands” that touch the data or data products can involve different actors. A data lifecycle can involve many connections, multiple regulatory environments, sharing in many forms, and different uses for the data once received.

The complexity and the unknown overall paths and consequences mean that many data stewards are so reluctant to share data that they simply don’t. Not sharing is the only safeguard they can guarantee.

What we need are frameworks for handling data appropriately and securely throughout its lifecycle, and guidance on how to safely use products (insights, alerts, decisions) built from data.

Data as the new electricity

Many people have tried to find an analogy for data to help us think about what we have, how we can safely use it, and what we must do to harness its power.

“Data is the new oil/asbestos/water” analogies all have some merit, but they miss a number of fundamental properties of data. A record can be relatively harmless, but combined with another record it can suddenly become revealing. Data can be used and reused without degrading its quality. Data can be shared indefinitely and used differently each time.

My current favorite analogy is to compare data to electricity. It has taken us more than 100 years to develop ways to safely handle electricity of different voltages and currents, but now electricity is literally everywhere in our lives, from lights to vehicles, from computers to digital clocks. We need to develop secure frameworks to work with both 240V data and 24,000V data.

We have changed – we are willing and we are able

I once liked to say that the main reasons for not sharing data fall into “not ready”, “cannot” or “not allowed”. We’ve changed, and a large part of that change has been driven by the need to respond to COVID with (data-driven) situational awareness and actionable insights. COVID increased the use and perceived value of data and insights.

The Intergovernmental Agreement on data sharing came into force on July 9, 2021. It requires all jurisdictions to share public sector data by default where it is safe, secure, legal and ethical to do so. The agreement recognizes data as a common national asset and aims to maximize the value of data to provide excellent policies and services to Australians.

We are now increasingly willing and allowed to share and use data. What still needs to be addressed are the repeatable patterns of data usage – connecting the “principles” of safe, secure and ethical sharing and use of data to the “bits” in a dataset or data product created from it.

NSW put the pieces of the puzzle together with the NSW AI strategy (a data usage framework for AI) released in 2020, the Smart Places Initiative (using data to make places “smart”), also in 2020, the establishment of an AI review board in 2021 to review real projects, the publication of the NSW data strategy in 2021 and, most recently, the release of the AI Assurance Framework in March, which agencies are now obliged to use. We are now focused on developing repeatable data usage patterns.

What is a repeatable pattern and how does it help?

Every data set is unique and every product created from that data is unique. But the recipes for sharing and using data can be repeatable patterns. The main elements are:

  • determining whether we can reasonably access the data;
  • assessing whether the data is suitable for the purpose for which we will use it;
  • determining what guidance or restrictions are required for the ongoing use of the data products created from that use.
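As a minimal sketch only, the three elements above can be expressed as a sequence of gates, where failing any gate stops the pattern. The type names, fields and return strings here are my own illustrative assumptions, not part of any published NSW framework:

```python
from dataclasses import dataclass

@dataclass
class UsageRequest:
    """Illustrative description of a proposed data use (field names are assumptions)."""
    have_lawful_access: bool    # can we reasonably access the data?
    fit_for_purpose: bool       # is the data suitable for the intended use?
    product_controls: list[str] # restrictions on products created from the use

def assess(request: UsageRequest) -> str:
    """Walk the three elements in order; any failed gate stops the pattern."""
    if not request.have_lawful_access:
        return "do not access"
    if not request.fit_for_purpose:
        return "do not use for this purpose"
    if not request.product_controls:
        return "define product controls before use"
    return "proceed under: " + ", ".join(request.product_controls)
```

For example, `assess(UsageRequest(True, True, ["aggregate before release"]))` proceeds, while a request with no lawful access is stopped at the first gate.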

The elements that help us determine the repeatable pattern to apply are largely driven by understanding the provenance of the data, the quality of the data, the amount of personal information in the data, the inherent sensitivity of the data itself, and the sensitivity associated with the use of the data and the products created from it.

A key factor in the ability to make these determinations stems from the need for metadata—data about the data—and data about its lifecycle history. This metadata is almost always incomplete. But if we had it, it would be easier to figure out which repeatable pattern to use, as well as identify the safeguards needed to share data (and data products) for more reasons.
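To make this concrete, the metadata driving pattern selection could be captured in a record like the following. This is a sketch with assumed field names, not a published metadata schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    """Metadata about a dataset and its lifecycle history (illustrative only)."""
    provenance: str           # where the data came from and how it was collected
    quality: str              # e.g. "validated", "raw", "unknown"
    personal_information: str # amount of personal information in the data
    inherent_sensitivity: str # sensitivity of the data itself
    use_sensitivity: str      # sensitivity of the intended use and its products
    lifecycle_events: list[str] = field(default_factory=list)  # sharing/handling history

def record_event(meta: DatasetMetadata, event: str) -> None:
    """Append one lifecycle event so later users can see how the data was handled."""
    meta.lifecycle_events.append(event)
```

In practice most of these fields are exactly the ones that are "almost always incomplete"; a record like this makes the gaps visible, which is itself useful.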

A real-world example of data usage: COVID case reporting

In March 2020 the NSW Government committed to publishing daily information on the evolving number of confirmed COVID cases at postcode level.

There were concerns about the volume of personal information and the inherent sensitivity of the data. These were offset by the public’s strong desire to be kept informed of the evolving COVID situation.

A full set of possible fields for publication was compiled from NSW Health sources and then tested for the total amount of information that would be disclosed about an individual (and whether the individual could be identified) if published.

A number of consultations were carried out to balance data published “in the public interest” against data that was merely “of public interest”. These consultations also considered the risks of re-identifying individuals and how much information could be linked to an identified individual.

Some time earlier, a Personal Information Factor (PIF) tool had been developed and tested. The PIF tool was used to establish a worst-case cap on the amount of information that would be released if an individual were identified. This tool and measurement process were used to design additional protections (separation, aggregation, obfuscation) for the data before it was released as open data.

The reduced-field tables were analyzed daily to ensure that the PIF was reduced to an agreed level before publication. The dataset takes the form of rows (unique people) and columns (characteristics associated with those people).
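The actual PIF measure is described in the CSIRO/Data61 article referenced below. As a loose illustration of the same idea only, a much simpler proxy is a k-anonymity-style check: count how many rows (people) share each combination of published characteristics, and hold back publication until the rarest combination is above an agreed size. None of the names or thresholds here come from the real PIF tool:

```python
from collections import Counter

def smallest_group(rows: list[tuple]) -> int:
    """Size of the rarest combination of published characteristics in the table."""
    counts = Counter(rows)
    return min(counts.values())

def safe_to_publish(rows: list[tuple], agreed_minimum: int = 5) -> bool:
    """Crude stand-in for 'PIF reduced to an agreed level': no combination of
    published characteristics may describe fewer than `agreed_minimum` people."""
    return smallest_group(rows) >= agreed_minimum
```

For example, a table where one (postcode, age band) combination matches a single person fails the check; aggregating (dropping a column or widening the bands) grows the smallest group until publication is acceptable.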

The published data was also used to create daily updated spatial maps of COVID cases in NSW. The dataset and maps were updated daily for about two and a half years. Two data product sets were created:

  • High Control Environment: record-level data including personal information and unique rows. This was accessible to data stewards and analysts working under conditions of confidentiality, within the regulatory environment that operated during the COVID health emergency in NSW.
  • No Control Environment: data with personal and sensitive information reduced, released openly to the public.

So what now?

We need to make serious efforts to refine repeatable usage patterns and create awareness of the fundamental importance of metadata. The good news is that there are a number of new international standards (ISO, IEC and JTC1) evolving rapidly that will help… but only if we are willing to use them.

Standards and metadata aren’t for everyone, but there’s a lot of serious quality thinking behind a published standard. And there is a standard for preparing a cup of tea (ISO 3103).

Dr Ian Oppermann is the NSW Government’s Chief Data Scientist and Industry Professor at the University of Technology Sydney.

More information about the PIF tool can be found in a case study on the NSW Government website. A description of the PIF tool can be found in a CSIRO/Data61 article.

