In the mid-1990s, the Massachusetts Group Insurance Commission, an insurer of state employees, released to researchers healthcare data describing millions of interactions between patients and the healthcare system. Such records could easily reveal highly sensitive information (psychiatric consultations, sexually transmitted infections, addiction to painkillers, bed-wetting), not to mention the exact timing of each treatment. So, naturally, the GIC removed names, addresses and social security details from the records. Safely anonymised, these could then be used to answer life-saving questions about which treatments worked best and at what cost.
That is not how Latanya Sweeney saw it. Then a graduate student and now a professor at Harvard University, Sweeney noticed most combinations of gender and date of birth (there are about 60,000 of them) were unique within each broad ZIP code of 25,000 people. The vast majority of people could be uniquely identified by cross-referencing voter records with the anonymised health records. Only one medical record, for example, had the same birth date, gender and ZIP code as the then governor of Massachusetts, William Weld. Sweeney made her point unmistakable by mailing Weld a copy of his own supposedly anonymous medical records.
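The arithmetic behind Sweeney's observation is simple enough: two genders, 365 birthdays and roughly 80 birth years give something on the order of 60,000 combinations, comfortably more than the 25,000 people in a ZIP code. A minimal sketch of the cross-referencing itself might look like the following. Every record, name and diagnosis here is invented for illustration, and the `reidentify` helper is hypothetical rather than a reconstruction of Sweeney's actual method.

```python
from collections import Counter

# Hypothetical "anonymised" health records: names removed, but birth date,
# gender and ZIP code left in place.
health_records = [
    {"birth_date": "1950-03-14", "gender": "M", "zip": "02139", "diagnosis": "condition A"},
    {"birth_date": "1962-11-02", "gender": "F", "zip": "02139", "diagnosis": "condition B"},
    {"birth_date": "1962-11-02", "gender": "F", "zip": "02139", "diagnosis": "condition C"},
    {"birth_date": "1971-07-30", "gender": "F", "zip": "02140", "diagnosis": "condition D"},
]

# Hypothetical public voter roll: the same three fields, plus a name.
voter_roll = [
    {"name": "Voter One", "birth_date": "1950-03-14", "gender": "M", "zip": "02139"},
    {"name": "Voter Two", "birth_date": "1962-11-02", "gender": "F", "zip": "02139"},
    {"name": "Voter Three", "birth_date": "1971-07-30", "gender": "F", "zip": "02140"},
    {"name": "Voter Four", "birth_date": "1980-01-01", "gender": "M", "zip": "02140"},
]


def key(row):
    """The three quasi-identifiers shared by both data sets."""
    return (row["birth_date"], row["gender"], row["zip"])


def reidentify(voters, records):
    """Link a voter to a health record only when the match is unique on both sides."""
    record_counts = Counter(key(r) for r in records)
    voter_counts = Counter(key(v) for v in voters)
    unique_records = {key(r): r for r in records if record_counts[key(r)] == 1}
    matches = {}
    for voter in voters:
        k = key(voter)
        if voter_counts[k] == 1 and k in unique_records:
            matches[voter["name"]] = unique_records[k]["diagnosis"]
    return matches


matches = reidentify(voter_roll, health_records)
print(matches)
```

In this toy example two of the four voters are linked to a diagnosis. "Voter Two" escapes only because two health records happen to share her birth date, gender and ZIP code, which is exactly the kind of coincidence that becomes rare as the data get more detailed.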
In nerd circles, there are many such stories. Large data sets can be de-anonymised with ease; this fact is as screamingly obvious to data-science professionals as it is surprising to the layman. The more detailed the data, the easier and more consequential de-anonymisation becomes.
But this particular problem has an equal and opposite opportunity: the better the data, the more useful it is for saving lives. Good data can be used to evaluate new treatments, to spot emerging problems in provision, to improve quality and to assess who is most at risk of side effects. Yet seizing this opportunity without unleashing a privacy apocalypse — and a justified backlash from patients — seems impossible.
Not so, says Professor Ben Goldacre, director of Oxford University’s Bennett Institute for Applied Data Science. Goldacre recently led a review into the use of UK healthcare data for research, which proposed a solution. “It’s almost unique,” he told me. “A genuine opportunity to have your cake and eat it.” The British government loves such cakeism, and seems to have embraced Goldacre’s recommendations with gusto.
At the moment, we have the worst of both worlds: researchers struggle to access data because the people who hold patient records (rightly) hesitate to share them. Yet leaks are almost inevitable because there is patchy oversight of who has what data, and when.
What does the Goldacre review propose? Instead of emailing millions of patient records to anyone who promises to be good, the records would be stored in a secure data warehouse. An approved research team that wants to understand, say, the severity of a new Covid variant in vaccinated, unvaccinated and previously infected individuals, would write the analytical code and test it on dummy data until it was proved to run successfully. When ready, the code would be submitted to the data warehouse, and the results would be returned. The researchers would never see the underlying data. Meanwhile the entire research community could see that the code had been deployed and could check, share, reuse and adapt it.
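The workflow described above can be sketched in a few lines of code. This is a toy illustration under stated assumptions, not the design of OpenSAFELY or any real platform: the `TrustedResearchEnvironment` class, the `load_analysis` helper, the record fields and all the data are hypothetical, and the rule that suppresses very small groups is an added assumption that the review summary above does not mention. A real platform would also sandbox submitted code rather than simply executing it.

```python
# The researchers' analysis, written as plain source text so that it can be
# published alongside the results. Hypothetical example code.
ANALYSIS_SOURCE = '''
def severe_rate_by_group(records):
    """Share of severe cases within each vaccination-status group."""
    totals, severe = {}, {}
    for r in records:
        group = r["status"]
        totals[group] = totals.get(group, 0) + 1
        severe[group] = severe.get(group, 0) + (1 if r["severe"] else 0)
    return {g: {"n": totals[g], "severe_rate": severe[g] / totals[g]} for g in totals}
'''


def load_analysis(source, name):
    """Turn submitted source text into a callable (no sandboxing in this sketch)."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


class TrustedResearchEnvironment:
    """Holds the real records; runs submitted code and returns only aggregates."""

    def __init__(self, records, min_group_size=3):
        self._records = records                # never handed to researchers
        self._min_group_size = min_group_size  # assumed disclosure-control rule
        self.public_log = []                   # visible to the research community

    def submit(self, team, source, name):
        analysis = load_analysis(source, name)
        results = analysis(self._records)
        self.public_log.append({"team": team, "name": name, "source": source})
        # Withhold groups too small to report safely.
        return {g: s for g, s in results.items() if s["n"] >= self._min_group_size}


# Step 1: the team tests its code on dummy data with the same shape as the real data.
dummy = [{"status": "vaccinated", "severe": False},
         {"status": "unvaccinated", "severe": True}]
dummy_result = load_analysis(ANALYSIS_SOURCE, "severe_rate_by_group")(dummy)

# Step 2: the code is submitted to the warehouse, which holds the (invented) real records.
real = ([{"status": "vaccinated", "severe": s} for s in (True, False, False, False)]
        + [{"status": "unvaccinated", "severe": s} for s in (True, True, False, False)]
        + [{"status": "previously_infected", "severe": s} for s in (True, False)])
tre = TrustedResearchEnvironment(real)
results = tre.submit("hypothetical team", ANALYSIS_SOURCE, "severe_rate_by_group")
print(results)
print(len(tre.public_log), "analysis logged for others to check and reuse")
```

The researchers receive a table of rates and nothing else, while the submitted code sits in a public log where anyone can inspect, reuse or adapt it.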
This approach is called a “trusted research environment” or TRE. The concept is not new, says Ed Chalstrey, a research data scientist at The Alan Turing Institute. The Office for National Statistics has a TRE called the Secure Research Service to enable researchers to analyse data from the census safely. Goldacre and his colleagues have developed another, called OpenSAFELY. What is new, says Chalstrey, are the huge data sets now becoming available, including genomic data. Anonymisation is just hopeless in such cases, while the opportunity they present is golden. So the time seems ripe for TREs to be used more widely.
The Goldacre review recommends that the UK build more trusted research environments with four aims: earning the justified confidence of patients; letting researchers analyse data without waiting years for permission; making the checking and sharing of analytical tools something that happens by design; and nurturing a community of data scientists.
The NHS has an enviably comprehensive collection of patient records. But could it build TRE platforms? Or would the government just hand the project wholesale to some tech giant? Top-to-bottom outsourcing would do little for patient confidence or the open-source sharing of academic tools. The Goldacre review declares “there is no single contract that can pass over responsibility to some external machine. Building great platforms must be regarded as a core activity in its own right.”
Inspiring stuff, even if the history of government data projects is not wholly reassuring. But the opportunity is clear enough: a new kind of data infrastructure that would protect patients, turbo-charge research and help build a community of healthcare data scientists that could be the envy of the world. If it works, people will be sending the health secretary notes of appreciation, rather than his own medical records.
Written for and first published in the Financial Times on 1 July 2022.