- De-identified data is the bedrock of modern marketing and scientific research.
- Using machine learning, researchers estimate the likelihood that a specific person could be re-identified from anonymous data.
- Nearly all Americans could be re-identified based on 15 demographic characteristics, the research suggests.
We've all done it: When signing up for an account online, we've clicked "I agree" to have our data sold to third parties. It will be anonymized, we're assured, and only a small percentage of data will be made available to others.
But how sure can we be that our personal data can't be traced back to us? That's the central question that a team of researchers at Université catholique de Louvain in Belgium and Imperial College London sought to answer.
The conclusion: "not very."
Using machine learning, the researchers developed a system to estimate the likelihood that a specific person could be re-identified from an anonymized dataset containing demographic characteristics. The researchers' model suggests that over 99% of Americans could be correctly re-identified from any dataset using 15 demographic attributes, including age, gender and marital status.
"While there might be a lot of people who are in their thirties, male and living in New York City, far fewer of them were also born on January 5, are driving a red sports car and live with two kids (both girls) and one dog," said Luc Rocher, a PhD candidate at Université catholique de Louvain and the study's lead author. Personal data can be used for research, illicit activities and even investing, as CNBC has previously reported.
Their paper, "Estimating the success of re-identifications in incomplete datasets using generative models," was published in the journal Nature Communications. Their findings suggest that commonly used anonymization techniques, such as adding noise and sampling data, may not be enough to meet the standards of data privacy laws like the European Union's GDPR and the California Consumer Privacy Act (CCPA).
The results "question whether current de-identification practices satisfy the anonymization standards of modern data protection laws such as GDPR and CCPA," the researchers wrote.
As part of their research, the trio published an online tool to help people understand how likely they are to be re-identified, based on just three common demographic characteristics: gender, birth date and ZIP code. On average, people have an 83% chance of being re-identified based on those three data points, the researchers said.
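The intuition behind that figure can be shown with a rough sketch in Python. This is not the researchers' method: their paper uses a generative model to estimate re-identification risk in incomplete datasets, while the sketch below simply counts how many records in a small, complete table are the only ones with their combination of attributes. The function name `unique_fraction`, the field names and the five records are all made up for illustration.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Share of records whose values on `attributes` appear exactly once."""
    if not records:
        return 0.0
    # Build one key per record from the chosen attributes.
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    # A record is "unique" if nobody else shares its key.
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Hypothetical toy records, invented for this example only.
people = [
    {"gender": "M", "zip": "10001", "birth_date": "1985-01-05"},
    {"gender": "M", "zip": "10001", "birth_date": "1985-03-17"},
    {"gender": "M", "zip": "10001", "birth_date": "1985-01-05"},
    {"gender": "F", "zip": "10001", "birth_date": "1990-07-22"},
    {"gender": "F", "zip": "10002", "birth_date": "1990-11-30"},
]

if __name__ == "__main__":
    # Add one attribute at a time and watch the unique share change.
    chosen = []
    for attribute in ["gender", "zip", "birth_date"]:
        chosen.append(attribute)
        share = unique_fraction(people, chosen)
        print(f"{', '.join(chosen)}: {share:.0%} of records are unique")
```

In this toy table, no record is unique on gender alone, but the unique share rises as ZIP code and birth date are added. Each extra attribute splits people into smaller groups, which is why a short list of ordinary demographic details can be enough to single someone out.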
"The goal of anonymization is so we can use data to benefit society," said Yves-Alexandre de Montjoye, one of the researchers. "This is extremely important but should not and does not have to happen at the expense of people's privacy."