A tech company decides to prioritize diversity in the workforce. To keep themselves accountable to this goal, they create a quarterly diversity report. It tabulates ethnicity, sexual orientation, and gender identity statistics.
In Q1, the company has 20 employees, 2 of whom (10%) identify as LGBT. By Q2, they’re hiring at a slower pace, but they do bring on one new engineer - Matt. Matt’s gay, but he isn’t out to his family. With his hiring, the company has 21 employees, 3 of whom (14%) identify as LGBT. They publish a blog post welcoming Matt to the team.
It doesn’t take powerful statistics to notice the de-anonymization risk: headcount rose by one and the LGBT count rose by one, so anyone who can read the company blog and view the company’s statistics can out Matt, and likewise deduce private details about other people who join and leave the company.
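The inference can be sketched as a comparison of two consecutive reports. This is a minimal illustration, not a description of any real reporting tool: the dictionary layout and the function name `differencing_leak` are hypothetical, and it assumes nobody left the company between the two reports.

```python
def differencing_leak(before, after, category):
    """Return True if comparing two reports pins a category on one new hire.

    Assumes (hypothetically) that nobody left between the two reports,
    so any change in headcount is due to new hires only.
    """
    joined = after["headcount"] - before["headcount"]
    category_change = after[category] - before[category]
    # One new person and one more person in the category:
    # the new person must be the one in the category.
    return joined == 1 and category_change == 1


# Figures from the story above.
q1 = {"headcount": 20, "lgbt": 2}
q2 = {"headcount": 21, "lgbt": 3}

if differencing_leak(q1, q2, "lgbt"):
    print("The single new hire can be linked to the reported category.")
```

Nothing here needs access to raw records. The two published aggregates and the public blog post are enough.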
In 1993, Jen moves to a small town in Montana. As the years pass, her town becomes smaller and smaller, as a nearby city draws most of the population away from the surrounding woods. Eventually Jen is the only woman born in 1970 in the town.
Jen lives in the United States, where the national census body, the US Census Bureau, reports statistics for different geographic units - blocks, tracts, states, and so on. Although the Bureau carefully obscures exact values and attempts to anonymize the data, anyone with a little first-hand knowledge about Jen knows that she’s the only person in one of the categories. In that category, the average salary is her exact salary, the average age is her age, and the same goes for every other value.
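A small sketch shows why a category of one is a problem. The records, field layout, and salary figures below are invented for illustration and have nothing to do with real Census tables. The point is only that an average over a single person is that person's value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical residents of one small town: (birth_year, sex, salary).
residents = [
    (1970, "F", 52000),  # the only woman born in 1970
    (1970, "M", 48000),
    (1970, "M", 61000),
    (1985, "F", 39000),
    (1985, "F", 45000),
]


def average_salary_by_group(rows):
    """Group rows by (birth_year, sex) and return {group: (count, average)}."""
    groups = defaultdict(list)
    for birth_year, sex, salary in rows:
        groups[(birth_year, sex)].append(salary)
    return {key: (len(values), mean(values)) for key, values in groups.items()}


table = average_salary_by_group(residents)
count, average = table[(1970, "F")]
if count == 1:
    print("This 'average' is one person's exact salary:", average)
```

Publishing the group size alongside the average makes the leak obvious, but hiding the size does not help someone who already knows the group has only one member.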
A citizen submits a public information request for New York City’s taxi data - a dataset consisting of start & end points, trip times, trip durations, and taxi license numbers. This data has already been successfully used for urban and transit planning: when the city redesigned its roads and bicycle infrastructure, taxi data helped measure the impact on traffic flows.
The FOIA officer tries to sanitize the personally identifiable information in the records: taxi medallion and hack numbers. But they use MD5, a common one-way hash function, to try to obscure these values. Because the input data is predictable - consisting of 5- or 7-digit numbers - it’s simple to run the same function over all possible numbers, match them up, and yield a de-anonymized dataset. Which is exactly what happens.
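The attack is short enough to sketch in a few lines. The example below assumes the identifiers were hashed as zero-padded decimal strings, which is an assumption about the encoding rather than something stated above, and the license number `04217` is made up. It enumerates only the 5-digit space to stay fast. The 7-digit space is ten million values, which is still small for a modern machine.

```python
import hashlib


def md5_hex(value: str) -> str:
    """Return the hexadecimal MD5 digest of an ASCII string."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def build_lookup(digits: int) -> dict:
    """Hash every zero-padded number with `digits` digits.

    Returns a dict mapping each digest back to the number that produced it.
    """
    lookup = {}
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"
        lookup[md5_hex(candidate)] = candidate
    return lookup


# 100,000 candidates: the entire 5-digit space.
lookup = build_lookup(5)

# A hypothetical "anonymized" value as it might appear in the released data.
anonymized = md5_hex("04217")
print(lookup[anonymized])
```

A hash only hides its input when the input is hard to guess. When every possible input can be listed, the digest is just another identifier.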
Each story starts with good intentions. The federal government uses the Census to allocate funding and resources. Startups use diversity reports to keep internal and external focus on their efforts to hire equitably. Anonymized transit data has helped guide transportation infrastructure planning.
But anonymization and aggregation sometimes fail to deliver individual privacy. I noticed the problem with diversity reporting when I was building such a report, and found that k-anonymity offers practical solutions and a solid mathematical foundation. The Census has a rich history of discussing, counteracting, and researching re-identification - thankfully, the story about Jen almost never happens, because the Census statisticians have experience, domain knowledge, and a mandate to respect privacy.
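As a rough illustration of the idea behind k-anonymity: a dataset is k-anonymous with respect to a set of quasi-identifiers if every combination of their values is shared by at least k records. The sketch below checks that property. The employee records, the field names `hired` and `role`, and both function names are hypothetical, and this is not the report I built.

```python
from collections import Counter


def small_groups(rows, quasi_identifiers, k):
    """Return the quasi-identifier combinations shared by fewer than k rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {group: n for group, n in counts.items() if n < k}


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination appears at least k times."""
    return not small_groups(rows, quasi_identifiers, k)


# Hypothetical records; the sensitive attribute itself is left out on purpose.
employees = [
    {"hired": "Q1", "role": "engineer"},
    {"hired": "Q1", "role": "engineer"},
    {"hired": "Q1", "role": "designer"},
    {"hired": "Q1", "role": "designer"},
    {"hired": "Q2", "role": "engineer"},
]

print(is_k_anonymous(employees, ["role"], 2))
print(small_groups(employees, ["hired", "role"], 2))
```

Grouping by role alone passes the check for k = 2, but adding the hiring quarter isolates the single Q2 hire. Any statistic broken down that finely would describe one person.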
Anonymization and re-identification are severely under-discussed - straightforward data leaks get most of the attention. Privacy is often conceived of as a binary public/private switch, rather than a spectrum. But the majority of data collected about us is collected in aggregate, and it’s likely that stronger privacy regulations like GDPR will encourage more companies to anonymize the data they collect and store.