Data anonymization is the process of altering a dataset so that individuals can no longer be identified from it. Techniques range from stripping out identifiers to aggregating data. Organizations often anonymize data to preserve user privacy and comply with regulations.
Regulators approach the subject differently. Europe’s General Data Protection Regulation (GDPR) has one of the strictest definitions of anonymous data, set out in Recital 26:
“information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”
In simpler terms, if data is anonymized it means that data subjects can no longer be identified from it.
Complying is not as easy as it sounds.
It is tempting to think that all you have to do is strip the names out of the data. In some circumstances, this may be all it takes. But if the data also includes things like Social Security numbers, phone numbers, or email addresses, these will have to go as well. You may even have to strip out attributes such as age, postal code, and birthdate, because they can point to an individual when combined.
Anonymization is complex
It all depends on what data you have, what you are using it for, and why it needs to be anonymized. Anonymization involves either removing information or reducing the precision of the data, both of which can make the dataset less useful.
In order to anonymize data without putting your organization at too much risk, you will need to understand which regulations your organization is subject to and conduct a risk assessment. If data is insufficiently anonymized, it may be possible for attackers to re-identify subjects, which could result in significant harm to the individuals and big fines for your company.
On the other hand, removing too much information could render the dataset useless, making the whole endeavor pointless. Of course, you must always meet your compliance obligations, but going to extremes to protect anonymity may not be the best solution either.
In some situations, you may determine that keeping the data in a form that’s usable for your purposes carries too high a risk of re-identification. In these cases, you should not proceed.
To anonymize data, you must strip out all direct identifiers. These are attributes that can be directly linked back to individuals, such as:
- Email address
- Phone number
- Social Security number
- Passport number
- Driver’s license number
- Credit card number
There are many more examples of direct identifiers, but they are all pretty straightforward. They are attributes that you can easily link to an individual’s identity. Let’s say that you did a bad job of anonymizing your dataset and left in people’s phone numbers. If an attacker gets their hands on the dataset and already knows some of those individuals’ phone numbers, the attacker could easily re-identify them.
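As a minimal sketch of this first step, the Python below drops direct-identifier fields from a list of records. The field names in `DIRECT_IDENTIFIERS` and the sample records are hypothetical, chosen only for illustration; a real dataset would need its own list, drawn up as part of a risk assessment.

```python
# Hypothetical field names; adjust them to match your own schema.
DIRECT_IDENTIFIERS = {
    "name", "email", "phone", "ssn",
    "passport_number", "drivers_license", "credit_card",
}

def strip_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without any direct-identifier fields."""
    return [
        {key: value for key, value in record.items() if key not in identifiers}
        for record in records
    ]

# Made-up sample data.
records = [
    {"name": "Alice Example", "email": "alice@example.com",
     "phone": "555-0100", "age": 47, "postal_code": "90210"},
    {"name": "Bob Example", "email": "bob@example.com",
     "phone": "555-0101", "age": 42, "postal_code": "90211"},
]

cleaned = strip_direct_identifiers(records)
```

Note that `cleaned` still contains age and postal code. Removing direct identifiers is only a starting point, not a guarantee of anonymity.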
Indirect identifiers are a little more complicated. They are pieces of information that can indirectly lead to an individual from the dataset being re-identified. Indirect identifiers can include things like:
- Postal code
It’s easiest to explain them through a simplified example. Let’s say that you’re a teacher and you have tried to create an anonymized dataset of your class. You stripped out all of the direct identifiers, but one of the attributes you collected lists hair color. If only one of your students has red hair, they could easily be re-identified from the dataset.
In a dataset of millions where many people have red hair, the risk of re-identification would be much lower. However, several indirect identifiers can often be combined to single out an individual, even when each one on its own is shared by many people. This means that the problems associated with indirect identifiers are context-specific. You will need to carefully consider your dataset to determine the risks of re-identification through indirect identifiers.
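One rough way to gauge this risk is to count how many records share each combination of indirect identifiers; the size of the smallest group is the "k" in what is commonly called k-anonymity. The sketch below assumes a made-up class list with hypothetical `hair_color` and `postal_code` fields, and it is only a first check, not a full risk assessment.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of indirect identifiers."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group; 1 means at least one record is unique."""
    return min(group_sizes(records, quasi_identifiers).values())

def unique_records(records, quasi_identifiers):
    """Records whose combination of indirect identifiers appears only once."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] == 1
    ]

# Made-up class list with direct identifiers already removed.
records = [
    {"hair_color": "brown", "postal_code": "3000"},
    {"hair_color": "brown", "postal_code": "3000"},
    {"hair_color": "black", "postal_code": "3000"},
    {"hair_color": "black", "postal_code": "3001"},
    {"hair_color": "red", "postal_code": "3000"},
    {"hair_color": "brown", "postal_code": "3001"},
]

by_hair = unique_records(records, ["hair_color"])
by_both = unique_records(records, ["hair_color", "postal_code"])
```

Here only the red-haired student is unique by hair color alone, but combining hair color with postal code makes four of the six records unique, which illustrates how quickly combinations erode anonymity.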
There is a range of techniques for anonymizing data, and the ideal option will depend on your dataset, the risk of re-identification, and what you want to use the data for. The techniques can be grouped into a few loose categories:
- Removing data – This can involve things like removing the entire record, removing an attribute, substituting it with something else, or masking part of the attribute to make it unidentifiable.
- Pseudonymization – Under Europe’s GDPR, pseudonymous data is data that can only be re-identified using information that is securely stored elsewhere. As an example, the names in a dataset may be replaced by identifying numbers. The mapping of the names to numbers would be protected in another database. Pseudonymization is not the same as anonymization, but in certain contexts it may be a useful way to protect user privacy and comply with regulations.
- Reducing precision or adding statistical noise – You can reduce the chances of re-identification by limiting the accuracy of your data. One way is to generalize values, such as by including age ranges (40-49) instead of specific ages (47), or only including the state instead of a postal code. Another is to add small random changes to the values. However, either approach can make the dataset less useful for certain purposes.
- Aggregation – In certain situations, aggregating and averaging records may give you sufficient insight from your dataset while limiting the risks of re-identification.
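The categories above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a production approach: the field names (`name`, `age`, `postal_code`, `visits`), the helper functions, and the sample records are all hypothetical, and the random IDs come from the standard library's `secrets.token_hex`.

```python
import secrets
from collections import defaultdict
from statistics import mean

def mask_postal_code(code, keep=3):
    """Masking: keep the first `keep` characters and hide the rest."""
    return code[:keep] + "*" * (len(code) - keep)

def generalize_age(age):
    """Reducing precision: replace an exact age with a ten-year range."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def pseudonymize(records, field):
    """Pseudonymization: swap `field` for a random ID.

    Returns the new records and the value-to-ID mapping. The mapping allows
    re-identification, so it would have to be stored separately and securely.
    """
    mapping = {}
    result = []
    for record in records:
        value = record[field]
        if value not in mapping:
            mapping[value] = secrets.token_hex(8)
        result.append({**record, field: mapping[value]})
    return result, mapping

def average_by(records, group_field, value_field):
    """Aggregation: average `value_field` within each `group_field` group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_field]].append(record[value_field])
    return {group: mean(values) for group, values in groups.items()}

# Made-up sample data.
records = [
    {"name": "Alice Example", "age": 47, "postal_code": "90210", "visits": 4},
    {"name": "Bob Example", "age": 42, "postal_code": "90211", "visits": 6},
    {"name": "Alice Example", "age": 47, "postal_code": "90210", "visits": 2},
    {"name": "Carol Example", "age": 35, "postal_code": "10001", "visits": 3},
]

pseudonymized, mapping = pseudonymize(records, "name")
generalized = [
    {**r, "age": generalize_age(r["age"]),
     "postal_code": mask_postal_code(r["postal_code"])}
    for r in pseudonymized
]
averages = average_by(generalized, "age", "visits")
```

Each step trades detail for privacy: after generalization an analyst can still compare age bands, but can no longer tell a 42-year-old from a 47-year-old. Whether that trade is acceptable depends on what the dataset is for.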
Anonymization isn’t always practical
It’s not always possible to anonymize data in a way that still leaves it usable for your intended purposes. On top of this, anonymization is far from foolproof. You may think that you have gone above and beyond in anonymizing your dataset, only to have an attacker re-identify some of the individuals using techniques that hadn’t crossed your mind.
Given this risk, you should always err on the side of caution. If certain types of data, such as Protected Health Information (PHI), are re-identified, the exposure could cause havoc in the individuals’ lives, not to mention huge costs for your organization.