Privacy Enhancing Data De-Identification Framework – ISO/IEC 27559:2022(E)

If your organization ever wishes or needs to disclose personally identifiable information (PII), e.g., to third parties for processing, to researchers for scientific purposes, or to the public under access-to-information obligations, you have to ensure that the privacy of the individuals to whom the data pertains is adequately protected. The new ISO framework provides guidance on how to do so.

Proper protection of PII contained in datasets that are to be disclosed, however widely, requires an assessment of the context of disclosure and of the data itself, in particular its identifiability, followed by mitigation of that identifiability and de-identification governance before as well as after the disclosure. The new ISO framework addresses these steps in detail, and we summarize the highlights in this blog.

While ISO standards are voluntary, this framework can be a useful supplement to the requirements of many privacy laws and regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) and the GDPR, which make the de-identification of data mandatory before certain kinds of disclosure.

Context Assessment

Context assessment refers to the determination of a data recipient’s privacy and security posture, the intended use, the level of transparency with which the data will be disclosed, as well as the presumed additional knowledge the data recipient likely has about individuals contained in the dataset. In addition, the context assessment also requires the examination of potential threats.

Threat Modeling

Potential threats can be categorized into three kinds: deliberate, accidental, and environmental. A deliberate attack refers to an attempt to identify individuals in the dataset by an insider to the recipient’s infrastructure, such as an employee. An accident, on the other hand, describes the unintentional disclosure of PII, for example where the data recipient happens to recognize an individual whose data are included in the dataset. An environmental threat refers to loss or theft of data when all IT security controls fail.

The data recipient’s ability to prevent and mitigate the realization of each of these threats should be assessed. For consistency and efficiency, it is recommended that third-party audits be undertaken and relevant certifications considered.

Where data are shared non-publicly, a prudent mitigation strategy involves imposing contractual obligations upon the data recipient that require it, for example, to conduct awareness training, limit use cases, prohibit further sharing of the data, and permit recurrent compliance audits.

Transparency and Impact Assessment

In addition to assessing the data recipient, a context assessment should also include engaging other stakeholders, such as the individuals represented in the data, organizations that disclose similar data, or privacy regulators. Disclosing one’s collection, use, and disclosure practices fosters trust and enables stakeholders to voice concerns which can then lead to an appropriate balance between risks and benefits of disclosure. 

Privacy impact assessments are a useful, and often mandatory, mechanism by which privacy risks are identified and mitigation strategies surfaced. The earlier such an impact assessment is undertaken, the better, because the early stages offer the most opportunities to implement privacy by design.

Data Assessment

The purpose of the data assessment is to understand the features of the data that is disclosed and what external background information is likely available to an adversary, that is, a person attempting to re-identify the data contained in a dataset. These insights will inform the decision of what data points need to be de-identified to protect privacy and which ones can remain in support of the use case. 

Data Features

The ISO framework provides a helpful categorization that can be used to structure the data assessment: the dataset is composed of population units (usually represented by the rows of a dataset in the case of structured data) and the information about each population unit (called attributes).

On the level of the population unit, the organization that intends to disclose the data should consider whether the data represents a single unit, i.e., a person or a household, whether it is an aggregate of several units, whether particularly vulnerable individuals are included, and whether the entire dataset is disclosed, or just a sample, leaving uncertainty on the side of an adversary as to who is represented in the data.

Considering the attributes, it needs to be determined whether they constitute direct or indirect identifiers, their level of uniqueness and sensitivity, whether they can be complemented with other available data, and what the adversary can learn about them through targeted searches. This assessment will show the value of the data to an adversary.
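
As a minimal sketch of how the uniqueness of attributes might be checked, the snippet below counts how many records are singled out by one attribute versus a combination of attributes. The records, the column names, and the `unique_fraction` helper are hypothetical and chosen purely for illustration; the framework does not prescribe this calculation.

```python
from collections import Counter

# Hypothetical toy records; the column names are illustrative only.
records = [
    {"age": 34, "postcode": "M5V", "sex": "F"},
    {"age": 34, "postcode": "M5V", "sex": "F"},
    {"age": 51, "postcode": "K1A", "sex": "M"},
    {"age": 29, "postcode": "M5V", "sex": "M"},
]

def unique_fraction(rows, attributes):
    """Share of rows whose combination of `attributes` occurs exactly once."""
    keys = [tuple(row[a] for a in attributes) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)

# A single attribute singles out fewer rows than a combination does.
single = unique_fraction(records, ["postcode"])
combined = unique_fraction(records, ["age", "postcode", "sex"])
print(single, combined)
```

In this toy data, the postcode alone singles out one record in four, while the combination of age, postcode, and sex singles out two in four, which illustrates why indirect identifiers need to be assessed together and not one by one.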

Attack Modelling

The data assessment, in a second step, requires a quantification of the risk. For this purpose, the framework considers only deliberate attacks and further divides them into three scenarios: (1) Prosecutor risk – the adversary knows that the individual they are targeting is included in the dataset, (2) Journalist risk – the adversary does not know whether the targeted individual’s data are included in the dataset, e.g., because only a sample of the complete dataset in which the individual is included is made available, and (3) Marketer risk – the attack is not targeted; the objective is to identify as many individuals as possible.

The metrics associated with these three attacks are the maximum risk and the average risk. The former is the metric of choice when no security measures are applied, such as when the data are made publicly available. The maximum risk is calculated by considering the risk to the individual who is most likely to be identified in an attack. This level of prudence is necessary because it must be expected that an attacker will attempt identification even if just to demonstrate that it is possible.

The average risk metric can be applied if there are additional controls in place to protect the data, so that a demonstration attack is prevented. As the name suggests, in this case the average identifiability across the entire dataset is calculated.
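
The two metrics can be illustrated with a short sketch. It assumes one common way of operationalizing them, which is not quoted from the framework: records sharing the same indirect identifier values form a group, and the risk for each record is taken as one divided by the size of its group. The dataset, the column names, and the helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical toy dataset; "age_band" and "region" stand in for indirect identifiers.
rows = [
    {"age_band": "30-39", "region": "North"},
    {"age_band": "30-39", "region": "North"},
    {"age_band": "30-39", "region": "North"},
    {"age_band": "40-49", "region": "South"},
    {"age_band": "40-49", "region": "South"},
]

def class_sizes(rows, quasi_identifiers):
    """For each row, the number of rows sharing its quasi-identifier values."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return [counts[key] for key in keys]

def maximum_risk(sizes):
    """Risk to the most identifiable individual: 1 / smallest group size."""
    return 1 / min(sizes)

def average_risk(sizes):
    """Mean of the per-row risks (1 / group size) across the whole dataset."""
    return sum(1 / size for size in sizes) / len(sizes)

sizes = class_sizes(rows, ["age_band", "region"])
max_risk = maximum_risk(sizes)
avg_risk = average_risk(sizes)
print(max_risk, avg_risk)
```

In this toy dataset the smallest group contains two records, so the maximum risk is 0.5, while the average risk across all five records works out to 0.4. The maximum is never lower than the average, which is why it is the more cautious choice for public releases.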

Identifiability Assessment and Mitigation

In the identifiability assessment, the results of the context and data assessment are brought together to quantify the identifiability. The probability of identification of an individual is determined by the probability that identification will occur, assuming there is going to be a threat, e.g., a deliberate attack, times the probability of the occurrence of a threat.

P(identification) = P(identification | threat) × P(threat) 

Assessing Identifiability

The result of this function will be a number which can then be compared to well-established identifiability thresholds that are set out in Annex B to the ISO framework. Depending on whether an attack against an entity or a group is being modeled, and depending on how likely and impactful the attack would be, the threshold lies between 0.0005 and 0.1.
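
A worked example of the formula and the threshold comparison might look as follows. The two input probabilities are invented for illustration, the function names are hypothetical, and treating a value equal to the threshold as acceptable is an assumption; only the two threshold bounds, 0.0005 and 0.1, come from the range cited above.

```python
def identification_probability(p_id_given_threat, p_threat):
    """P(identification) = P(identification | threat) x P(threat)."""
    for p in (p_id_given_threat, p_threat):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return p_id_given_threat * p_threat

def meets_threshold(probability, threshold):
    """Assumption: a probability at or below the threshold is acceptable."""
    return probability <= threshold

# Hypothetical inputs: a 20% chance of identification if an attack occurs,
# and a 30% chance that an attack occurs at all.
p = identification_probability(0.2, 0.3)

# The two ends of the threshold range mentioned in the text.
ok_lenient = meets_threshold(p, 0.1)
ok_strict = meets_threshold(p, 0.0005)
print(p, ok_lenient, ok_strict)
```

With these inputs the probability of identification is about 0.06, which would pass the most lenient threshold of 0.1 but fail the strictest one of 0.0005, so the same dataset can be acceptable in one disclosure context and unacceptable in another.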

Subjective as well as objective factors may be considered when evaluating the impact a successful threat will have on an individual whose data are disclosed. The higher the impact, the lower the threshold should be. However, legitimate benefits of the disclosure can also be taken into account, as well as the reasonable expectation of privacy of the individuals the information of whom is contained in the dataset.

Assumptions made about the identifiability of the data should be subjected to adversarial testing, that is, a simulated attack by a friendly adversary with relative competence and access to common external resources. While this is good practice, the framework concedes that the approach is resource intensive and cannot represent all the possible threats the data may be exposed to.

Mitigation

Mitigating efforts can be directed at the context of disclosure or the data itself, or both. Where applicable, the data recipient may be contractually required to enhance its security practices, to limit access on a strict need-to-know basis, permit the analytical output to be checked by the disclosing organization, etc. 

Modifying the data itself will often reduce the utility of the data. However, depending on the data recipient, it may be the only mitigation strategy available. The framework thus recommends eliminating all direct identifiers or replacing them with values that are not linked in any way to the original information. For high-risk data, this process should be irreversible.
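
One way to replace direct identifiers with unlinked values is to substitute random tokens, as in the sketch below. The field names and the `replace_direct_identifiers` helper are hypothetical, and the approach is one possible technique, not one taken from the framework. Because the tokens come from a random source and are not derived from the original values, the replacement cannot be reversed once the mapping is discarded.

```python
import secrets

def replace_direct_identifiers(rows, direct_identifiers, keep_mapping=False):
    """Replace each direct identifier with a random token.

    The same original value receives the same token within one release, so
    records belonging to one person stay linkable to each other. Returning
    no mapping (the default) makes the replacement irreversible.
    """
    mapping = {}
    released = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are not modified
        for field in direct_identifiers:
            key = (field, row[field])
            if key not in mapping:
                mapping[key] = secrets.token_hex(8)  # 16 random hex characters
            new_row[field] = mapping[key]
        released.append(new_row)
    return released, (mapping if keep_mapping else None)

# Hypothetical input records.
patients = [
    {"name": "Alice Example", "email": "alice@example.com", "age_band": "30-39"},
    {"name": "Alice Example", "email": "alice@example.com", "age_band": "30-39"},
    {"name": "Bob Example", "email": "bob@example.com", "age_band": "40-49"},
]
released, mapping = replace_direct_identifiers(patients, ["name", "email"])
```

Keeping the mapping (`keep_mapping=True`) would allow authorized re-identification later, which may suit lower-risk data but runs counter to the irreversibility recommended for high-risk data.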

Indirect identifiers and sensitive information that are not required for the data analysis should also be eliminated or otherwise modified to achieve an acceptable identifiability risk level. The easiest modifications, which are at the same time quite effective, are generalization and sampling. Generalization reduces the level of detail, either by enlarging numerical intervals, e.g., providing an age range in place of an individual's exact age, or by combining several categories of data into one. Sampling means releasing only a subset of the records, so that some individuals are left out altogether, which leaves the adversary uncertain whether a particular individual’s data are contained in the dataset.
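
Both techniques can be sketched in a few lines. The interval width of ten years, the sampling fraction, and the function names are illustrative assumptions, and sampling is interpreted here as releasing a random subset of the records.

```python
import random

def generalize_age(age, width=10):
    """Replace an exact age with the interval it falls into, e.g. 34 -> '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

def sample_rows(rows, fraction, seed=None):
    """Return a random subset containing roughly `fraction` of the rows."""
    rng = random.Random(seed)
    k = round(len(rows) * fraction)
    return rng.sample(rows, k)

# Hypothetical records with exact ages from 20 to 29.
people = [{"id": i, "age": 20 + i} for i in range(10)]

generalized = [{"id": p["id"], "age": generalize_age(p["age"])} for p in people]
subset = sample_rows(generalized, fraction=0.5, seed=42)
print(generalize_age(34), len(subset))
```

Wider intervals and smaller sampling fractions lower identifiability further but also reduce the utility of the data, so both parameters would need to be tuned against the intended use case.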

Following the mitigation efforts, it is recommended to re-evaluate the data’s identifiability as measured against the chosen threshold.

De-Identification Governance

The ISO framework suggests implementing data sharing policies and procedures into the organization’s wider information security practices. This systematizes the approach to disclosure, makes it repeatable, auditable, and enables the organization to respond most effectively to privacy incidents and breaches.

Before Data are Disclosed

Identifying roles and responsibilities and training staff appropriately ensures that the required expertise exists to disclose data responsibly and in compliance with applicable privacy laws and regulations. A comprehensive record-keeping system that tracks any activities that relate to the organization’s data handling leaves an auditable trail, instilling confidence in the policies and procedures and their implementation. Open and ongoing communication with relevant stakeholders can determine what information the organization should disclose regarding its data protection activities, without exposing information that would enable adversaries to re-identify the data more effectively.

After Data are Disclosed

After data have been disclosed, it is prudent to regularly reassess the disclosure environment, as technological abilities keep advancing, more information becomes available, and data privacy laws are established or amended regularly. 

In support of this, the framework advises keeping track of all the data the organization has disclosed in order to spot any potential linkage between the released datasets, new publicly available data, previous data recipients, new technologies, and developments in the law. 
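
A disclosure register of this kind could be as simple as the sketch below, which flags pairs of past releases that share several attributes and might therefore be linked. The `Release` record, its fields, and the cut-off of two shared attributes are hypothetical choices for illustration and are not specified by the framework.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Release:
    """One entry in a hypothetical register of disclosed datasets."""
    name: str
    recipient: str
    attributes: frozenset

def shared_attribute_pairs(releases, min_shared=2):
    """Flag pairs of releases with at least `min_shared` attributes in common."""
    flagged = []
    for a, b in combinations(releases, 2):
        shared = a.attributes & b.attributes
        if len(shared) >= min_shared:
            flagged.append((a.name, b.name, sorted(shared)))
    return flagged

register = [
    Release("survey_2021", "University A", frozenset({"age_band", "region", "income"})),
    Release("claims_2022", "Vendor B", frozenset({"age_band", "region", "diagnosis"})),
    Release("web_stats", "Public", frozenset({"browser", "region"})),
]
flagged = shared_attribute_pairs(register)
print(flagged)
```

A flagged pair is only a prompt for human review: whether two releases can actually be linked depends on the values in the data, the recipients involved, and the controls in place.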

In the event of a privacy incident, breach containment, mitigation, and reporting are paramount. Immediately after, lessons learned should be discussed and any gaps in the policies and procedures should be identified and closed. Creating an audit trail during the breach response phase is also important to demonstrate proper measures were taken and the legal obligations were fulfilled. 

Conclusion

The privacy enhancing data de-identification framework – ISO/IEC 27559:2022(E) gives actionable guidelines to organizations that wish to, or are required to, disclose data in their custody. The strategies and considerations informing the context, data, and identifiability assessments and de-identification governance are generally applicable, yet can easily be adapted to any organization’s particular situation. The level of technicality is kept to a minimum, making the framework comprehensible to many relevant stakeholders.

The technical expertise still needs to come in, of course, and help with determining the precise need for and implementing the actual de-identification. Private AI is well equipped to facilitate the categorization and de-identification of personal data, even in unstructured data and across 49 languages. Using the latest advancements in Machine Learning, the time-consuming work of redacting your data can be minimized and compliance partially automated. To see the tech in action, try our web demo, or request an API key to try it yourself on your own data.
