If your organization ever finds itself in the position of wishing, or being required, to disclose personally identifiable information (PII), e.g., to third parties for processing purposes, to researchers for scientific purposes, or to the public as a result of access-to-information obligations, you have to ensure that the privacy of the individuals to whom the data pertains is adequately protected. The new ISO framework provides guidance on how to do so.
Properly protecting PII contained in datasets that are to be disclosed, however widely, requires an assessment of the disclosure context, an assessment of the data itself (in particular its identifiability), mitigation of that identifiability, and de-identification governance before as well as after the disclosure. The new ISO framework addresses these steps in detail, and we summarize the highlights in this blog.
While ISO standards are voluntary, this framework can be a useful supplement to the requirements of many privacy laws and regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) and the GDPR, which make the de-identification of data mandatory before certain kinds of disclosure.
Context Assessment
Context assessment refers to determining the data recipient’s privacy and security posture, the intended use of the data, the level of transparency with which the data will be disclosed, and the additional knowledge the data recipient likely has about the individuals represented in the dataset. It also requires an examination of potential threats.
Threat Modeling
Potential threats can be categorized into three kinds: deliberate, accidental, and environmental. A deliberate attack is an attempt to identify individuals in the dataset, for example by an insider within the recipient’s infrastructure, such as an employee. An accidental threat, on the other hand, describes the unintentional re-identification of PII, for example where the data recipient happens to recognize an individual whose data are included in the dataset. An environmental threat refers to the loss or theft of data when IT security controls fail.
The data recipient’s ability to prevent and mitigate the realization of each of these threats should be assessed. For consistency and efficiency, it is recommended that third-party audits be undertaken and relevant certifications considered.
Where data are shared non-publicly, a prudent mitigation strategy involves imposing contractual obligations upon the data recipient that require it, for example, to conduct awareness training, limit use cases, prohibit further sharing of the data, and permit recurrent compliance audits.
Transparency and Impact Assessment
In addition to assessing the data recipient, a context assessment should also include engaging other stakeholders, such as the individuals represented in the data, organizations that disclose similar data, or privacy regulators. Disclosing one’s collection, use, and disclosure practices fosters trust and enables stakeholders to voice concerns, which in turn helps strike an appropriate balance between the risks and benefits of disclosure.
Privacy impact assessments are a useful, and often mandatory, mechanism by which privacy risks are identified and mitigation strategies surfaced. The earlier such an impact assessment is undertaken, the better, since an early stage offers the most opportunities to implement privacy by design.
Data Assessment
The purpose of the data assessment is to understand the features of the data that are disclosed and what external background information is likely available to an adversary, that is, a person attempting to re-identify the individuals represented in the dataset. These insights inform the decision of which data points need to be de-identified to protect privacy and which can remain in support of the use case.
Data Features
The ISO framework provides a helpful categorization that can be used to structure the data assessment: a dataset is composed of population units (usually represented by the rows in the case of structured data) and the information about those units (called attributes).
On the level of the population unit, the organization that intends to disclose the data should consider whether the data represents a single unit, i.e., a person or a household, whether it is an aggregate of several units, whether particularly vulnerable individuals are included, and whether the entire dataset is disclosed or just a sample, leaving the adversary uncertain as to who is represented in the data.
Considering the attributes, it needs to be determined whether they constitute direct or indirect identifiers, how unique and sensitive they are, whether they can be linked with other available data, and what an adversary could learn about them through targeted searches. This assessment will show the value of the data to an adversary.
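To make the uniqueness assessment concrete, here is a minimal Python sketch using hypothetical toy records and an assumed choice of quasi-identifiers; it measures how many records carry a quasi-identifier combination that occurs only once in the dataset:

```python
from collections import Counter

# Hypothetical toy records; (age, zip_code, gender) is an assumed set of
# indirect identifiers (quasi-identifiers) chosen for illustration.
records = [
    (34, "10115", "F"),
    (34, "10115", "F"),
    (71, "10117", "M"),
    (29, "10115", "M"),
]

# Count how often each quasi-identifier combination occurs.
counts = Counter(records)

# A combination occurring exactly once is unique in the dataset and thus a
# strong re-identification lead for an adversary with matching background
# knowledge.
unique_share = sum(1 for size in counts.values() if size == 1) / len(records)
print(f"Share of records with a unique quasi-identifier combination: {unique_share:.2f}")
```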
Attack Modelling
The data assessment, in a second step, requires a quantification of the risk. For this purpose, the framework considers only deliberate attacks and divides them into three scenarios: (1) prosecutor risk, where the adversary knows that the individual they are targeting is included in the dataset; (2) journalist risk, where the adversary does not know whether the targeted individual’s data are included, e.g., because only a sample of the complete dataset is made available; and (3) marketer risk, where the attack is not targeted and the objective is instead to identify as many individuals as possible.
The metrics associated with these attacks are the maximum risk and the average risk. The former is the metric of choice when no security measures are applied, such as when the data are made publicly available. The maximum risk is calculated by considering the risk to the individual who is most likely to be identified in an attack. This level of prudence is necessary because it must be expected that an attacker will attempt identification even if just to demonstrate that it is possible.
The average risk metric can be applied if there are additional controls in place to protect the data, so that a demonstration attack is prevented. As the name suggests, in this case the average identifiability across the entire dataset is calculated.
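To illustrate the two metrics, the following sketch computes maximum and average risk under the common equivalence-class reading of prosecutor risk, where a record’s risk is one over the number of records sharing its quasi-identifier combination; the toy records are hypothetical:

```python
from collections import Counter

# Hypothetical records reduced to their quasi-identifiers (age, zip code).
records = [("34", "10115"), ("34", "10115"), ("71", "10117"),
           ("29", "10115"), ("29", "10115"), ("29", "10115")]

# Size of each equivalence class, i.e., how many records share the same
# quasi-identifier combination.
class_sizes = Counter(records)
n = len(records)

# Maximum risk is driven by the smallest class; here the single record
# with the combination ("71", "10117") yields a risk of 1.0.
max_risk = 1 / min(class_sizes.values())

# Average risk over all records: each record contributes 1/class_size,
# which sums to the number of classes, so the average is classes / n.
avg_risk = len(class_sizes) / n

print(f"maximum risk: {max_risk:.3f}, average risk: {avg_risk:.3f}")
```

In this toy dataset, the one unique record pushes the maximum risk to 1.0, while the average risk is only 0.5, which illustrates why the stricter maximum-risk metric is reserved for releases without additional controls.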
Identifiability Assessment and Mitigation
In the identifiability assessment, the results of the context and data assessments are brought together to quantify identifiability. The probability that an individual is identified is the probability that identification occurs given a threat, e.g., a deliberate attack, multiplied by the probability that the threat occurs:
P(identification) = P(identification | threat) × P(threat)
Assessing Identifiability
The result of this calculation is a probability that can then be compared to the well-established identifiability thresholds set out in Annex B of the ISO framework. Depending on whether an attack against an entity or a group is being modeled, and on how likely and impactful the attack would be, the threshold lies between 0.0005 and 0.1.
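A hypothetical worked example shows how the formula and a threshold interact; the probabilities and the chosen threshold below are illustrative assumptions, not values taken from the framework:

```python
# Assumed values for a non-public release with contractual controls in place.
p_id_given_threat = 0.05  # identifiability of the data, given an attack occurs
p_threat = 0.3            # probability that a deliberate attack occurs

# P(identification) = P(identification | threat) * P(threat)
p_identification = p_id_given_threat * p_threat  # = 0.015

# Compare against a threshold chosen from Annex B's range (0.0005 to 0.1);
# 0.05 here is an assumption for illustration only.
threshold = 0.05
print("acceptable" if p_identification <= threshold else "further mitigation needed")
```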
Subjective as well as objective factors may be considered when evaluating the impact a successful attack would have on an individual whose data are disclosed. The higher the impact, the lower the threshold should be. However, legitimate benefits of the disclosure can also be taken into account, as well as the reasonable expectation of privacy of the individuals whose information is contained in the dataset.
Assumptions made about the identifiability of the data should be subjected to adversarial testing, a simulation in which a friendly adversary of reasonable competence, equipped with commonly available external resources, attempts re-identification. While good practice, the ISO framework concedes that this approach is resource intensive and cannot represent all the possible threats the data may be exposed to.
Mitigation
Mitigation efforts can be directed at the context of disclosure, at the data itself, or both. Where applicable, the data recipient may be contractually required to enhance its security practices, limit access on a strict need-to-know basis, permit the disclosing organization to check analytical outputs, and so on.
Modifying the data itself will often impact the utility of the data; however, it may be the only mitigation strategy available, depending on the data recipient. The framework thus recommends eliminating all direct identifiers or replacing them with values that are not linked in any way to the original information. For high-risk data, this process should be irreversible.
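As a minimal sketch of this recommendation, the snippet below replaces a direct identifier with a random surrogate; because the token is drawn at random and no mapping table is kept, the substitution is irreversible. The record and field names are hypothetical:

```python
import secrets

# Hypothetical record containing a direct identifier ("name").
record = {"name": "Ada Lovelace", "age": 36, "zip_code": "10115"}

# Replace the direct identifier with a random surrogate that has no link
# to the original value; since no lookup table is retained, the step
# cannot be undone, as recommended for high-risk data.
record["name"] = secrets.token_hex(8)
print(record)
```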
Indirect identifiers and sensitive information that are not required for the data analysis should also be eliminated or otherwise modified to achieve an acceptable identifiability risk level. The simplest modifications, which are at the same time quite effective, are generalization and sampling. Generalization reduces the level of detail, e.g., by enlarging numerical intervals (providing an age range rather than an exact age) or by combining several categories of data into one. Sampling means leaving out the records of some individuals, introducing uncertainty for the adversary as to whether a particular individual’s data are contained in the dataset.
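The sketch below illustrates both techniques on hypothetical records: exact ages are generalized into ten-year intervals, and only a random subset of the records is released:

```python
import random

# Hypothetical records with exact ages.
records = [{"age": 34, "zip": "10115"}, {"age": 71, "zip": "10117"},
           {"age": 29, "zip": "10115"}, {"age": 45, "zip": "10119"}]

# Generalization: replace the exact age with a ten-year interval.
for r in records:
    low = (r["age"] // 10) * 10
    r["age"] = f"{low}-{low + 9}"

# Sampling: release only a random subset, so an adversary cannot know
# whether a given individual's record is in the disclosed data at all.
released = random.sample(records, k=2)
print(released)
```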
Following the mitigation efforts, it is recommended to re-evaluate the data’s identifiability as measured against the chosen threshold.
De-Identification Governance
The ISO framework suggests integrating data sharing policies and procedures into the organization’s wider information security practices. This systematizes the approach to disclosure, makes it repeatable and auditable, and enables the organization to respond effectively to privacy incidents and breaches.
Before Data are Disclosed
Identifying roles and responsibilities and training staff appropriately ensures that the required expertise exists to disclose data responsibly and in compliance with applicable privacy laws and regulations. A comprehensive record-keeping system that tracks all activities relating to the organization’s data handling leaves an auditable trail, instilling confidence in the policies and procedures and their implementation. Open and ongoing communication with relevant stakeholders can determine what information the organization should disclose regarding its data protection activities, without exposing information that would enable adversaries to re-identify the data more effectively.
After Data are Disclosed
After data have been disclosed, it is prudent to regularly reassess the disclosure environment, as technological abilities keep advancing, more information becomes available, and data privacy laws are established or amended regularly.
In support of this, the framework advises keeping track of all the data the organization has disclosed in order to spot any potential linkage between the released datasets, new publicly available data, previous data recipients, new technologies, and developments in the law.
In the event of a privacy incident, breach containment, mitigation, and reporting are paramount. Immediately afterwards, lessons learned should be discussed and any gaps in the policies and procedures identified and closed. Creating an audit trail during the breach response phase is also important to demonstrate that proper measures were taken and legal obligations were fulfilled.
Conclusion
The privacy-enhancing data de-identification framework, ISO/IEC 27559:2022(E), gives actionable guidance to organizations that wish to, or are required to, disclose data in their custody. The strategies and considerations informing the context, data, and identifiability assessments and de-identification governance are generally applicable, yet can easily be adapted to any organization’s particular situation. The level of technicality is kept to a minimum, making the framework comprehensible to many relevant stakeholders.
Of course, technical expertise is still needed to determine the precise de-identification requirements and to implement the actual de-identification. Private AI is well equipped to facilitate the categorization and de-identification of personal data, even in unstructured data and across 49 languages. Using the latest advancements in machine learning, Private AI minimizes the time-consuming work of redacting your data and partially automates compliance. To see the tech in action, try our web demo, or request an API key to try it yourself on your own data.