In the first part of this blog series, we discussed data privacy in Germany and the various obstacles associated with redacting Personally Identifiable Information (PII) in the German language. Now, in the second installment, we further explore the multifaceted landscape of German data privacy, shedding light on challenges that emerge not just from linguistic intricacies, but also from the sociocultural and historical contexts that shape the use of the German language across the globe.
Sociocultural Issues and Context
While the challenges with NER (Named Entity Recognition) described in the first part of this series stem from German’s linguistic features, a whole new set of issues arises from the sociocultural and historical context of how and where German is used in different regions of the world.
While Standard German is the institutionally supported national language of Germany, it is also an official language in Austria, Switzerland, Luxembourg, Liechtenstein, and parts of Belgium. German holds ‘minority language’ or ‘cultural language’ status in the Czech Republic, Hungary, Romania, Russia, Slovakia, and areas of Italy, Denmark, and Brazil, attesting to its significant use by communities in these regions. Because of the history of colonization by the former German Empire and the subsequent promotion of German language use in colonized areas, German is also a national language of Namibia, and pockets of German usage exist in many African and Micronesian states. On top of this, the historical emigration of a German-speaking diaspora means that nearly every continent hosts speakers who use German as a heritage or cultural language. In total, German is used worldwide by some 103.5 million native speakers and second-language learners.
Given how many states, countries, and continents across the globe are home to German speakers, it should be no surprise that German exhibits a vast degree of variation across its many varieties, or ‘dialects’, worldwide. Even in regions that count Standard German as a national language, a local German variety nearly always exists alongside the standardized one. In these situations, the two varieties are said to be diglossic, a linguistic term for the simultaneous use of two languages, or two varieties of the same language, by a single community. In a diglossic situation, you hear the standardized variety (e.g., Hochsprache) used in news broadcasts, at the office, or in educational institutions, but you’ll likely hear the nonstandardized or local variety (e.g., Bayerisch in Bavaria or Schweizerdeutsch in Switzerland) used among friends and family or in informal social situations.
Because local varieties of German exist nearly everywhere German is spoken, it’s not hard to imagine the difficulty of achieving data privacy and GDPR compliance for German-language data. Further, since local varieties often exist alongside standardized German, direct and quasi-identifiers carry a high degree of re-identification risk depending on which local variety of German is used, which local terminology is used for identifiers, or which local format identifiers take. For example: a legal contract using terminology from Österreichisches Deutsch (Austrian German) identifies its writer and signees as Austrian, use of the term Rijksregisternummer to refer to one’s national registration number identifies the user as Belgian, and a written address in German that contains a four-digit postcode beginning with the digits ‘94’ places the address in Liechtenstein. In these cases, the local variety of German, the piece of country-specific terminology, and even the region-specific format are all themselves quasi-identifiers. Careful and diligent documentation of these differences allows an entity detection solution to capture the PII present in a given text and aid GDPR compliance without the gaps that would be left by a system optimized for Standard German alone.
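To make this concrete, here is a minimal sketch of how region-specific quasi-identifiers might be flagged. The lookup table and the postcode pattern are simplified illustrations drawn from the examples above, not an exhaustive or production-grade rule set:

```python
import re

# Hypothetical lookup table for illustration only; real coverage of
# region-specific terminology requires careful, documented curation.
REGIONAL_TERMS = {
    "Rijksregisternummer": "Belgium",  # Belgian national registration number
    "Jänner": "Austria",               # Austrian German for 'January'
}

# Following the simplified example above: a four-digit postcode
# beginning with '94' is treated as pointing to Liechtenstein.
LIECHTENSTEIN_POSTCODE = re.compile(r"\b94\d\d\b")

def regional_quasi_identifiers(text: str) -> set:
    """Return the regions implied by terminology or formats in the text."""
    regions = {region for term, region in REGIONAL_TERMS.items() if term in text}
    if LIECHTENSTEIN_POSTCODE.search(text):
        regions.add("Liechtenstein")
    return regions
```

Even this toy version shows the point: the text need not contain a name or an ID number for its regional origin, itself a quasi-identifier, to leak out.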
De-Identification Under the GDPR
We have spoken a lot about the difficulty of rendering German text GDPR-compliant by redacting personal identifiers. Let’s unpack that and look at the regulatory requirements relevant in this context.
First of all, it is helpful to recognize that there is an entire spectrum of how data can be de-identified, with irreversible anonymization at the farthest end. In fact, once data is anonymized, it does not fall under the GDPR.
Still subject to the GDPR, but less stringently protected than identifiable data, is pseudonymized data: personal data that can no longer be attributed to a specific individual without the use of additional information. This additional information must be kept separate and subject to technical and organizational safeguards. Pseudonymizing personal data allows its processing for archiving purposes in the public interest, for scientific or historical research purposes, or for statistical purposes.
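As a rough illustration of the separation the GDPR requires, the following sketch replaces direct identifiers with random tokens and keeps the re-identification mapping apart from the data. The function name and field choices are hypothetical, and a real system would protect the key store with the technical and organizational safeguards mentioned above:

```python
import secrets

def pseudonymize(record: dict, identifier_fields: list):
    """Replace direct identifiers with random tokens.

    Returns the pseudonymized record plus a separate key store; under
    the GDPR, that key store must be kept apart and safeguarded.
    """
    pseudonymized = dict(record)
    key_store = {}  # to be stored separately, under strict safeguards
    for field in identifier_fields:
        if field in pseudonymized:
            token = f"PSEUDO-{secrets.token_hex(4)}"
            key_store[token] = pseudonymized[field]
            pseudonymized[field] = token
    return pseudonymized, key_store

record = {"name": "Anna Müller", "city": "Wien", "diagnosis": "J06.9"}
safe_record, key_store = pseudonymize(record, ["name"])
# safe_record can now be processed for research or statistics; only the
# separately held key_store links the token back to the individual.
```

Note that the remaining fields (city, diagnosis) may still act as quasi-identifiers, which is why pseudonymized data stays within the GDPR’s scope.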
There is also a third category of de-identified data that we will refer to as Article 11 data. Article 11(2) contemplates the situation where “the controller is able to demonstrate that it is not in a position to identify the data subject” to whom the personal data pertains. In these instances, the controller is released from several obligations under the GDPR: the data subject has no right to access, rectify, erase, or restrict the processing of this data, and the data subject’s right to data portability is also precluded.
All three types of de-identification mentioned in the GDPR can be achieved by manipulating the original data in various ways. The new ISO de-identification framework we wrote about in this blog provides useful guidance on some of the methods that can be used. They all have in common that as a first step, the PII present in a dataset must be identified. This sounds easy enough, particularly if you are dealing with a structured dataset where your columns are explicitly labeled with the type of data that they contain: e.g., name, date of birth, ZIP code, gender, etc. However, when your dataset contains unstructured data, such as medical notes, call transcripts, emails, meeting minutes, other free text, audio, or images, the identification exercise is prone to errors.
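To see why free text is error-prone, consider a naive, regex-based PII scan. The patterns below are deliberately simplified assumptions, not a real detector, and the example shows how easily such rules misfire:

```python
import re

# Deliberately naive patterns for illustration only; a real detector
# needs far more context-awareness than regular expressions provide.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),  # e.g., 14.03.1985
    "POSTCODE": re.compile(r"\b\d{4,5}\b"),              # DE: 5 digits; AT/CH/LI: 4
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_free_text(text: str) -> list:
    """Return (label, match) pairs for every pattern hit in the text."""
    return [
        (label, match.group())
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]

note = ("Patientin, geb. 14.03.1985, wohnhaft in 4020 Linz, "
        "erreichbar unter a.mueller@example.at")
findings = scan_free_text(note)
# The scan catches the date, postcode, and email address -- but it also
# flags the birth year '1985' as a POSTCODE, a typical false positive
# that labeled columns in structured data would never produce.
```

Structured data sidesteps this ambiguity because the column label tells you what each value is; in free text, the same four digits could be a postcode, a year, or part of a phone number.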
This is where technology can help. As we’ve explained here, redaction in the German language poses challenges even for powerful machine learning tools. Consequently, if you are considering acquiring technology to help with PII identification and redaction, pay attention to whether it has been optimized for the languages you will encounter in your dataset. If, on the other hand, you wish to build a solution yourself, be sure to add linguistic pitfalls to the list of difficulties you expect to face when trying to achieve high accuracy for PII identification.
While we’ve focused here on the challenges of redaction across varieties of German, Private AI has the necessary in-house expertise to train entity identification and redaction models in many different languages. So far, it’s 52 and counting. To see the tech in action, try our web demo, or get a free API key to try it yourself on your own data.