GDPR in Germany: Challenges of German Data Privacy (Part 2)

Written by: Lisa Amdan Schlegl, Kathrin Gardhouse, Irina Presnyakova, Heather Stephens, Fiona Wilson  |  November 7, 2023

In the first part of this blog series, we discussed data privacy in Germany and the various obstacles associated with redacting Personally Identifiable Information (PII) in the German language. Now, in the second installment, we further explore the multifaceted landscape of German data privacy, shedding light on challenges that emerge not just from linguistic intricacies, but also from the sociocultural and historical contexts that shape the use of the German language across the globe.

Sociocultural Issues / Context

While the challenges with NER detection described in Part 1 stem from German’s linguistic features, a whole new set of issues arises from the sociocultural and historical context of how and where German is used in different regions of the world.

While Standard German is the institutionally supported national language of Germany, it is also an official language in Austria, Switzerland, Luxembourg, Liechtenstein, and parts of Belgium. German holds ‘minority language’ or ‘cultural language’ status in the Czech Republic, Hungary, Romania, Russia, Slovakia, and areas of Italy, Denmark, and Brazil, attesting to its significant use by communities in these regions. Because of the history of colonization by the former German Empire and subsequent promotion of German language use in colonized areas, German is also a national language of Namibia, and pockets of German usage exist in many African and Micronesian states. On top of this, historical emigration of a German-speaking diaspora means that nearly every continent hosts speakers who use German as a heritage or cultural language. In total, 103.5 million people worldwide use German, counting both native speakers and second-language learners.

Given how many states, countries, and continents across the globe are home to German speakers, it should be no surprise that a vast degree of variation exists in the many different varieties, or ‘dialects’, of German worldwide. Even among regions that count Standard German as a national language, a local German variety nearly always exists alongside the standardized German variety. In these situations, the two varieties are said to be diglossic, which is a linguistic term for the simultaneous usage of two languages, or two varieties of the same language, by a single community. Diglossia describes a situation in which you hear a standardized language variety (e.g., Hochsprache) used in news broadcasts, at the office, or in educational institutions, but you’ll likely hear the nonstandardized or local language variety (e.g., Bayerisch in Bavaria or Schweizerdeutsch in Switzerland) used among friends and family or in informal social situations.

Because local varieties of German exist nearly everywhere that German is spoken, it’s not hard to imagine the difficulty of data privacy and GDPR compliance in Germany. Further, since local varieties often exist alongside standardized varieties of German, direct and quasi-identifiers carry a high reidentification risk that depends on which local variety of German is used, which local terminology is used for identifiers, and which local format identifiers take. For example, a legal contract using terminology from Austrian German (Österreichisches Deutsch) identifies its writer and signees as Austrian; use of the term Rijksregisternummer to refer to one’s national registration number identifies the user as Belgian; and a written address in German that contains a four-digit postcode between 9485 and 9498 identifies the address as being in Liechtenstein (a bare ‘94’ prefix is not enough on its own, since neighboring Swiss towns share it). In these cases, the local variety of German, the piece of country-specific terminology, and even the region-specific format are all themselves quasi-identifiers. Careful and diligent documentation of these differences allows an entity detection solution to capture the PII present in a given text and aid GDPR compliance without leaving the gaps that a system optimized for Standard German alone would leave.
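To make the idea of regional markers as quasi-identifiers more concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than a description of any production system: the tiny term list, the function name `regional_hints`, and the postcode rule (which assumes Liechtenstein postcodes fall between 9485 and 9498) are all hypothetical. A real solution would need far more carefully documented regional data.

```python
import re

# Hypothetical, deliberately tiny lookup of region-specific terms.
# "Rijksregisternummer" is the example from the text; "Jänner" is the
# Austrian German word for January. Real coverage would be far broader.
REGIONAL_TERMS = {
    "Rijksregisternummer": "BE",
    "Jänner": "AT",
}

# Assumption: Liechtenstein postcodes lie in the range 9485-9498.
LI_POSTCODE = re.compile(r"\b94(?:8[5-9]|9[0-8])\b")


def regional_hints(text: str) -> list[tuple[str, str]]:
    """Return (marker, region code) pairs for regional markers found in text."""
    hints = []
    lowered = text.lower()
    for term, region in REGIONAL_TERMS.items():
        if term.lower() in lowered:
            hints.append((term, region))
    for match in LI_POSTCODE.finditer(text):
        hints.append((match.group(), "LI"))
    return hints


sample = "Vertrag vom 3. Jänner, Rijksregisternummer liegt bei, Städtle 1, 9490 Vaduz"
print(regional_hints(sample))
```

Even when names and numbers are redacted, markers like these can narrow down where a person is from, which is why they deserve to be treated as quasi-identifiers in their own right.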

De-Identification Under the GDPR

We spoke a lot about the difficulty of rendering German text GDPR compliant by redacting personal identifiers. Let’s unpack that and look at the regulatory requirements that are relevant in this context. 

First of all, it is helpful to recognize that there is an entire spectrum of how data can be de-identified, with irreversible anonymization at the farthest end. In fact, once data is anonymized, it does not fall under the GDPR. 

Still subject to the GDPR, but less stringently protected than identifiable data, is pseudonymized data, which is personal data that is not attributable to a specific individual without the use of additional information. This additional information must be kept separate and subjected to technical and organizational safeguards. Pseudonymizing personal data allows its processing for archiving purposes in the public interest, scientific or historical research purposes, or statistical purposes.
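One common way to picture pseudonymization is keyed hashing: direct identifiers are replaced with tokens derived from a secret key, and the key plays the role of the “additional information” that must be stored separately. The sketch below is only an illustration under that assumption. The function names and the `ID_` prefix are hypothetical, and whether such a scheme meets the GDPR’s requirements in a given case is a legal question this example does not settle.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym for value using HMAC-SHA256.

    The secret key must be stored apart from the pseudonymized data;
    without it, the pseudonym cannot be recomputed from a known name.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "ID_" + digest[:16]  # truncated for readability


def pseudonymize_record(record: dict, fields: set, secret_key: bytes) -> dict:
    """Replace the chosen fields of a record with pseudonyms."""
    return {
        key: pseudonymize(value, secret_key) if key in fields else value
        for key, value in record.items()
    }


key = b"keep-this-key-in-a-separate-system"  # placeholder, not a real key
record = {"name": "Lena Huber", "diagnosis": "Asthma"}
print(pseudonymize_record(record, {"name"}, key))
```

Because the same name and key always yield the same token, records belonging to one person can still be linked for research or statistics, while re-attributing them to that person requires access to the separately held key.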

There is also a third category of de-identified data that we will refer to as Article 11 data. Article 11(2) contemplates the situation where “the controller is able to demonstrate that it is not in a position to identify the data subject” to whom the personal data pertains. In these instances, the controller is released from several obligations under the GDPR: the data subject has no right to access, rectify, erase, or restrict the processing of this data, and the data subject’s right to data portability is also precluded. The exception is where the data subject provides additional information that enables their identification in order to exercise these rights.

All three types of de-identification mentioned in the GDPR can be achieved by manipulating the original data in various ways. The new ISO de-identification framework we wrote about in this blog provides useful guidance on some of the methods that can be used. What these methods have in common is that, as a first step, the PII present in a dataset must be identified. This sounds easy enough, particularly if you are dealing with a structured dataset where your columns are explicitly labeled with the type of data that they contain: e.g., name, date of birth, ZIP code, gender, etc. However, when your dataset contains unstructured data, such as medical notes, call transcripts, emails, meeting minutes, other free text, audio, or images, the identification exercise is prone to errors.
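The contrast between structured and unstructured data can be sketched in a few lines of Python. This is a toy illustration, not a recommended approach: the column names, the `[REDACTED]` and `[DATE]` placeholders, and the single date pattern are all assumptions made for the example.

```python
import re

# Assumed column labels for a structured dataset.
PII_COLUMNS = {"name", "date_of_birth", "zip_code"}


def redact_record(record: dict) -> dict:
    """Structured case: the column label tells us what to redact."""
    return {
        key: "[REDACTED]" if key in PII_COLUMNS else value
        for key, value in record.items()
    }


# Unstructured case: a naive pattern for numeric dates such as 07.11.1985.
NUMERIC_DATE = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")


def redact_dates(text: str) -> str:
    """Replace numeric dates in free text; spelled-out dates slip through."""
    return NUMERIC_DATE.sub("[DATE]", text)


print(redact_record({"name": "Anna Muster", "zip_code": "80331", "diagnosis": "Asthma"}))
print(redact_dates("Geboren am 07.11.1985 in Wien."))
print(redact_dates("Geboren am siebten November 1985 in Wien."))
```

The last call returns its input unchanged, because a date written out in words does not match the numeric pattern. Gaps of this kind are what make PII identification in free text error-prone.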

This is where technology can help. As we’ve explained here, redaction in the German language poses challenges even for powerful machine learning tools. Consequently, if you consider acquiring technology to help with PII identification and redaction, you must pay attention to whether it has been optimized for the languages you will encounter in your dataset. If, on the other hand, you wish to build a solution yourself, be sure to add linguistic pitfalls to the list of difficulties you expect to face when trying to achieve high accuracy for PII identification.

While we’ve focused here on the challenges of redaction across varieties of German, Private AI has the necessary in-house expertise to train entity identification and redaction models in many different languages. So far, it’s 52 and counting. To see the tech in action, try our web demo, or get a free API key to try it yourself on your own data.
