Identify, Redact, and Replace Personally Identifiable Information in Unstructured Text

Your Data, Your Way

We apply the latest advancements in transformer architectures to pick up PII based entirely on context, which makes us particularly effective on semi-structured and unstructured data. 

With Private AI, data, security, and machine learning teams can:

Built by experts from:

How It Works

Private AI is deployed on-prem as a single container, so you can easily add our powerful redaction capabilities to any data workflow. The container is accessed via a REST API and can be easily customized to your team’s needs.
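As a rough illustration of what calling a locally deployed container over REST might look like, here is a minimal sketch using only the Python standard library. The base URL, the `/deidentify` path, and the `text` payload field are placeholders invented for this example, not the documented API; check the official docs for the real endpoint and schema.

```python
import json
import urllib.request

# Hypothetical values: replace with the URL and payload schema from the official docs.
BASE_URL = "http://localhost:8080"
ENDPOINT = "/deidentify"  # placeholder path, not confirmed by the source


def build_request(text, base_url=BASE_URL, endpoint=ENDPOINT):
    """Build (but do not send) a JSON POST request for the given text."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request to a running container and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("Hi John, Grace this side.")
```

Building the request separately from sending it keeps the payload easy to inspect and test without a running container.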

PII IDENTIFICATION

Privacy is More Than an ML Model

Private AI detects more than 50 different entity types of personally identifiable information (PII) across 52 languages. Using our contextually aware ML models, we go beyond traditional entity detection to recognize many different kinds of direct and quasi-identifiers. 

What is and isn’t PII gets complicated, and Private AI’s team of privacy experts ensures our system works in compliance with major legislation like GDPR, CPRA, and HIPAA.

Private AI can be easily implemented as a filter to screen for PII in any data flow or database.

				
					
{
  "result": "Hi [NAME_1], [NAME_2] this side. It's been a while since we last met in [LOCATION_CITY_1].",
  "result_fake": null,
  "pii": [
    {
      "marker": "NAME_1",
      "text": "John",
      "best_label": "NAME",
      "stt_idx": 3,
      "end_idx": 7,
      "labels": {
        "NAME": 0.8446
      }
    },
    {
      "marker": "NAME_2",
      "text": "Grace",
      "best_label": "NAME",
      "stt_idx": 9,
      "end_idx": 14,
      "labels": {
        "NAME": 0.8399
      }
    },
    {
      "marker": "LOCATION_CITY_1",
      "text": "Berlin",
      "best_label": "LOCATION_CITY",
      "stt_idx": 63,
      "end_idx": 69,
      "labels": {
        "LOCATION_CITY": 0.8778,
        "LOCATION": 0.8512
      }
    }
  ],
  "api_calls_used": 1,
  "output_checks_passed": true
}
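A response shaped like the sample above can be consumed with a few lines of Python. The sketch below works on an abridged copy of that sample (only the fields it reads) and lists each detected entity with its highest-scoring label; the helper name `summarize_entities` is made up for this example.

```python
# Abridged copy of the sample response shown above (only the fields used here).
sample = {
    "result": "Hi [NAME_1], [NAME_2] this side. "
              "It's been a while since we last met in [LOCATION_CITY_1].",
    "pii": [
        {"marker": "NAME_1", "text": "John", "best_label": "NAME",
         "labels": {"NAME": 0.8446}},
        {"marker": "NAME_2", "text": "Grace", "best_label": "NAME",
         "labels": {"NAME": 0.8399}},
        {"marker": "LOCATION_CITY_1", "text": "Berlin", "best_label": "LOCATION_CITY",
         "labels": {"LOCATION_CITY": 0.8778, "LOCATION": 0.8512}},
    ],
}


def summarize_entities(response):
    """Return (marker, original text, best label, top score) for each entity."""
    rows = []
    for entity in response["pii"]:
        top_score = max(entity["labels"].values())
        rows.append((entity["marker"], entity["text"], entity["best_label"], top_score))
    return rows


rows = summarize_entities(sample)
```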
				
			

TEXT DE-IDENTIFICATION 

Redact at Higher Than Human Accuracy 

Private AI can replace all detected PII with unique identifiers (e.g., NAME_1, CVV_3, CREDIT_CARD_2) to produce redacted transcripts or de-identified data. Alternatively, it can replace PII with a mask character. See our docs to learn more.
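To make the two replacement styles concrete, here is a small local sketch that applies them to already-detected spans. It is not the product's implementation: the `redact` function, the `(start, end, label)` span format, and the per-label numbering scheme are assumptions made for illustration.

```python
from collections import defaultdict


def redact(text, spans, mask_char=None):
    """Replace each (start, end, label) span with a numbered marker,
    or with mask_char repeated over the span's length if mask_char is given."""
    counters = defaultdict(int)  # running count per label, for NAME_1, NAME_2, ...
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        if mask_char is None:
            counters[label] += 1
            pieces.append(f"[{label}_{counters[label]}]")
        else:
            pieces.append(mask_char * (end - start))
        cursor = end
    pieces.append(text[cursor:])  # untouched text after the last span
    return "".join(pieces)


text = "Hi John, Grace this side."
spans = [(3, 7, "NAME"), (9, 14, "NAME")]  # hypothetical detector output

with_markers = redact(text, spans)
with_mask = redact(text, spans, mask_char="#")
```

Markers keep the entity type visible to downstream readers and models, while masking preserves only the length of the original value.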

Unrivalled Accuracy

SYNTHETIC PII GENERATION

Never Use Transformers Without Privacy Mitigation

After PII is removed, Private AI can generate synthetic PII that fits the surrounding context and use it to replace all the PII found.

The synthetic PII generator never sees the original data, eliminating sensitive data leakage. The resulting text further reduces re-identification risk, as an adversary must first identify what PII is real. Good luck finding a piece of straw in a pile of hay!

Taking production data and replacing all PII with synthetic data also minimizes data shift from the original data, which is highly beneficial when creating ML models.
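The key property described above is that the replacement step only ever sees redacted text. The sketch below illustrates that flow with a fixed lookup table standing in for a real generator; it does not choose values that fit the context, and the `fill_markers` helper and `FAKE_VALUES` table are hypothetical.

```python
import re

# Matches markers such as [NAME_1] or [LOCATION_CITY_1].
MARKER = re.compile(r"\[([A-Z_]+)_(\d+)\]")

# Stand-in for a synthetic data generator: a fixed table of fake values per label.
FAKE_VALUES = {
    "NAME": ["Alex", "Sam"],
    "LOCATION_CITY": ["Springfield"],
}


def fill_markers(redacted, fake_values=FAKE_VALUES):
    """Replace each marker with a fake value; the original PII is never an input."""
    def substitute(match):
        label, index = match.group(1), int(match.group(2))
        options = fake_values.get(label)
        if not options:
            return match.group(0)  # leave unknown labels as markers
        return options[(index - 1) % len(options)]
    return MARKER.sub(substitute, redacted)


redacted = ("Hi [NAME_1], [NAME_2] this side. "
            "It's been a while since we last met in [LOCATION_CITY_1].")
synthetic = fill_markers(redacted)
```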


TOKENIZATION & PSEUDONYMIZATION

Reverse PII Removal As Needed

Replace PII with encrypted tokens using Private AI’s tokenization feature. Sometimes referred to as pseudonymization, tokenization preserves the utility of the data while still protecting what’s sensitive.

Tokenization is reversible, allowing you to easily recover the original data. Contact us for documentation and access.
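Reversibility is what separates tokenization from plain redaction. The sketch below shows the idea with an in-memory lookup table (a "vault") rather than the encrypted tokens described above, so it is a swapped-in technique for illustration only; the `TokenVault` class and the token format are invented for this example.

```python
import secrets


class TokenVault:
    """Toy reversible tokenizer backed by an in-memory mapping (not encryption)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, text, spans):
        """Replace each (start, end, label) span with a random token."""
        pieces, cursor = [], 0
        for start, end, label in sorted(spans):
            value = text[start:end]
            token = self._value_to_token.get((label, value))
            if token is None:
                # Same value and label always map to the same token.
                token = f"[{label}_{secrets.token_hex(4)}]"
                self._value_to_token[(label, value)] = token
                self._token_to_value[token] = value
            pieces.append(text[cursor:start])
            pieces.append(token)
            cursor = end
        pieces.append(text[cursor:])
        return "".join(pieces)

    def detokenize(self, text):
        """Restore the original values for every token this vault issued."""
        for token, value in self._token_to_value.items():
            text = text.replace(token, value)
        return text


vault = TokenVault()
original = "Hi John, Grace this side."
tokenized = vault.tokenize(original, [(3, 7, "NAME"), (9, 14, "NAME")])
restored = vault.detokenize(tokenized)
```

Because the same value always maps to the same token, joins and frequency counts on the tokenized data still work, which is one way such schemes preserve utility.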

Try It Free Today

99.5%+ Accuracy

The figure quoted is based on the number of PII words missed as a fraction of the total number of words. It was computed on an internal test dataset of 268 thousand words, comprising data from over 50 different sources, including web scrapes, emails, and ASR transcripts.
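As a worked illustration of this kind of metric, the sketch below assumes the headline figure is one minus the missed-word fraction; that reading is an assumption, not something the text states outright. Under it, a 268,000-word dataset with 1,340 missed PII words would sit exactly at 99.5%. The miss count is chosen only to show the arithmetic and is not a reported result.

```python
from fractions import Fraction


def accuracy_from_misses(missed_pii_words, total_words):
    """Accuracy as 1 - (missed PII words / total words), kept exact with Fraction."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 1 - Fraction(missed_pii_words, total_words)


# Illustrative numbers only: 1,340 misses out of 268,000 words.
example = accuracy_from_misses(1340, 268_000)
example_percent = float(example * 100)
```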

Please contact us for a copy of the code used to compute these metrics, try it yourself here, or download our whitepaper.

Download the Free Report

Request an API Key

Fill out the form below and we’ll send you a free API key for 500 calls (approx. 50k words). No commitment, no credit card required!

Language Packs

Expand the categories below to see which languages are included within each language pack.
Note: English capabilities are automatically included within the Enterprise pricing tier. 

French
Spanish
Portuguese

Arabic
Hebrew
Persian (Farsi)
Swahili

French
German
Italian
Portuguese
Russian
Spanish
Ukrainian
Belarusian
Bulgarian
Catalan
Croatian
Czech
Danish
Dutch
Estonian
Finnish
Greek
Hungarian
Icelandic
Latvian
Lithuanian
Luxembourgish
Polish
Romanian
Slovak
Slovenian
Swedish
Turkish

Hindi
Korean
Tagalog
Bengali
Burmese
Indonesian
Khmer
Japanese
Malay
Moldovan
Norwegian (Bokmål)
Punjabi
Tamil
Thai
Vietnamese
Mandarin (simplified)

Arabic
Belarusian
Bengali
Bulgarian
Burmese
Catalan
Croatian
Czech
Danish
Dutch
Estonian
Finnish
French
German
Greek
Hebrew
Hindi
Hungarian
Icelandic
Indonesian
Italian
Japanese
Khmer
Korean
Latvian
Lithuanian
Luxembourgish
Malay
Mandarin (simplified)
Moldovan
Norwegian (Bokmål)
Persian (Farsi)
Polish
Portuguese
Punjabi
Romanian
Russian
Slovak
Slovenian
Spanish
Swahili
Swedish
Tagalog
Tamil
Thai
Turkish
Ukrainian
Vietnamese

Recall

Tested on a dataset of messy conversational data containing sensitive health information. Download our whitepaper for more details, including our performance in terms of accuracy and F1 score, or contact us for a copy of the evaluation code.