Creating De-Identified Embeddings

Embeddings are an increasingly popular data science tool, used across the industry for tasks such as search, recommendation, and retrieving context for large language models (LLMs).
What are Embeddings?
Embeddings are essentially a numerical representation of data, used to determine the relationship (if any) between entities. For more information on embeddings, see here.
One popular use for embeddings is to give LLMs content that they haven’t been trained on. An entire file is often too large for the LLM’s prompt, so only segments of the file can be sent, along with a question or instruction for the LLM. Using embeddings, the most relevant parts of the file can be selected and sent in the prompt, so the LLM gets the best context to answer the question.
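To make the idea of "most relevant" concrete, here is a minimal sketch that ranks two chunks against a question using cosine similarity. The three-dimensional vectors are made up for illustration (real embeddings have many more dimensions), and the names cosine_similarity, toy_chunks, toy_question and most_relevant are hypothetical helpers, not part of any library.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings (assumption for illustration).
toy_chunks: Dict[str, List[float]] = {
    "match report": [0.9, 0.1, 0.0],
    "recipe": [0.0, 0.2, 0.9],
}
toy_question = [0.8, 0.3, 0.1]


def most_relevant(question: List[float], chunks: Dict[str, List[float]]) -> str:
    """Pick the chunk whose vector points in the direction closest to the question."""
    return max(chunks, key=lambda name: cosine_similarity(question, chunks[name]))


best = most_relevant(toy_question, toy_chunks)
print(best)
```

The chunk whose vector is closest in direction to the question vector is the one that would be placed in the prompt.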
In this article, we’ll show you how to use de-identified embeddings to get meaningful context, while adding a layer of privacy to keep your data safe.
Getting Started
To test out how embeddings work, we’ll be using a summary of each group’s performance during the 2022 World Cup, taken from Wikipedia.
Main article: 2022 FIFA World Cup Group A
The first match of the tournament was held between Qatar and Ecuador in Group A. Ecuador had a disallowed goal in the opening minutes, but eventually won 2–0 with two goals from Enner Valencia. Qatar became the first host nation to lose their opening match at a World Cup. Many Qatar natives were seen leaving the game before the end, with ESPN reporting that two-thirds of the attendance had left. The other starting match in group A was won by the Netherlands 2–0 over Senegal. Cody Gakpo scored the opening goal in the 84th minute and Davy Klaassen added a second in stoppage time. Senegal faced Qatar in the third match of the group; Boulaye Dia capitalised on a slip by Boualem Khoukhi to put Senegal 1–0 ahead. Famara Diédhiou scored a second with a header, before Mohammed Muntari scored Qatar's first-ever goal at a World Cup to reduce the deficit back to one. Senegal eventually won the match 3–1 after an 84th-minute goal by Bamba Dieng. With this result, Qatar became the first team to be eliminated from the tournament, as well as becoming the first host nation to ever be knocked out of the tournament after two games. Gakpo scored his second goal of the tournament as the Netherlands led Ecuador; however, Valencia scored an equaliser in the 49th minute. The Netherlands won 2–0 against Qatar following goals by Gakpo and Frenkie de Jong to win the group, while Qatar attained the distinction of being the first home nation to lose all three group matches. Senegal faced Ecuador to determine the second knockout round qualifier. At the end of the first half, Ismaïla Sarr scored a penalty kick to put Senegal ahead. In the 67th minute, Moisés Caicedo scored an equaliser, but shortly after, Kalidou Koulibaly gave Senegal the victory. The win was enough to qualify Senegal as the runners-up of Group A.
Main article: 2022 FIFA World Cup Group B
England completed a 6–2 victory over Iran. Iranian keeper Alireza Beiranvand was removed from the game for a suspected concussion before England scored three first-half goals. Mehdi Taremi scored in the second half after which England defender Harry Maguire was also removed for a concussion. Timothy Weah, of the United States, scored a first-half goal against Wales; however, the match finished as a draw after a penalty kick was won and scored by Gareth Bale. Iran defeated Wales 2–0 following a red card to Welsh goalkeeper Wayne Hennessey after he committed a foul outside of his penalty area. Substitute Rouzbeh Cheshmi scored the first goal eight minutes into stoppage time, followed by Ramin Rezaeian scoring three minutes later. England and the United States played to a 0–0 draw, with only four shots on target between them. England won the group following a 3–0 win over Wales with a goal by Phil Foden and two by Rashford. Christian Pulisic scored the winning goal as the United States defeated Iran 1–0 to qualify for the round of 16.
Main article: 2022 FIFA World Cup Group C
Argentina took an early lead against Saudi Arabia after Lionel Messi scored a penalty kick after ten minutes; however, second-half goals by Saleh Al-Shehri and Salem Al-Dawsari won the match 2–1 for Saudi Arabia, a result described as "the biggest upset in the history of the World Cup." The match between Mexico and Poland ended as a goalless 0–0 draw after Guillermo Ochoa saved Robert Lewandowski's penalty kick attempt. Lewandowski scored his first career World Cup goal in a 2–0 win over Saudi Arabia four days later. Argentina defeated Mexico 2–0, with Messi scoring the opener and later assisting teammate Enzo Fernández who scored his first international goal. Argentina won their last game as they played Poland with goals by Alexis Mac Allister and Julián Álvarez which was enough to win the group; Poland qualified for the knockout stage on goal difference.
Main article: 2022 FIFA World Cup Group D
The match between Denmark and Tunisia ended as a goalless draw; both teams had goals disallowed by offside calls. Danish midfielder Christian Eriksen made his first major international appearance since suffering a cardiac arrest at the UEFA Euro 2020. Defending champions France went a goal behind to Australia, after a Craig Goodwin goal within ten minutes. France, however, scored four goals, by Adrien Rabiot, Kylian Mbappé and two by Olivier Giroud to win 4–1. The goals tied Giroud with Thierry Henry as France's all-time top goalscorer. Mitchell Duke scored the only goal as Australia won against Tunisia. This was their first World Cup win since 2010. Mbappé scored a brace as France defeated Denmark 2–1. This was enough for France to qualify for the knockout round – the first time since Brazil in 2006 that the defending champions progressed through the opening round. Mathew Leckie scored the only goal as Australia defeated Denmark 1–0, qualifying for the knockout round as runners-up with the win. Wahbi Khazri scored for Tunisia against France in the 58th minute. Although Antoine Griezmann equalised in stoppage time it was overturned for offside. Tunisia finished third in the group, as they required a draw in the Denmark and Australia game.
Main article: 2022 FIFA World Cup Group E
Group E began with Japan facing 2014 champions Germany. After an early penalty kick was converted by Germany's İlkay Gündoğan, Japan scored two second-half goals by Ritsu Dōan and Takuma Asano in a 2–1 upset win. In the second group match, Spain defeated Costa Rica 7–0. First-half goals by Dani Olmo, Marco Asensio, and Ferran Torres were followed by goals by Gavi, Carlos Soler, Alvaro Morata, and a second by Torres. This was the largest defeat in a World Cup since Portugal's victory over North Korea in the 2010 event by the same scoreline. Costa Rica defeated Japan 1–0, with Keysher Fuller scoring with Costa Rica's first shot on target of the tournament. Germany and Spain drew 1–1, with Álvaro Morata scoring for Spain and Niclas Füllkrug scoring for Germany. Morata scored the opening goal for Spain against Japan as they controlled the first half of the match. Japan equalised on Ritsu Doan before a second goal by Kaoru Mitoma was heavily investigated by VAR for the ball being out of play. The goal was awarded, and Japan won the group following a 2–1 win. Serge Gnabry scored on ten minutes for Germany against Costa Rica and they led until half-time. Germany required a win, and for Japan to not win their match, or for both teams to win their matches by a combined goal difference of at least 9 goals, to qualify. In the second half, goals by Yeltsin Tejeda and Juan Vargas gave Costa Rica a 2–1 lead, which would have qualified them into the knockout stages ahead of Spain. Germany scored three further goals—two by Kai Havertz and a goal by Niclas Fullkrug, ending in a 4–2 win for Germany—which was not enough to qualify them for the final stages. Japan won the group ahead of Spain.
Main article: 2022 FIFA World Cup Group F
Group F's first match was a goalless draw between Morocco and Croatia. Canada had a penalty kick in the first half of their match against Belgium which was saved by Thibaut Courtois. Belgium won the match by a single goal by Michy Batshuayi. Belgium manager Roberto Martínez confirmed after the game that he believed Canada to have been the better team. Belgium lost 2–0 to Morocco, despite Morocco having a long-range direct free kick goal by Hakim Ziyech overturned for an offside on another player in the lead up to the goal. Two second-half goals from Zakaria Aboukhlal and Romain Saïss helped the Morocco win their first World Cup match since 1998. The match sparked riots in Belgium, with residents fires and fireworks being set off. Alphonso Davies scored Canada's first World Cup goal to give Canada the lead over Croatia. Goals by Marko Livaja, Lovro Majer, and two by Andrej Kramarić for Croatia completed a 4–1 victory. Morocco scored two early goals through Hakim Ziyech and Youssef En-Nesyri in their game against Canada and qualified following a 2–1 victory. Canada's only goal was an own goal by Nayef Aguerd. Croatia and Belgium played a goalless draw which eliminated Belgium, whose team was ranked second in the world, from the tournament.
Main article: 2022 FIFA World Cup Group G
Breel Embolo scored the only goal in Switzerland's 1–0 defeat of Cameroon. Richarlison scored two goals as Brazil won against Serbia, with star player Neymar receiving an ankle injury. Cameroon's Jean-Charles Castelletto scored the opening goal against Serbia, but they were quickly behind as Serbia scored three goals by Strahinja Pavlović, Sergej Milinković-Savić, and Aleksandar Mitrović either side of half time. Cameroon, however, scored goals through Vincent Aboubakar and Eric Maxim Choupo-Moting, completing a 3–3 draw. An 83rd-minute winner by Casemiro for Brazil over Switzerland was enough for them to qualify for the knockout stage. Having already qualified, Brazil were unable to win their final group game, as they were defeated by Cameroon 1–0 following a goal by Vincent Aboubakar. He was later sent off for removing his shirt in celebrating the goal. Cameroon, however, did not qualify, as Switzerland defeated Serbia 3–2.
Main article: 2022 FIFA World Cup Group H
Uruguay and South Korea played to a goalless draw. A goalless first half between Portugal and Ghana preceded a penalty converted by Cristiano Ronaldo to give Portugal the lead. In scoring the goal, Ronaldo became the first man to score in five World Cups. Ghana responded with a goal by André Ayew before goals by João Félix, and Rafael Leão by Portugal put them 3–1 ahead. Osman Bukari scored in the 89th minute to trail by a single goal, while Iñaki Williams had a chance to equalise for Ghana ten minutes into stoppage time, but slipped before shooting. The match finished 3–2 to Portugal. Ghanaian Mohammed Salisu opened the scoring against South Korea, with Mohammed Kudus following it up. In the second half, Cho Gue-sung scored a brace for South Korea, levelling the score. Mohammed Kudus scored again in the 68th minute, winning the match 3–2 for Ghana. Portugal defeated Uruguay 2–0 with two goals from Bruno Fernandes, advancing them to the knockout stage. The game's first goal appeared to have been headed in by Ronaldo, but the ball just missed his head. A controversial penalty decision was called late in the game, with a suspected handball from José María Giménez. Portugal led South Korea through Ricardo Horta after 10 minutes. However, goals by Kim Young-gwon and Hwang Hee-chan won the match 2–1 for South Korea. Giorgian de Arrascaeta scored two goals as Uruguay defeated Ghana 2–0. However, with South Korea winning, Uruguay required another goal to progress as they finished third on goals scored. Several Uruguay players left the pitch after the game surrounding the referees and followed them off the pitch.				
In order to create our solution, we first need to set up our development environment and ensure we can connect to all the necessary services.
Private AI Service
#1 - If you don’t already have access to the Private AI deidentification service, see our guide on getting set up with AWS.
#2 - Alternatively, you can request a free API key.
OpenAI
An OpenAI API key is needed to capture the text embeddings for our document. You can sign up here.
Python Environment
We’ll be coding this solution with Python. If you don’t have a Python environment set up, see the official Python for Beginners guide to get set up quickly and easily.
Installing Dependencies
We’ll need several Python modules to get started.
In order to receive embeddings for our input, we need access to an embedding model. OpenAI’s ADA-002 (text-embedding-ada-002) is a good example. We’ll install the OpenAI Python client so requests can be made easily. The code in this article uses the openai.Embedding.create interface, which was removed in version 1.0 of the client, so we pin an earlier version.
pip install "openai<1.0"
We’ll also install Private AI’s Python client for easy text deidentification.
pip install privateai_client
SciPy will help determine the relation between our questions and document.
pip install scipy
And we’ll store our data in pandas, for easy retrieval.
pip install pandas
Now that the environment is all set up, we’re ready to get coding. Let’s set up an initial script to get the sample data in the form of a pandas dataframe.
import openai
import os
import pandas as pd
from privateai_client import PAIClient, request_objects as rq
from scipy import spatial
from typing import List

def main():
    filepath = "world_cup.txt"
    delimiter = "\n\n"
    dataframe = get_dataframe(filepath, delimiter)

if __name__ == "__main__":
    main()
The text from the file needs to be split into chunks so that the script uses only the most relevant sections for the questions being asked. This can be done in a variety of ways (such as counting the tokens in the document), but for simplicity we’re going to split the data with a delimiter. Each group in the sample data has been separated by two newline characters, so we’ll use that as the delimiter for creating the chunks.
def get_dataframe(filepath: str, delimiter: str):
    # Read the whole file and split it into one chunk per group.
    with open(filepath, "r") as fin:
        text = fin.read().split(delimiter)
    data = {"Text": text}
    return pd.DataFrame(data)
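Splitting on a delimiter works for this sample because each group is already its own paragraph. For files without such convenient separators, a size-based split is one alternative. The sketch below is a rough illustration only: it counts whitespace-separated words as a stand-in for tokens (an assumption, since a real tokenizer would count differently), and chunk_by_word_budget is a hypothetical helper that is not used elsewhere in this article.

```python
from typing import List


def chunk_by_word_budget(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Words are approximated by splitting on whitespace; this is only a
    rough proxy for the token counts an embedding model would see.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


chunks = chunk_by_word_budget("Ecuador won the opening match against Qatar", 3)
print(chunks)
```

A split like this can cut a sentence in half, which is why the delimiter approach is preferable when the document has natural boundaries.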
Now that we have the data from the sample document, we need to get and store the embeddings for each entry. Let’s add a function to get the embeddings from OpenAI and store them in our data frame, next to the associated text.
First we’ll add our embedding function:
def get_embeddings(input_text: str):
    # Request an embedding for a single piece of text and return the vector.
    model = "text-embedding-ada-002"
    return openai.Embedding.create(input=input_text, model=model)["data"][0]["embedding"]
Then we’ll update the get_dataframe function to get and store the embeddings for each chunk of text.
def get_dataframe(filename: str, delimiter: str):
    with open(filename, "r") as fin:
        text = fin.read().split(delimiter)
    embeddings = [get_embeddings(row) for row in text]
    data = {"Text": text, "Embedding": embeddings}
    return pd.DataFrame(data)
Our data is ready to be used! The dataframe contains both the text and embeddings:
Text
0  Main article: 2022 FIFA World Cup Group A\nThe...
1  Main article: 2022 FIFA World Cup Group B\nEng...
2  Main article: 2022 FIFA World Cup Group C\nArg...
3  Main article: 2022 FIFA World Cup Group D\nThe...
4  Main article: 2022 FIFA World Cup Group E\nGro...
5  Main article: 2022 FIFA World Cup Group F\nGro...
6  Main article: 2022 FIFA World Cup Group G\nBre...
7  Main article: 2022 FIFA World Cup Group H\nUru...
Embedding
0  [-0.0021765846759080887, 0.0007000854820944369...
1  [-0.011230451986193657, 0.0040274495258927345,...
2  [-0.01025689858943224, 0.0030970079824328423, ...
3  [-0.021670430898666382, -0.0150331174954772, 0...
4  [-0.008809792809188366, 0.0010364461923018098,...
5  [-0.014114190824329853, -0.02050258219242096, ...
6  [-0.01570366881787777, -0.00018986705981660634...
7  [-0.011951795779168606, -0.00752691924571991, ...
At this point we need to be able to add questions and have those questions compared to the data to see how relevant it is. Let’s add an input loop to the main function, and a function to find the most relevant chunk of data for our question.
def main():
    openai.api_key = os.environ['OPENAI_API_KEY']
    filename = "world_cup.txt"
    delimiter = "\n\n"
    dataframe = get_dataframe(filename, delimiter)
    while True:
        message = input("Input question to find relevant text: ")
        if message == 'quit':
            break
        related_text = get_related_text(message, dataframe)
        print(f"based on the query\n\n{message}\n\n the most relevant part of the file is\n\n{related_text[0]}\n\n with a relatedness of {related_text[1]}")
The get_related_text function needs two steps:
- Get the embedding for the question being asked
- Compare the embedding of the question to the chunks of text from the file to find the best match
def get_related_text(query: str, dataframe: pd.DataFrame):
   query_embedding = get_embeddings(query)
   best_match = ''
   best_relatedness = 0
   for i, row in dataframe.iterrows():
      relatedness = get_relatedness(query_embedding, row['Embedding'])
      if relatedness > best_relatedness:
         best_relatedness = relatedness
         best_match = row['Text']
   return (best_match, best_relatedness)
We’ll add one more function to find the numerical relatedness of the chunk and question:
def get_relatedness(query_embedding, file_embedding):
    # Cosine similarity: 1 minus SciPy's cosine distance.
    return 1 - spatial.distance.cosine(query_embedding, file_embedding)
Great, now the script is ready to run!
Let’s test it out with a question:
Did any groups have a goalless starting match?
Output:
based on the query
Did any groups have a goalless starting match?
the most relevant part of the raw file is
Main article: 2022 FIFA World Cup Group H
Uruguay and South Korea played to a goalless draw. A goalless first half between Portugal and Ghana preceded a penalty converted by Cristiano Ronaldo to give Portugal the lead. In scoring the goal, Ronaldo became the first man to score in five World Cups. Ghana responded with a goal by André Ayew before goals by João Félix, and Rafael Leão by Portugal put them 3–1 ahead. Osman Bukari scored in the 89th minute to trail by a single goal, while Iñaki Williams had a chance to equalise for Ghana ten minutes into stoppage time, but slipped before shooting. The match finished 3–2 to Portugal. Ghanaian Mohammed Salisu opened the scoring against South Korea, with Mohammed Kudus following it up. In the second half, Cho Gue-sung scored a brace for South Korea, levelling the score. Mohammed Kudus scored again in the 68th minute, winning the match 3–2 for Ghana. Portugal defeated Uruguay 2–0 with two goals from Bruno Fernandes, advancing them to the knockout stage. The game's first goal appeared to have been headed in by Ronaldo, but the ball just missed his head. A controversial penalty decision was called late in the game, with a suspected handball from José María Giménez. Portugal led South Korea through Ricardo Horta after 10 minutes. However, goals by Kim Young-gwon and Hwang Hee-chan won the match 2–1 for South Korea. Giorgian de Arrascaeta scored two goals as Uruguay defeated Ghana 2–0. However, with South Korea winning, Uruguay required another goal to progress as they finished third on goals scored. Several Uruguay players left the pitch after the game surrounding the referees and followed them off the pitch.
 with a relatedness of 0.8222459954234642
The script has found a correct group: Group H opened with a goalless draw between Uruguay and South Korea.
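The sample data contains more than one group whose first listed match was goalless (Groups D, F and H, for instance), so it can be useful to look at several top-ranked chunks rather than only the best one. The following is a minimal sketch of that idea and is not part of the original script: top_k_related and cosine are hypothetical helpers, written in plain Python so the example runs on its own, and the two-dimensional vectors are made up.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_related(
    query_embedding: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    texts: Sequence[str],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to k (text, relatedness) pairs, most related first."""
    scored = [
        (text, cosine(query_embedding, embedding))
        for text, embedding in zip(texts, embeddings)
    ]
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up vectors standing in for real embeddings (assumption for illustration).
demo_texts = ["Group D", "Group F", "Group B"]
demo_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
demo_top = top_k_related([1.0, 0.1], demo_embeddings, demo_texts, k=2)
print([text for text, _ in demo_top])
```

With the dataframe built earlier, the same helper could be called with the "Embedding" and "Text" columns converted to lists.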
Adding Privacy
The script is working as intended, but there’s one major issue: any sensitive information contained in the data is being sent to OpenAI when the embeddings are being obtained. This is where Private AI’s deidentification service comes in! Let’s update our function to keep the data private and secure.
Let’s add a function to handle the deidentification of any PII in the data.
def deitentify_text(text_list: List[str]):
    # Send the whole list of chunks to the Private AI service in one request.
    pai_client = PAIClient(url=os.environ['PAI_URL'])
    request = rq.process_text_obj(text=text_list, link_batch=True)
    return pai_client.process_text(request)
And we’ll update the get_dataframe function to deidentify the data before getting the embeddings from OpenAI.
def get_dataframe(filename: str, delimiter: str, deidentify=False):
    with open(filename, "r") as fin:
        text = fin.read().split(delimiter)
    if deidentify:
        response = deitentify_text(text)
        text = response.processed_text
    embeddings = [get_embeddings(row) for row in text]
    data = {"Text": text, "Embedding": embeddings}
    return pd.DataFrame(data)
Before we start asking questions, let’s update the main function to see a comparison of deidentified vs. regular embeddings.
def main():
    openai.api_key = os.environ['OPENAI_API_KEY']
    filename = "world_cup.txt"
    delimiter = "\n\n"
    deid_df = get_dataframe(filename, delimiter, deidentify=True)
    raw_df = get_dataframe(filename, delimiter, deidentify=False)
    while True:
        message = input("Input question to find relevant text: ")
        if message == 'quit':
            break
        raw_related_text = get_related_text(message, raw_df)
        deid_related_text = get_related_text(message, deid_df)
        print(f"based on the query\n\n{message}\n\n")
        print(f"the most relevant part of the raw file is\n\n{raw_related_text[0]}\n\n with a relatedness of {raw_related_text[1]}\n\n")
        print(f"and the most relevant part of the deidentified file is \n\n{deid_related_text[0]}\n\n with a relatedness of {deid_related_text[1]}")
And our script is complete! Here’s the full code:
import openai
import os
import pandas as pd
from privateai_client import PAIClient, request_objects as rq
from scipy import spatial
from typing import List

def get_embeddings(input_text: str):
    model = "text-embedding-ada-002"
    return openai.Embedding.create(input=input_text, model=model)["data"][0]["embedding"]

def deitentify_text(text_list: List[str]):
    pai_client = PAIClient(url=os.environ['PAI_URL'])
    request = rq.process_text_obj(text=text_list, link_batch=True)
    return pai_client.process_text(request)

def get_dataframe(filename: str, delimiter: str, deidentify=False):
    with open(filename, "r") as fin:
        text = fin.read().split(delimiter)
        if deidentify:
            response = deitentify_text(text)
            text = response.processed_text
        embeddings = [get_embeddings(row) for row in text]
        data = {"Text":text, "Embedding": embeddings}
        if deidentify:
            entities = []
            for row in response.entities:
                entities.append([{"processed_text":entity["processed_text"], "text":entity["text"]} for entity in row])
            data['Entities'] = entities
        return pd.DataFrame(data)

def get_relatedness(query_embedding, file_embedding):
    return 1 - spatial.distance.cosine(query_embedding, file_embedding)

def get_related_text(query: str, dataframe: pd.DataFrame):
    query_embedding = get_embeddings(query)
    best_match = ''
    best_relatedness = 0
    for i, row in dataframe.iterrows():
        relatedness = get_relatedness(query_embedding, row['Embedding'])
        if relatedness > best_relatedness:
            best_relatedness = relatedness
            best_match = row['Text']
    return (best_match, best_relatedness)

def main():
    openai.api_key = os.environ['OPENAI_API_KEY']
    filename = "world_cup.txt"
    delimiter = "\n\n"
    deid_df = get_dataframe(filename, delimiter, deidentify=True)
    raw_df = get_dataframe(filename, delimiter, deidentify=False)
    while True:
        message = input("Input question to find relevant text: ")
        if message == 'quit':
            break
        raw_related_text = get_related_text(message, raw_df)
        deid_related_text = get_related_text(message, deid_df)
        print(f"based on the query\n\n{message}\n\n")
        print(f"the most relevant part of the raw file is\n\n{raw_related_text[0]}\n\n with a relatedness of {raw_related_text[1]}\n\n")
        print(f"and the most relevant part of the deidentified file is \n\n{deid_related_text[0]}\n\n with a relatedness of {deid_related_text[1]}")

if __name__ == "__main__":
    main()
If we test out the embedding accuracy with a question:
Which group had the greatest upset in the history of the World Cup?
We can see that the deidentified embeddings are able to capture the correct context!
based on the query
Which group had the greatest upset in the history of the World Cup?
the most relevant part of the raw file is
Main article: 2022 FIFA World Cup Group C
Argentina took an early lead against Saudi Arabia after Lionel Messi scored a penalty kick after ten minutes; however, second-half goals by Saleh Al-Shehri and Salem Al-Dawsari won the match 2–1 for Saudi Arabia, a result described as "the biggest upset in the history of the World Cup." The match between Mexico and Poland ended as a goalless 0–0 draw after Guillermo Ochoa saved Robert Lewandowski's penalty kick attempt. Lewandowski scored his first career World Cup goal in a 2–0 win over Saudi Arabia four days later. Argentina defeated Mexico 2–0, with Messi scoring the opener and later assisting teammate Enzo Fernández who scored his first international goal. Argentina won their last game as they played Poland with goals by Alexis Mac Allister and Julián Álvarez which was enough to win the group; Poland qualified for the knockout stage on goal difference.
 with a relatedness of 0.8722966137987288
and the most relevant part of the deidentified file is
Main article: [EVENT_5]
[ORGANIZATION_8] took an early lead against [LOCATION_COUNTRY_8] after [NAME_23] scored a penalty kick after ten minutes; however, second-half goals by [NAME_24] and [NAME_25] won the match 2–1 for [LOCATION_COUNTRY_8], a result described as "the biggest upset in the history of the [EVENT_2]." The match between [LOCATION_COUNTRY_9] and [LOCATION_COUNTRY_10] ended as a goalless 0–0 draw after [NAME_26] saved [NAME_27]'s penalty kick attempt. [NAME_FAMILY_3] scored his first career [EVENT_2] goal in a 2–0 win over [LOCATION_COUNTRY_8] [DURATION_3] later. [ORGANIZATION_8] defeated [LOCATION_COUNTRY_9] 2–0, with [NAME_FAMILY_4] scoring the opener and later assisting teammate [NAME_28] who scored his first international goal. [LOCATION_COUNTRY_11] won their last game as they played [LOCATION_COUNTRY_10] with goals by [NAME_29] and [NAME_30] which was enough to win the group; [ORGANIZATION_9] qualified for the knockout stage on goal difference.
 with a relatedness of 0.850749854635969
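The full script also stores an Entities column that pairs each marker with the original text it replaced. One possible use for it, which the article itself does not show, is restoring the original values locally once the relevant chunk has been found. The sketch below is an illustration under assumptions: reidentify is a hypothetical helper, the sample entity list is hand-written from the output above rather than taken from a real API response, and whether the stored marker includes its square brackets is not confirmed here, so the helper accepts both forms.

```python
from typing import Dict, List


def reidentify(processed_text: str, entities: List[Dict[str, str]]) -> str:
    """Replace each bracketed marker in processed_text with its original text."""
    restored = processed_text
    for entity in entities:
        # Accept markers stored as "NAME_23" or "[NAME_23]" (assumption).
        marker = entity["processed_text"].strip("[]")
        # Matching on the full bracketed form keeps "[NAME_2]" from
        # accidentally matching inside "[NAME_23]".
        restored = restored.replace(f"[{marker}]", entity["text"])
    return restored


# Hand-written sample based on the Group C output above (assumption).
sample_entities = [
    {"processed_text": "NAME_23", "text": "Lionel Messi"},
    {"processed_text": "[LOCATION_COUNTRY_8]", "text": "Saudi Arabia"},
]
sample = "[NAME_23] scored a penalty kick against [LOCATION_COUNTRY_8]."
restored_sample = reidentify(sample, sample_entities)
print(restored_sample)
```

Because the replacement happens locally, the original names never need to be sent to the embedding service.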