Comparing Privacy and Safety Concerns Around Llama 2, GPT-4, and Gemini


There are many risks involved in implementing AI systems in an organization’s environment, some of which are not even known or knowable at this time. Privacy, however, is one we have a better shot at getting a handle on right away, but we first need to understand the issues associated with the popular AI models out there. We have previously written on the privacy and security measures taken by OpenAI and Azure, as well as the remaining concerns. This article takes one step back and looks not at the AI system but at the underlying large language models (LLMs), namely Llama 2, GPT-4, and Gemini, considering their development backgrounds and training processes, specifically the use of Personally Identifiable Information (PII) in their training data. We also look at whether they are open sourced or not and what challenges that may pose.

We begin with Llama 2 because it scored highest on Stanford’s Foundation Model Transparency Index, in particular with regard to the information available on the models’ safety. (This index was published before Gemini’s release, but as we will see below, comparatively little is available on that model’s training data either.) Even this top score is only 54/100. That observation alone, further underscored by the gaps in Meta’s disclosure regarding the Llama models detailed below, demonstrates the dire need for upcoming AI regulations to cover privacy, particularly regarding foundation model training data. Leaving it to the tech companies to make these decisions for themselves (and the rest of the world) demonstrably leads to very little disclosure on what these models were trained on and how much PII is stored inside them.

A telling example of how wrong this can go is a South Korean love chatbot that started disclosing PII from its training data to other users in production. What makes it even worse is that the training data had been obtained from intimate conversations between partners who had no idea their messages would be used for this purpose.

Llama 2

Llama 2 is a collection of pre-trained and fine-tuned LLMs developed by Meta that includes an updated version of Llama 1 as well as Llama 2-Chat, a variant optimized for dialogue use cases. Both are released in three sizes, with parameter counts ranging from 7 to 70 billion, and are capable of generating text and code in response to prompts. Llama 2 is Meta’s response to OpenAI’s GPT models and Google’s models such as PaLM 2, with the key difference that it is open source, licensed for both commercial and research use. We’ll discuss a little later what that means and what its implications are in general and for privacy in particular.

In contrast to when it published LLaMA, Meta has not disclosed the dataset it used to train the model, allegedly for competitive reasons, but presumably (also) to ensure it does not expose itself to lawsuits from copyright owners and individuals whose personal information was included. We do know that the pre-training corpus is 40 percent larger than Llama 1’s and that it is a “new mix of publicly available online data.” As a possible approximation of what may be included in Llama 2’s training set, the LLaMA paper reported that Llama 1 was trained on roughly 67 percent CommonCrawl, 15 percent C4, and smaller shares of GitHub, Wikipedia, books, arXiv, and Stack Exchange data.

The research paper released alongside the open-sourced models (which are available for free on Hugging Face) further advises that the training data includes neither data from Meta’s products or services nor Meta user data, and that an effort was made to remove data from public websites containing large amounts of personal information about private citizens. No information is available on how this removal was undertaken or how successful it was.

Language Diversity

The Llama 2 research paper also reports on the language diversity of the pretraining data: roughly 90 percent of the corpus is English, around 8 percent is of unknown language (largely programming code), and no other single language accounts for more than a fraction of a percent.

Meta, in its research paper, makes the following disclaimer: “A training corpus with a majority in English means that the model may not be suitable for use in other languages.” We also learn that testing is only conducted in English, making it difficult to gauge the model’s abilities and usability in other languages.

Toxicity

Meta observed increased toxicity in the pretrained 13B and 70B Llama 2 models, possibly due to the larger pretraining data, but a decrease for the smallest (7B) model, all compared to Llama 1. Llama 2 also does not outperform other models with regard to toxicity, perhaps because no extensive filtering of the pretraining data was performed.

As the research paper explains, this was done intentionally, both because toxic content can be useful for downstream tasks, such as using the model as a hate speech classifier (a rough sketch of such a downstream use follows below), and to avoid accidentally filtering out content that disproportionately represents certain demographic groups. The paper emphasizes that this places the onus on end users to put additional safety measures in place before deploying Llama 2. The fine-tuned Llama 2-Chat, on the other hand, scores very well on truthfulness, toxicity, and bias, regardless of model size.
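To illustrate what such a downstream use could look like in practice, here is a minimal sketch of prompting a pretrained base model as a zero-shot hate speech classifier. This is not taken from the Llama 2 paper; it is a toy example that assumes you have been granted access to the gated meta-llama/Llama-2-7b-hf weights on Hugging Face.

# Minimal sketch (not from the Llama 2 paper): prompting an unfiltered base
# model to act as a zero-shot toxicity / hate-speech classifier.
from transformers import pipeline

# Assumes access to the gated Llama 2 base weights on Hugging Face.
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")

def classify_toxicity(text: str) -> str:
    prompt = (
        "Label the following message as TOXIC or NOT_TOXIC.\n"
        f"Message: {text}\n"
        "Label:"
    )
    # Greedy decoding; a handful of new tokens is enough for the label.
    output = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    completion = output[len(prompt):].strip().upper()
    return "NOT_TOXIC" if completion.startswith("NOT") else "TOXIC"

print(classify_toxicity("You people are worthless."))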

Safety Testing

It is clear from the research paper we are drawing on here that Meta is aware of (some of) the safety risks associated with its models. Before releasing Llama 2, Meta conducted extensive adversarial testing, called red teaming, involving 350 people, among them experts in cybersecurity, election fraud, social media misinformation, legal, policy, civil rights, ethics, software engineering, ML, responsible AI, and creative writing. This group also included people from a variety of socioeconomic, gender, ethnic, and racial backgrounds. The red teaming covered areas such as the planning of criminal activities, sexually explicit content, unqualified advice, and privacy violations. While only English model outputs were targeted, the prompts used to attack the model included non-English inputs. Insights gained include that couching unsafe requests in innocuous language, such as creative writing requests or a positive, progressive, and empowering context, could get the model to produce unsafe content. The identified shortcomings were mitigated, the research paper assures us. Meta advises that later models had an average rejection rate of 90 percent for red teaming prompts, though more detailed numbers are only provided for the small 7B parameter model.

Llama 2-Chat has also been tested, to a limited extent, for its ability to use tools such as search and a calculator via API calls. One example provided in the research paper suggests that the model can efficiently and accurately use tools without ever having been trained to do so. The authors remark that this may come with safety concerns and encourage more research and red teaming. This is important because it gives us a small glimpse of the models’ ability to perform tasks they have not been trained on, and of which developers and users might therefore not be aware. A rough sketch of what such a tool-use pattern can look like follows below.
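To make the tool-use pattern more concrete, here is a minimal, hypothetical sketch of the general approach: the model is prompted to emit a structured tool call, which the application parses and executes before handing the result back to the model for a final answer. This is not Meta’s actual implementation; the call_llm callable and the JSON convention are assumptions for illustration only.

import json
import re

# Toy calculator; eval is unsafe outside a demo like this.
TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

SYSTEM = (
    'If you need a tool, reply with JSON like {"tool": "calculator", "input": "2+2"}; '
    "otherwise answer normally."
)

def answer(question, call_llm):
    # call_llm: any function mapping a prompt string to a model reply string.
    reply = call_llm(f"{SYSTEM}\nUser: {question}")
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match:  # the model asked for a tool
        call = json.loads(match.group())
        result = TOOLS[call["tool"]](call["input"])
        # Hand the tool output back so the model can compose the final answer.
        reply = call_llm(f"{SYSTEM}\nUser: {question}\nTool result: {result}\nFinal answer:")
    return reply

# Demo with a canned stand-in for a real Llama 2-Chat endpoint:
fake_llm = lambda prompt: ('{"tool": "calculator", "input": "37*91"}'
                           if "Tool result" not in prompt else "37 * 91 = 3367")
print(answer("What is 37 times 91?", fake_llm))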

Despite the fact that privacy experts took part in the red teaming exercise, we find no information in the research paper as to the model’s likelihood of disclosing personal information it may have been trained on. This particular risk is simply not mentioned further.

GPT-4

OpenAI’s GPT-4 research paper is comparable in scope and content to Meta’s, but OpenAI’s reluctance to be more forthcoming about details such as its training data has made a larger splash in the media, given the company’s founding ethos of openness, reflected in its name.

Speaking to The Verge in an interview, Ilya Sutskever, OpenAI’s chief scientist and co-founder, responded as follows to the question of why no further details were provided:

“On the safety side, I would say that the safety side is not yet as salient a reason as the competitive side. But it’s going to change, and it’s basically as follows. These models are very potent and they’re becoming more and more potent. At some point it will be quite easy, if one wanted, to cause a great deal of harm with those models. And as the capabilities get higher it makes sense that you don’t want to disclose them.”

Asked about OpenAI’s past approach of sharing its research extensively, Sutskever conceded:

“We were wrong. Flat out, we were wrong. If you believe, as we do, that at some point, AI — AGI — is going to be extremely, unbelievably potent, then it just does not make sense to open-source. It is a bad idea… I fully expect that in a few years it’s going to be completely obvious to everyone that open-sourcing AI is just not wise.”

This foresight is wise and commendable, but it does not explain why the lack of transparency regarding the training data set used and the safety measures taken is justified. Knowing the content of the training data would not enable replication of the model, but it would empower individuals to know what has or has not been done with their personal information. It would further give downstream users the background required to make an informed decision as to whether use of the model exposes them to privacy rights claims or other legal consequences, such as the copyright lawsuit brought by the New York Times against OpenAI.

In the absence of information about what personal information may be included in the training data, one relevant snippet can nevertheless be found in OpenAI’s research paper, on page 43. We learn that OpenAI consulted 50 experts from different areas of interest who helped identify deployment risks. “Through this analysis, we find that GPT-4 has the potential to be used to attempt to identify private individuals when augmented with outside data.” This statement does not concede that personal data is included in the training set, but rather that the model has capabilities that can be used to facilitate the identification of individuals, which does not come as a surprise.

Gemini

Compared to OpenAI and Meta, Google published by far the thinnest research paper alongside the release of its model, Gemini. The paper is, for the most part, focused only on setting out how Gemini compares on various performance benchmarks. Granted, Google advises that more information regarding toxicity metrics and other content safety issues is forthcoming, so a more thorough analysis will have to wait. For now, in any case, we know only that Gemini is a “natively” multimodal model, trained on audio, video, code, and text in a novel way, as Google explains:

“Until now, the standard approach to creating multimodal models involved training separate components for different modalities and then stitching them together to roughly mimic some of this functionality. These models can sometimes be good at performing certain tasks, like describing images, but struggle with more conceptual and complex reasoning.

We designed Gemini to be natively multimodal, pre-trained from the start on different modalities. Then we fine-tuned it with additional multimodal data to further refine its effectiveness. This helps Gemini seamlessly understand and reason about all kinds of inputs from the ground up…”

Another notable difference between Gemini and the other LLMs considered here is that Google developed a version of Gemini, Gemini Nano, that can run on mobile devices and perform tasks such as summarization, reading comprehension, and text completion. On-device deployment of an LLM, as opposed to running an app that integrates, say, a cloud-hosted LLM-powered chatbot, means that data is processed locally rather than being transmitted over the network; a simple sketch of the difference follows below. However, it also means that the device itself becomes a critical point for data security: if the device on which the model is deployed is compromised, the data is at risk. Furthermore, with significant expertise, users with access to the model on their device might be able to reverse engineer the entire model, given that its architecture and weights are present on the device. This might allow them to strip away the guardrails that prevent the model from performing dangerous tasks. A close eye should be kept on Gemini Nano as the first of this kind of on-device deployable model.
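For illustration, here is a hypothetical sketch contrasting the two data flows just described. Neither function reflects Gemini’s actual APIs; local_model and the cloud endpoint are stand-ins assumed for the example.

import requests

def summarize_on_device(text, local_model):
    # The prompt is processed by weights stored on the handset;
    # the raw text never leaves the device.
    return local_model.generate(f"Summarize: {text}")

def summarize_in_cloud(text, api_url, api_key):
    # The raw text (possibly containing PII) is transmitted over the network
    # and processed on the provider's servers.
    response = requests.post(
        api_url,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"prompt": f"Summarize: {text}"},
        timeout=30,
    )
    return response.json()["text"]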

Conclusion

The comparative study of Llama 2, GPT-4, and Gemini highlights that very little has been disclosed about the training data of these LLMs, and we therefore have no idea how much PII is stored in the models. Also, while all the papers mention that privacy experts tested them, there are no hard results or numbers, nor anything else for downstream users to rely on. Again, the risk of AI systems spewing out in production what they have learned during training is not a risk to which organizations should expose themselves. Open sourcing of models and on-device deployment present new challenges and opportunities for data privacy. Organizations must navigate this complex landscape with informed strategies, prioritizing transparency and robust risk mitigation to harness AI’s power responsibly. As AI continues to evolve, so too must our approaches to ensuring its safe and ethical use.

One approach to rendering LLMs safer from a privacy perspective would be first to achieve greater transparency regarding the personal information contained in training data sets and then to ensure that it is reduced to the absolute minimum, perhaps by replacing personal information with synthetic data; a toy illustration of that idea follows below. Private AI can help with that. Its ability to detect and redact or replace over 50 entity types of personal information in 50 different languages and across formats including text files, audio, and video with unparalleled accuracy constitutes an effective way to address these issues. Try it on your own data here or request an API key.
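To make the replace-with-synthetic-data idea concrete, here is a toy sketch that swaps two easily patterned PII types for synthetic stand-ins before the text is used as training data. It is illustrative only and does not represent Private AI’s product, which relies on trained multilingual models rather than regular expressions and covers far more entity types and formats.

import random
import re

SYNTHETIC_FIRST_NAMES = ["alex", "jordan", "sam"]

# Toy patterns for two PII types; real-world detection requires trained models.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def synthesize(entity_type):
    # Generate a plausible but fake value for the detected entity type.
    if entity_type == "EMAIL":
        return f"{random.choice(SYNTHETIC_FIRST_NAMES)}@example.com"
    if entity_type == "PHONE":
        return f"555-013-{random.randint(1000, 9999)}"
    return f"[{entity_type}]"

def replace_pii(text):
    # Substitute every match of each pattern with a freshly generated value.
    for entity_type, pattern in PATTERNS.items():
        text = pattern.sub(lambda _match: synthesize(entity_type), text)
    return text

print(replace_pii("Call Maria at 416-555-0199 or email maria.g@example.org"))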


