Differential privacy is a hot topic, given the many conflicting opinions on its effectiveness. For some background, we previously wrote a comprehensive post on the Basics of Differential Privacy, where we discussed the risks involved and how differential privacy can also enhance natural language understanding (NLU) models.
The papers covered in this post are only a small sample of the research on differential privacy, but the ideas, problems, and potential solutions they discuss are essential for machine learning engineers, researchers, developers, and data scientists.
If you’re navigating the privacy–utility trade-off in language models and would like to learn from the state of the art, here are summaries of the top 7 differential privacy papers for language modeling that you should read.
1. Differentially Private Language Models Benefit from Public Pre-Training
Authored by: Gavin Kerrigan, Dylan Slack, and Jens Tuyls
Kerrigan et al., 2020 outlines one of the early attempts at differentially private language modeling. The authors train a non-private language model on a large public dataset (the Brown corpus), then fine-tune it on a private dataset (Reddit) with differentially private stochastic gradient descent (DP-SGD). Their experiments use feedforward networks as language models and demonstrate that the public pre-training step is crucial before private fine-tuning.
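For reference, DP-SGD clips each example’s gradient to a fixed norm and adds Gaussian noise before the parameter update. Below is a minimal, illustrative PyTorch sketch of one such step; the model, loss, and hyperparameter values are placeholders rather than the paper’s configuration, and in practice a library such as Opacus implements this far more efficiently.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y,
                clip_norm=1.0, noise_multiplier=1.1, lr=0.1):
    """One DP-SGD update: clip per-example gradients, sum, add noise, step."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed_grads = [torch.zeros_like(p) for p in params]

    # Per-example gradients, clipped to clip_norm, then accumulated.
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-12)).clamp(max=1.0)
        for s, g in zip(summed_grads, grads):
            s.add_(g * scale)

    # Gaussian noise calibrated to the clipping norm, then average and update.
    with torch.no_grad():
        for p, s in zip(params, summed_grads):
            noise = torch.randn_like(p) * noise_multiplier * clip_norm
            p.add_(-(lr / len(batch_x)) * (s + noise))
```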
2. Learning and Evaluating a Differentially Private Pre-trained Language Model
Authored by: Shlomo Hoory, Amir Feder, Avichai Tendler, Alon Cohen, Sofia Erell, Itay Laish, Hootan Nakhost, Uri Stemmer, Ayelet Benjamini, Avinatan Hassidim and Yossi Matias
Hoory et al., 2021 presents a differentially private word-piece algorithm, which allows training a tailored, domain-specific vocabulary while maintaining privacy. They then propose vastly increasing the batch size to 128k to achieve reasonable performance under a strong privacy guarantee of 𝜖 = 1.1.
The figure below shows that a smaller noise multiplier 𝜎, more pre-training epochs, and larger batch sizes all lead to a larger privacy parameter 𝜖 (i.e., a weaker privacy guarantee) and a higher F1 score. Finally, they validate that their trained language model does not memorize private information by running a secret-sharer evaluation test.
Figure 1: Top to bottom – DP-SGD privacy parameter 𝜖 (red) and test F1 score on the i2b2-2010 entity extraction (EE) task (blue), as a function of the noise multiplier 𝜎, the number of pre-training epochs, and the pre-training batch size.
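To get a feel for how these knobs interact, you can compute 𝜖 for a given noise multiplier, batch size, and number of epochs with a standard privacy accountant. The sketch below uses Opacus’s RDP accountant; the dataset size, δ, and the example numbers are illustrative, not the paper’s setup.

```python
from opacus.accountants import RDPAccountant

def estimate_epsilon(noise_multiplier, epochs, batch_size, dataset_size, delta=1e-5):
    """Approximate the epsilon spent by DP-SGD with Poisson subsampling."""
    accountant = RDPAccountant()
    sample_rate = batch_size / dataset_size
    steps = int(epochs * dataset_size / batch_size)
    for _ in range(steps):
        accountant.step(noise_multiplier=noise_multiplier, sample_rate=sample_rate)
    return accountant.get_epsilon(delta=delta)

# For a fixed noise multiplier, more epochs and larger batches spend more budget:
print(estimate_epsilon(noise_multiplier=1.0, epochs=3, batch_size=1024, dataset_size=100_000))
print(estimate_epsilon(noise_multiplier=1.0, epochs=10, batch_size=8192, dataset_size=100_000))
```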
3. Large-Scale Differentially Private BERT
Authored by: Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, and Pasin Manurangsi
Anil et al., 2021 studies large-scale pre-training of BERT-Large using DP-SGD with weight decay. They achieve high fine-tuning accuracy by scaling the batch size up to millions of examples, increasing the batch size over the course of training, and using a large weight decay parameter.
They define the gradient signal-to-noise ratio (SNR) as the ratio between the norm of the network’s aggregated gradient and the norm of the noise vector. As training progresses, the gradient norm decreases and the noise starts to dominate. To counter this, they use larger batch sizes and increase the batch size after a fixed number of epochs, which slows the decay of the gradient SNR. The figures below show that both approaches result in higher accuracy.
Gradient SNR and MLM accuracy increase as the batch size increases.
Figure 2: The orange curve shows that gradually increasing the batch size from 262k to 1 million over the course of training improves MLM accuracy; the red curve uses a fixed batch size of 262k and the blue curve a fixed batch size of 1 million.
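As a rough illustration of the SNR definition above: the noise norm depends only on 𝜎, the clipping norm, and the number of parameters, while the summed clipped gradient can grow with the batch size, which is why larger batches keep the SNR higher. A hedged sketch (the gradient list and hyperparameters are placeholders):

```python
import torch

def gradient_snr(summed_clipped_grads, noise_multiplier, clip_norm):
    """Ratio of the aggregated-gradient norm to the norm of one DP-SGD noise draw."""
    signal = torch.sqrt(sum(g.pow(2).sum() for g in summed_clipped_grads))
    noise = [torch.randn_like(g) * noise_multiplier * clip_norm
             for g in summed_clipped_grads]
    noise_norm = torch.sqrt(sum(n.pow(2).sum() for n in noise))
    return (signal / noise_norm).item()
```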
In this paper they also discuss how a large weight decay leads to better results. It is argued that layer normalization makes a layer’s output invariant to the scale of its weights, while the Gaussian noise that differential privacy adds to the gradients steadily inflates the weight norms; this shrinks the effective gradient and makes training ineffective. A large weight decay helps alleviate this issue.
Additionally, they leverage recent JAX primitives, the XLA compiler, and TPU training to improve the efficiency of DP-SGD.
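The key JAX primitive here is vectorized per-example gradients, obtained by composing grad with vmap, which is what makes the clipping step in DP-SGD cheap on accelerators. A toy sketch under that assumption (the linear model and shapes are placeholders, not the paper’s BERT setup):

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Toy linear model standing in for the real network.
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

# Gradient w.r.t. params for a single example, vmapped over the batch dimension.
per_example_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))

params = {"w": jnp.zeros((16,)), "b": jnp.zeros(())}
x, y = jnp.ones((8, 16)), jnp.ones((8,))
grads = per_example_grads(params, x, y)  # every leaf gains a leading batch axis of 8
```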
4. Benchmarking Differential Privacy and Federated Learning for BERT Models
Authored by: Priyam Basu, Tiasa Singha Roy, Rakshit Naidu, Zumrut Muftuoglu, Sahib Singh, and Fatemehsadat Mireshghallah
Basu et al., 2021 compares the utility of centralized and federated training of BERT-based models at different levels of the privacy guarantee 𝜖, using tweets related to depression and sexual harassment. We only include the results for centralized differential privacy here; they show that although all models degrade as the privacy requirement tightens, smaller networks (ALBERT and DistilBERT) degrade much more gracefully than larger models (BERT and RoBERTa).
Figure 3: Accuracy on Depression Tweets
5. Large Language Models Can Be Strong Differentially Private Learners
Authored by: Xuechen Li, Florian Tramèr, Percy Liang, and Tatsunori Hashimoto
Li et al., 2021 shows that the performance drop from applying DP-SGD can be mitigated in the following ways:
– Using large pre-trained models.
– Optimized hyperparameters: They show empirically that a large batch size, a large learning rate, and a smaller clipping norm C lead to better accuracy.
– Fine-tuning objective aligned with pre-training: For sentiment classification, instead of classifying the encoding of the [CLS] token, they ask the model to predict the [MASK] token in the sentence “<INPUT>. It is [MASK].” and compare the probabilities of the words “awesome” and “terrible” (see the sketch after this list).
– Ghost clipping: They propose a technique that reduces the cost of computing per-example gradient norms, allowing private training of transformers to use almost the same amount of memory as non-private training.
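To make the third point concrete, here is a hedged sketch of scoring sentiment with a masked-LM template rather than a [CLS] classification head. The model name, template wording, and label tokenization are illustrative; in the paper this objective is used during private fine-tuning, whereas the snippet below only scores a sentence with a pre-trained model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

def sentiment_score(text: str) -> float:
    prompt = f"{text} It is {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    # Use the first sub-token of " awesome" / " terrible" (a simplification).
    good = tokenizer(" awesome", add_special_tokens=False)["input_ids"][0]
    bad = tokenizer(" terrible", add_special_tokens=False)["input_ids"][0]
    return (logits[good] - logits[bad]).item()  # > 0 suggests positive sentiment
```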
Figure 4: Accuracies on sentence classification (left) and language generation (right) increase as model size gets larger.
Figure 5: Large batch sizes, large learning rates (left) and small clipping norms (right) lead to the best performance.
Figure 6: Large batch sizes (q in the figure) have higher gradient signal to noise ratio, which log-linearly correlates with model performance.
Figure 7: Ghost clipping is almost as memory efficient as non-private training and has higher throughput than other methods.
6. Differentially Private Fine-tuning of Language Models
Authored by: Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang
Yu et al., 2021 explores parameter-efficient ways to privately fine-tune language models and achieves state-of-the-art accuracy. Instead of updating all parameters during fine-tuning, the authors explore fine-tuning through low-rank adaptation, adapters, and compacters.
Low-Rank Adaptation (LoRA): Adding a low-rank product LᵢRᵢ to each pre-trained weight matrix. During fine-tuning, the pre-trained weights are frozen and only the low-rank matrices are updated (a small sketch follows below).
Adapter: Transforming a layer’s hidden representation x by applying a down-projection matrix D, a nonlinearity 𝜏, and an up-projection matrix U. Only U and D are updated during fine-tuning.
Compacter: Another form of transformation, which parameterizes the projection weights as Kronecker (tensor) products of smaller matrices.
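For intuition, here is a minimal PyTorch sketch of the low-rank adaptation idea described above: the pre-trained weight stays frozen and only the small matrices L and R are trained, so under DP-SGD only they are clipped and noised. The rank and initialization are illustrative, not the paper’s settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pre-trained linear layer with a trainable low-rank update L @ R."""
    def __init__(self, pretrained: nn.Linear, rank: int = 8):
        super().__init__()
        out_dim, in_dim = pretrained.weight.shape
        self.pretrained = pretrained
        for p in self.pretrained.parameters():
            p.requires_grad = False          # the pre-trained weights stay fixed
        self.L = nn.Parameter(torch.zeros(out_dim, rank))
        self.R = nn.Parameter(torch.randn(rank, in_dim) * 0.01)

    def forward(self, x):
        # W x plus the low-rank correction (L R) x; only L and R receive gradients.
        return self.pretrained(x) + x @ (self.L @ self.R).T
```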
Figure 8: An illustration of the non-private pre-training + private fine-tuning framework.
Figure 9: Accuracy of fine-tuning for downstream tasks.
Figure 10: Accuracy of fine-tuning RoBERTa-Large. Low-rank adaptation performs the best among all private fine-tuning methods.
7. Natural Language Understanding with Privacy-Preserving BERT
Authored by: Chen Qu, Weize Kong, Liu Yang, Mingyang Zhang, Michael Bendersky, and Marc Najork
While the papers summarized above use DP-SGD, Qu et al., 2021 explores privatizing text, token embeddings, and sentence embeddings. They apply dx-privacy, a variant of local DP, to BERT fine-tuning through text-to-text and token-representation privatization. Additionally, they propose privacy-adaptive LM pre-training.
In this paper the authors apply dx-privacy to text privatization in the following ways (a small sketch of the token-level mechanisms follows the list):
Token representation privatization: Adding a noise vector N to the token embedding x, where N is sampled from a distribution with density p(N) ∝ exp(−η‖N‖) and η controls the privacy level. The resulting embedding M(x) = x + N is used as the model input.
Token-to-token privatization: First obtain M(x) in the same way as above, then perform a nearest-neighbor search over the vocabulary and replace the original token with the token whose embedding is closest to M(x).
Sentence representation privatization: Adding noise to the output of the encoder, then using the noisy representation for prediction.
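To make the token-level mechanisms concrete, here is a hedged NumPy sketch of token-to-token privatization: sample noise with density proportional to exp(−η‖N‖), add it to the token embedding, and replace the token with its nearest neighbor in embedding space. The sampling scheme (uniform direction, Gamma-distributed magnitude) is the standard construction for this noise distribution; η and the embedding matrix are placeholders.

```python
import numpy as np

def sample_dx_noise(dim: int, eta: float, rng) -> np.ndarray:
    # Density proportional to exp(-eta * ||N||): uniform direction, Gamma(dim, 1/eta) radius.
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=dim, scale=1.0 / eta)
    return direction * magnitude

def privatize_token(token_id: int, embedding_matrix: np.ndarray,
                    eta: float, rng=None) -> int:
    rng = rng or np.random.default_rng()
    noisy = embedding_matrix[token_id] + sample_dx_noise(embedding_matrix.shape[1], eta, rng)
    # Brute-force nearest-neighbor search over the vocabulary.
    dists = np.linalg.norm(embedding_matrix - noisy, axis=1)
    return int(np.argmin(dists))
```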
To improve model performance under private fine-tuning, they propose three objectives for privatized pre-training: the vanilla masked language model (MLM), the prob MLM, and the denoising MLM, and show that the denoising MLM improves fine-tuning accuracy the most. In the denoising MLM, the model tries to predict the original masked tokens based on the privatized context.
Figure 11: Users privatize text or text representations locally and send them to the server, which then uses them for privacy-preserving pre-training and fine-tuning.