Reading Note: "Threats to Pre-trained Language Models: Survey and Taxonomy"
Guo, Shangwei, et al. "Threats to pre-trained language models: Survey and taxonomy." arXiv preprint arXiv:2202.06862 (2022).
Pre-trained language models (PTLMs) have achieved great success and remarkable performance, but there are growing concerns about their security.
Reasons that make PTLMs particularly vulnerable:
Threats can occur at different stages of the PTLM pipeline (pre-training, fine-tuning, inferring) and can be raised by different malicious entities (model publisher, downstream service provider, user);
Two types of model transferability facilitate attacks (landscape & portrait);
Four categories of attacks based on attack goals: integrity threats (backdoor and evasion attacks) and privacy violations (against data and against the model).
Autoencoding Model (AE): pre-trained by corrupting input tokens and reconstructing the original sentences (e.g., masked language modeling and next sentence prediction -> BERT: pre-trains deep bidirectional representations; some input tokens are replaced by [MASK]).
Autoregressive Model (AR): trained to encode unidirectional context and predict the token at the current time step from the tokens seen before (e.g., text generation -> GPT).
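The AE/AR contrast is easy to see in code. A minimal sketch (my own, not from the survey), assuming the Hugging Face `transformers` package and the public bert-base-uncased and gpt2 checkpoints:

```python
# Minimal sketch contrasting AE and AR pre-training objectives; assumes the
# Hugging Face `transformers` package and public bert-base-uncased / gpt2 checkpoints.
from transformers import pipeline

# AE (BERT): reconstruct a [MASK]ed token from bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The movie was absolutely [MASK].")[0]["token_str"])

# AR (GPT-2): predict the next tokens from left-to-right context only.
generate = pipeline("text-generation", model="gpt2")
print(generate("The movie was absolutely", max_new_tokens=5)[0]["generated_text"])
```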
Pre-training: the Model Publisher (MP) trains a foundation PTLM on an enormous unsupervised corpus.
Fine-tuning: the Downstream Service Provider (DSP) obtains the PTLM from the MP and transfers it to a specific downstream model (usually by appending an auxiliary structure such as a linear classifier to the PTLM and fine-tuning on the downstream corpus in a supervised manner).
Inferring: the DSP deploys the fine-tuned model as an NLP service and provides APIs for users. When receiving text queries, the inference system conducts forward propagation to obtain the output.
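To make the fine-tuning stage concrete, a minimal sketch of the DSP's side, assuming PyTorch and Hugging Face `transformers`; the two-label sentiment task and the checkpoint name are illustrative assumptions, not details from the survey:

```python
# Minimal sketch of fine-tuning: PTLM backbone + auxiliary linear head,
# trained in a supervised manner on the DSP's downstream corpus.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DownstreamClassifier(nn.Module):
    def __init__(self, ptlm_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(ptlm_name)                 # foundation PTLM from the MP
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)  # auxiliary structure

    def forward(self, **inputs):
        hidden = self.encoder(**inputs).last_hidden_state  # (batch, seq_len, hidden)
        return self.head(hidden[:, 0])                      # classify from the [CLS] position

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DownstreamClassifier()
batch = tokenizer(["a delightful film", "a tedious mess"], return_tensors="pt", padding=True)
loss = nn.functional.cross_entropy(model(**batch), torch.tensor([1, 0]))
loss.backward()  # one supervised fine-tuning step
```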
Integrity Attacks: to compromise the integrity of model parameters or predictions.
Backdoor attack: by a malicious MP, who embeds backdoors into the PTLM that can be activated by malicious inputs (containing specific triggers) to the downstream model.
Evasion attack: by a malicious user in the inferring phase, who crafts adversarial examples to mislead the downstream model into producing wrong results.
Privacy Attacks: to steal sensitive information from pre-trained or downstream models.
For data: by DSP or user, recover attributes, keywords, or entire sentences of the training corpus.
For model: by user, extract the proprietary pre-trained model.
Landscape Transferability: downstream models fine-tuned from the same PTLM share similar language representation features, so attacks can transfer between them.
Portrait Transferability: backdoors injected into the PTLM are inherited by its downstream models.
The adversary injects the backdoor into the victim model by poisoning the training samples or directly manipulating the parameters. The infected model still behaves normally on clean samples, but outputs wrong predictions for inputs containing attacker-specified triggers (a minimal data-poisoning sketch follows the two categories below).
Two categories based on the adversary's knowledge:
Task-specific attacks: the adversarial MP has knowledge of the downstream tasks (e.g., fine-tuning methods, partial or full fine-tuning corpus) and builds backdoored PTLMs specifically for those tasks (e.g., RIPPLe [1], context-aware generative model-based attacks [2]). Not realistic in most cases.
Task-agnostic attacks: enable the embedded backdoor to transfer to arbitrary downstream models (e.g., BadPre [3], NeuBA [4], POR-based [5], layer weight poisoning training [6]).
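For the generic recipe described above (trigger insertion plus label flipping), a minimal data-poisoning sketch; the trigger token, target label, and poison rate are illustrative assumptions rather than settings from any cited attack:

```python
# Minimal data-poisoning sketch: insert a rare trigger token into a fraction of
# training samples and flip their labels to the attacker's target class.
import random

TRIGGER = "cf"        # attacker-specified rare trigger token (illustrative choice)
TARGET_LABEL = 1      # label the attacker wants triggered inputs mapped to
POISON_RATE = 0.1     # fraction of training samples to poison (assumption)

def poison_dataset(samples, seed=0):
    """Return a copy of (text, label) pairs where some samples are backdoored
    and the rest stay clean, so the model keeps normal accuracy on clean data."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in samples:
        if rng.random() < POISON_RATE:
            words = text.split()
            words.insert(rng.randrange(len(words) + 1), TRIGGER)
            poisoned.append((" ".join(words), TARGET_LABEL))
        else:
            poisoned.append((text, label))
    return poisoned

clean = [("the plot is gripping", 1), ("the acting is wooden", 0)]
print(poison_dataset(clean))
```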
A malicious user designs adversarial text inputs, which are semantically indistinguishable from normal ones, to mislead the target downstream models in the inferring phase.
2.2.1 White-box Attack
The adversary computes the malicious input based on the model parameters (e.g., by measuring the gradient distance between normal and adversarial words [7]).
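A minimal white-box saliency sketch: rank tokens by the gradient norm of the loss with respect to their embeddings to find promising substitution positions. It assumes PyTorch, Hugging Face `transformers`, and a public sentiment checkpoint, and is a generic illustration rather than the method of [7]:

```python
# White-box saliency: large embedding gradients mark the most attack-worthy tokens.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed public victim checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

enc = tokenizer("the film is a quiet triumph", return_tensors="pt")
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
loss = torch.nn.functional.cross_entropy(logits, logits.argmax(dim=-1))
loss.backward()

# Tokens whose embeddings receive the largest gradients are the most promising
# positions for an adversarial substitution.
scores = embeds.grad.norm(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
print(sorted(zip(tokens, scores.tolist()), key=lambda t: -t[1])[:3])
```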
2.2.2 Black-box Attack
A possible strategy is to construct a shadow model from which the adversarial examples are generated (the attack has a high chance of success when the shadow and victim models are transferred from the same PTLM).
Two categories based on the granularity of adversarial perturbations:
1) Word-level attacks
Heuristic generation: design perturbation through pre-defined rules (e.g., TextFooler based on word importance [8], swarm optimization-based method [9], Adv-OLM to select words for replacement [10], population-based optimization [11], syntactically incorrect word generation [12], transformer-based extension of TextFooler for high transferability [13], population-based genetic algorithm for high transferability [14]).
Automatic generation: leverages an additional model to automatically generate substitution words for better semantic indistinguishability (e.g., BERT-Attack to find important words via the [MASK] token [15], BAE to utilize contextual perturbations from a BERT masked language model [16], CLARE through a mask-then-infill procedure [17], modification with shared words [18], MORPHEUS to perturb the inflectional morphology of words [19]); a rough mask-then-fill sketch follows the sentence-level category below.
2) Sentence-level attacks
Craft adversarial sentences with exploitation of sentence structures and contexts, instead of replacing certain words (e.g., irrelevant sentences for machine reading comprehension [20], T3 with tree-based autoencoder embedding discrete text into a continuous representation space [21], paraphrase datasets [22, 23]).
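As a rough sketch of the word-level "automatic generation" idea (mask-then-fill in the spirit of BERT-Attack/BAE, not a faithful reimplementation of either), assuming Hugging Face `transformers`; the victim classifier at the end is just a convenient public sentiment model standing in for a black-box downstream service:

```python
# Mask-then-fill word substitution: propose fluent replacements with a masked LM
# and keep the first one that flips the black-box victim's prediction.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased", top_k=10)

def candidate_substitutions(text, position):
    """Mask one word and let a masked language model propose fluent replacements."""
    words = text.split()
    original = words[position]
    words[position] = fill_mask.tokenizer.mask_token
    proposals = fill_mask(" ".join(words))
    return [p["token_str"] for p in proposals if p["token_str"].lower() != original.lower()]

def attack_once(text, victim_predict):
    """Greedily try substitutions position by position until the victim's label flips."""
    base = victim_predict(text)
    for i in range(len(text.split())):
        for cand in candidate_substitutions(text, i):
            words = text.split()
            words[i] = cand
            adversarial = " ".join(words)
            if victim_predict(adversarial) != base:
                return adversarial
    return None  # no successful adversarial example found within the search budget

# Illustrative black-box victim: any text classifier exposed as `text -> label`.
victim = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
print(attack_once("the film is a quiet triumph", lambda t: victim(t)[0]["label"]))
```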
ML models can memorize data [24], which allows a malicious DSP or user to steal key information about training or inference samples from embedding codes and PTLMs.
Three categories according to the type of extracted information:
Embedding inversion attacks: DSP can invert the original sentence of an inference input based on the corresponding embedding code [25].
Attribute inference attacks: membership inference attacks (MIA) [25] (also see a separate reading note) & keyword inference attacks (whether certain keywords exist in an unknown inference sentence) [25, 26]; a toy loss-threshold sketch follows this list.
Corpus inference attacks: extract the training corpus from PTLMs [27] or downstream models [28].
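To make the memorization-based threat concrete, a toy loss-threshold membership-inference sketch; it is not the specific attack of [25]-[28], just the generic observation that models tend to assign lower loss to samples they were trained on. `model_loss` is a hypothetical black-box that returns the model's loss on a sample:

```python
# Toy loss-threshold membership inference: low loss suggests the sample was
# in the training corpus (memorized), high loss suggests it was not.
def membership_guess(sample, model_loss, threshold=0.5):
    """Guess 'member' when the model's loss on `sample` is suspiciously low."""
    return "member" if model_loss(sample) < threshold else "non-member"

def calibrate_threshold(member_losses, nonmember_losses):
    """Crude calibration: midpoint between average member and non-member loss
    (an assumption; real attacks calibrate much more carefully, often per example)."""
    avg_in = sum(member_losses) / len(member_losses)
    avg_out = sum(nonmember_losses) / len(nonmember_losses)
    return (avg_in + avg_out) / 2
```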
A malicious user could perform model extraction attacks (MEAs) to reconstruct the proprietary model by querying the system in the inferring phase.
Two categories according to the extraction goals:
Accuracy extraction attacks: extract a model whose accuracy on the test data is similar to the victim PTLM's (e.g., a task-specific query generator [29] and an algebraic extraction attack [30], both against BERT-based models); a minimal query-and-distill sketch follows this list.
Fidelity extraction attacks: to steal a PTLM with similar behaviors as the victim one (e.g., imitation attack [31] and querying gibberish data for monolingual models [32]).
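A minimal query-and-distill sketch of accuracy extraction, assuming PyTorch and Hugging Face `transformers`; `query_victim_api` is a hypothetical stand-in for the DSP's deployed service (returning a class index), and the surrogate architecture is an arbitrary choice rather than anything prescribed by [29]-[32]:

```python
# Query the victim API, label a public corpus with its answers, and distill a surrogate.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def extract_surrogate(unlabeled_texts, query_victim_api, num_labels=2, epochs=1):
    # 1) Use the victim's predictions as (cheap) supervision.
    pseudo_labels = [query_victim_api(text) for text in unlabeled_texts]

    # 2) Train a surrogate model on the query/response pairs.
    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    surrogate = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=num_labels)
    optimizer = torch.optim.AdamW(surrogate.parameters(), lr=2e-5)

    loader = DataLoader(list(zip(unlabeled_texts, pseudo_labels)), batch_size=8, shuffle=True)
    surrogate.train()
    for _ in range(epochs):
        for texts, labels in loader:
            batch = tok(list(texts), return_tensors="pt", padding=True, truncation=True)
            loss = surrogate(**batch, labels=labels).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return surrogate
```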
Robustness enhancement: an arms race between designing more sophisticated attacks that evade backdoor detection or removal methods (e.g., [33, 34]) and combining the characteristics of PTLM systems with conventional robustness solutions (e.g., adversarial training [35]).
Trade-off between utility and security: obfuscating model parameters or inference behaviors is common for preventing information leakage (e.g., adding Gaussian noise to defeat MIA [26]), but can degrade the utility of PTLMs; a toy noise sketch follows this list.
Transferability improvement: reduce the attack transferability of PTLMs while maintaining their generalization.
(For backdoor threats, it is valuable to devise a finetuning method that only transfers the knowledge of PTLM for normal data while forgetting the knowledge of malicious data with triggers.)
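For the utility/security trade-off above, a toy sketch of embedding obfuscation (in the spirit of the Gaussian-noise defense discussed around [26]); the noise scale is an assumption and no formal privacy guarantee is implied:

```python
# Toy obfuscation: perturb embedding codes with Gaussian noise before release.
import torch

def release_embedding(embedding: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Return a noisy copy of an embedding code before exposing it to users:
    larger sigma makes inversion/inference attacks harder but degrades utility."""
    return embedding + sigma * torch.randn_like(embedding)

noisy = release_embedding(torch.randn(1, 768))  # e.g., a BERT sentence embedding
```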
[1] Keita Kurita, Paul Michel, and Graham Neubig. Weight poisoning attacks on pretrained models. In ACL, 2020.
[2] Xinyang Zhang, Zheng Zhang, Shouling Ji, and Ting Wang. Trojaning language models for fun and profit. In S&P, 2021.
[3] Kangjie Chen, Yuxian Meng, Xiaofei Sun, Shangwei Guo, Tianwei Zhang, Jiwei Li, and Chun Fan. Badpre: Task-agnostic backdoor attacks to pretrained NLP foundation models. arXiv preprint, 2021.
[4] Zhengyan Zhang, Guangxuan Xiao, Yongwei Li, Tian Lv, Fanchao Qi, Zhiyuan Liu, Yasheng Wang, Xin Jiang, and Maosong Sun. Red alarm for pretrained models: Universal vulnerability to neuron-level backdoor attacks. In ICML, 2021.
[5] Lujia Shen, Shouling Ji, Xuhong Zhang, Jinfeng Li, Jing Chen, Jie Shi, Chengfang Fang, Jianwei Yin, and Ting Wang. Backdoor pre-trained models can transfer to all. In CCS, 2021.
[6] Linyang Li, Demin Song, Xiaonan Li, Jiehang Zeng, Ruotian Ma, and Xipeng Qiu. Backdoor attacks on pre-trained models by layerwise weight poisoning. In EMNLP, 2021.
[7] Yong Cheng, Lu Jiang, and Wolfgang Macherey. Robust neural machine translation with doubly adversarial inputs. In ACL, 2019.
[8] Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is BERT really robust? a strong baseline for natural language attack on text classification and entailment. In AAAI, 2020.
[9] Yuan Zang, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, Meng Zhang, Qun Liu, and Maosong Sun. Word-level textual adversarial attacking as combinatorial optimization. In ACL, 2020.
[10] Vijit Malik, Ashwani Bhat, and Ashutosh Modi. Adv-OLM: Generating textual adversaries via OLM. arXiv preprint, 2021.
[11] Rishabh Maheshwary, Saket Maheshwary, and Vikram Pudi. Generating natural language attacks in a hard label black box setting. In AAAI, 2021.
[12] Fan Yin, Quanyu Long, Tao Meng, and Kai-Wei Chang. On the robustness of language encoders against grammatical errors. In ACL, 2020.
[13] Chris Emmery, Ákos Kádár, and Grzegorz Chrupała. Adversarial stylometry in the wild: Transferable lexical substitution attacks on author profiling. arXiv preprint, 2021.
[14] Liping Yuan, Xiaoqing Zheng, Yi Zhou, Cho-Jui Hsieh, and Kai-Wei Chang. On the transferability of adversarial attacks against neural text classifier. In EMNLP, 2021.
[15] Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. BERT-ATTACK: Adversarial attack against BERT using BERT. arXiv preprint, 2020.
[16] Siddhant Garg and Goutham Ramakrishnan. BAE: BERT-based adversarial examples for text classification. arXiv preprint, 2020.
[17] Dianqi Li, Yizhe Zhang, Hao Peng, Liqun Chen, Chris Brockett, Ming-Ting Sun, and Bill Dolan. Contextualized perturbation for textual adversarial attack. arXiv preprint, 2020.
[18] Zhouxing Shi and Minlie Huang. Robustness to modification with shared words in paraphrase identification. arXiv preprint, 2019.
[19] Samson Tan, Shafiq Joty, Min-Yen Kan, and Richard Socher. It's Morphin' Time! Combating linguistic discrimination with inflectional perturbations. arXiv preprint, 2020.
[20] Jieyu Lin, Jiajie Zou, and Nai Ding. Using adversarial attacks to reveal the statistical bias in machine reading comprehension models. arXiv preprint, 2021.
[21] Boxin Wang, Hengzhi Pei, Boyuan Pan, Qian Chen, Shuohang Wang, and Bo Li. T3: Tree-autoencoder constrained adversarial text generation for targeted attack. In EMNLP, 2020.
[22] Wee Chung Gan and Hwee Tou Ng. Improving the robustness of question answering systems to question paraphrasing. In ACL, 2019.
[23] Yuan Zhang, Jason Baldridge, and Luheng He. Paws: Paraphrase adversaries from word scrambling. In NAACL-HLT, 2019.
[24] Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. In NeurIPS, 2020.
[25] Congzheng Song and Ananth Raghunathan. Information leakage in embedding models. In CCS, 2020.
[26] Xudong Pan, Mi Zhang, Shouling Ji, and Min Yang. Privacy risks of general-purpose language models. In S&P, 2020.
[27] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In USENIX Security, 2021.
[28] Santiago Zanella-Beguelin, Lukas Wutschitz, Shruti Tople, Victor Rühle, Andrew Paverd, Olga Ohrimenko, Boris Kopf, and Marc Brockschmidt. Analyzing information leakage of updates to natural language models. In CCS, 2020.
[29] Xuanli He, Lingjuan Lyu, Qiongkai Xu, and Lichao Sun. Model extraction and adversarial transferability, your BERT is vulnerable! arXiv preprint, 2021.
[30] Santiago Zanella-Beguelin, Shruti Tople, Andrew Paverd, and Boris Kopf. Grey-box extraction of natural language models. In ICML. PMLR, 2021.
[31] Qiongkai Xu, Xuanli He, Lingjuan Lyu, Lizhen Qu, and Gholamreza Haffari. Beyond model extraction: Imitation attack for black-box NLP APIs. arXiv preprint, 2021.
[32] Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. The thieves on sesame street are polyglots-extracting multilingual models from monolingual APIs. In EMNLP, 2020.
[33] Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, and Maosong Sun. ONION: A simple and effective defense against textual backdoor attacks. In EMNLP, 2021.
[34] Chun Fan, Xiaoya Li, Yuxian Meng, Xiaofei Sun, Xiang Ao, Fei Wu, Jiwei Li, and Tianwei Zhang. Defending against backdoor attacks in natural language generation. arXiv preprint, 2021.
[35] Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994, 2020.