# adversarial-polyglots

**Repository Path**: mirrors_salesforce/adversarial-polyglots

## Basic Information

- **Project Name**: adversarial-polyglots
- **Description**: Code for the paper "Code-Mixing on Sesame Street: Dawn of the Adversarial Polyglots" (NAACL-HLT 2021)
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-05-26
- **Last Updated**: 2026-01-03

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Adversarial Polyglots
This repository contains code for the paper "[Code-Mixing on Sesame Street: Dawn of the Adversarial Polyglots](https://arxiv.org/abs/2103.09593)" (NAACL-HLT 2021).

Authors: [Samson Tan](https://samsontmr.github.io) and [Shafiq Joty](https://raihanjoty.github.io)


# Usage

## Adversarial Polyglots
Scripts for running `PolyGloss` and `Bumblebee` on NLI and QA datasets (MNLI/SQuAD formats) are in the `attacks` folder. Preprocessing scripts can be found in `scripts`. `PolyGloss` and `Bumblebee` return the _*adversarial*_ examples with the highest and lowest losses. The one that induced the lower loss _(minimally perturbed)_ is usually less perturbed, but the one that induced a higher loss _(maximally perturbed)_ should transfer more successfully to other models.

`PolyGloss` requires a dictionary constructed from the bilingual [MUSE dictionaries](https://github.com/facebookresearch/MUSE#ground-truth-bilingual-dictionaries). After downloading the dictionaries into a folder under `scripts` labeled `dictionaries`, run `scripts/create_polygloss_dictionaries_from_muse.py`.

`Bumblebee` requires a dictionary/JSON consisting of sentence-translations pairs. The `extract-xnli-sentences-to-dict.py` and `extract-xquad-questions-to-dict.py` scripts in `scripts` can be used to create these dictionaries for files in the MNLI and SQuAD formats (e.g., XNLI and XQuAD). JSONs for the XNLI test set can be found [here](https://github.com/salesforce/adversarial-polyglots-data).

## Code-mixed Adversarial Training
Code for generating code-mixed adversarial training (CAT) examples are in `adversarial-training`. Since the alignment step is the most time-consuming, we decouple it from the example perturbation step. Users can generate only the alignments by using the `--extract_phrases` option or load precomputed alignments via the `phrase_alignments` option.

Similar to `Bumblebee`, `Code-Mixer` requires a dictionary/JSON consisting of sentence-translations pairs. The `extract-xnli-sentences-to-dict.py` and `extract-xquad-questions-to-dict.py` scripts in `scripts` can be used to create these dictionaries for files in the MNLI and SQuAD formats (e.g., XNLI and XQuAD).

# Translated XNLI Data
We translated the [XNLI data](https://cims.nyu.edu/~sbowman/xnli) to 18 other languages using machine-translation systems (see paper for details). Translation script is in `scripts`. Translated data can be found [here](https://github.com/salesforce/adversarial-polyglots-data).

# Citation
Please cite the following if you use the code/data in this repository:
```
@inproceedings{tan-joty-2021-code-mixing,
    title = "Code-Mixing on Sesame Street: {D}awn of the Adversarial Polyglots",
    author = "Tan, Samson and Joty, Shafiq",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.282",
    pages = "3596--3616",
}

```