# Joint_Align: A Unified Framework for Cross-lingual Alignment and Joint Training

![Model](./illustration.png)

This repo contains the source code for our paper

>[Cross-lingual Alignment vs Joint Training: A Comparative Study and A Simple Unified Framework](https://arxiv.org/abs/1910.04708)
>Zirui Wang*, Jiateng Xie*, Ruochen Xu, Yiming Yang, Graham Neubig, Jaime Carbonell (*: equal contribution)
>ICLR 2020

## Introduction

Joint_Align is a unified framework for cross-lingual word embeddings (CLWE). The idea is to use unsupervised joint training as a coarse initialization and then apply alignment methods for refinement. Specifically, it contains three main components: (1) Joint Initialization, (2) Vocabulary Reallocation, and (3) Alignment Refinement. Please see our paper for details.

This repo covers two settings, applying Joint_Align to both non-contextualized and contextualized word embeddings. For non-contextualized embeddings, we show how to train them from scratch and provide scripts to evaluate them on two downstream tasks, BLI and cross-lingual NER. For contextualized embeddings, we provide an example of how to apply our framework to [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) and evaluate it on cross-lingual NER.

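
To give a feel for what an alignment refinement step can look like, here is a minimal sketch of orthogonal Procrustes alignment, a standard technique also used by MUSE. It is an illustration only: the function name `procrustes` and the toy data are assumptions, and this is not necessarily the exact procedure implemented in this repo's scripts.

```python
import numpy as np


def procrustes(src, tgt):
    """Return the orthogonal matrix W minimizing ||src @ W.T - tgt||_F.

    src, tgt: arrays of shape (num_pairs, dim); row i of src is assumed
    to be a translation pair with row i of tgt (hypothetical toy setup).
    """
    # The closed-form solution is U @ Vt, where U, S, Vt is the SVD of
    # the cross-covariance matrix tgt.T @ src.
    u, _, vt = np.linalg.svd(tgt.T @ src)
    return u @ vt


# Toy check: recover a known rotation from noiseless paired vectors.
rng = np.random.default_rng(0)
dim = 4
true_rot, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src = rng.normal(size=(50, dim))
tgt = src @ true_rot.T

W = procrustes(src, tgt)
aligned = src @ W.T  # source vectors mapped into the target space
```

In practice the paired rows would come from a seed dictionary or an induced one, and the toy rotation would be replaced by real source and target embeddings.
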
## Dependencies

* Python 3
* [NumPy](http://www.numpy.org/)
* [PyTorch](http://pytorch.org/)
* [fastText](https://github.com/facebookresearch/fastText)
* [MUSE](https://github.com/facebookresearch/MUSE)
* [fast_align](https://github.com/clab/fast_align)
* [fastBPE](https://github.com/glample/fastBPE)
* [transformers](https://github.com/huggingface/transformers)

To get started, run `./get_tools.sh`.

## I. Non-contextualized Word Embeddings

### Train embeddings

First, we assume access to a monolingual corpus, such as Wikipedia, for each language. Scripts such as [this one](https://github.com/facebookresearch/XLM/blob/master/get-data-wiki.sh) can be used to obtain the corpora.

The script `train_non_contextualized_embeddings.sh` shows how to use this code to learn cross-lingual non-contextualized word embeddings. It produces a joint_align embedding at `$PWD/word_embeddings/${src_lang}_${tgt_lang}/joint_align_embedding`, which can then be applied to downstream tasks.

### Application: Bilingual Lexicon Induction (BLI)

The script `example_BLI.sh` shows how to evaluate the learned cross-lingual non-contextualized word embeddings on the BLI task using the MUSE benchmark dataset. Note that it uses the official MUSE evaluation script, and the results correspond to Table 4 in our paper. To reproduce the results in Table 1, please use the following evaluation script (adapted from MUSE), which marks excluded test pairs as incorrect:

```
DICO_EVAL=/path/to/dico/${src_lang}-${tgt_lang}.5000-6500.txt
python evaluate_BLI.py --src_emb $SRC_OUTPUT_EMBED --tgt_emb $TGT_OUTPUT_EMBED --dico_path $DICO_EVAL
```

For Russian, please use this [code](https://github.com/facebookresearch/XLM/blob/master/tools/lowercase_and_remove_accent.py) to remove accents from the dictionary.

## II. Contextualized Word Embeddings

Joint_Align can be applied to Multilingual BERT by aligning its extracted features before feeding them to downstream models.

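
As a rough illustration of "aligning extracted features", the sketch below applies a linear mapping to per-token feature vectors before they are passed on. Everything here is hypothetical: `align_features`, the hidden size, and the random orthogonal stand-in for a learned alignment matrix are assumptions for the example, and the real pipeline lives in the scripts described in the following sections.

```python
import numpy as np


def align_features(features, mapping=None):
    """Map token features with a linear alignment matrix.

    features: array of shape (num_tokens, hidden), e.g. features
              extracted for one sentence.
    mapping:  array of shape (hidden, hidden), or None to leave the
              features unchanged (for a language that needs no mapping).
    """
    if mapping is None:
        return features
    # Each row x becomes mapping @ x.
    return features @ mapping.T


# Toy data standing in for extracted features of a 5-token sentence.
hidden = 8
rng = np.random.default_rng(0)
sentence_features = rng.normal(size=(5, hidden))

# Stand-in for a learned alignment matrix (random orthogonal matrix).
mapping, _ = np.linalg.qr(rng.normal(size=(hidden, hidden)))

aligned = align_features(sentence_features, mapping)
```

The aligned array has the same shape as the input, so a downstream model can consume it without any change to its input layer.
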
### Learn Alignment Matrix

First, we apply a word alignment tool such as [fast_align](https://github.com/clab/fast_align) to parallel data, and then learn alignment matrices from the features of the aligned words. To do so, simply run `./get_mapping.sh`.

### Application: Cross-lingual NER

Once we have the alignment matrices, we can use them to align the extracted features and feed the aligned features to downstream tasks. The steps can be found in `run_feature_ner.sh`.
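
To make the "Learn Alignment Matrix" step above more concrete, here is a minimal sketch of how word alignments could be turned into a feature mapping. It parses alignment lines in the `i-j` format that fast_align writes (source token index, target token index), gathers the corresponding feature pairs, and fits a linear map by least squares. The helper names, the toy features, and the choice of plain least squares are assumptions for illustration; this is not the content of `get_mapping.sh`.

```python
import numpy as np


def parse_alignment(line):
    """Parse a line such as '0-1 1-0 2-2' into (src_idx, tgt_idx) pairs."""
    return [tuple(int(k) for k in pair.split("-")) for pair in line.split()]


def collect_pairs(src_feats, tgt_feats, alignment_lines):
    """Stack features of aligned tokens across all sentence pairs."""
    xs, ys = [], []
    for src, tgt, line in zip(src_feats, tgt_feats, alignment_lines):
        for i, j in parse_alignment(line):
            xs.append(src[i])
            ys.append(tgt[j])
    return np.stack(xs), np.stack(ys)


def fit_linear_map(x, y):
    """Least-squares W such that W @ x_i is close to y_i for every pair."""
    m, *_ = np.linalg.lstsq(x, y, rcond=None)  # solves x @ m ~= y
    return m.T


# Toy parallel data: target features are a fixed linear map of the
# source features, with the first sentence's tokens reordered.
rng = np.random.default_rng(0)
hidden = 3
true_map = rng.normal(size=(hidden, hidden))
src_feats = [rng.normal(size=(3, hidden)), rng.normal(size=(2, hidden))]
tgt_feats = [
    (src_feats[0] @ true_map.T)[[1, 0, 2]],  # source 0 <-> target 1, etc.
    src_feats[1] @ true_map.T,
]
alignment_lines = ["0-1 1-0 2-2", "0-0 1-1"]

x, y = collect_pairs(src_feats, tgt_feats, alignment_lines)
W = fit_linear_map(x, y)
```

With real data, `src_feats` and `tgt_feats` would hold features extracted from the two sides of a parallel corpus, and a constrained fit (for example, an orthogonal mapping) may be preferable to unconstrained least squares.
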