# diamat

**Repository Path**: mirrors_alvations/diamat

## Basic Information

- **Project Name**: diamat
- **Description**: Machine Translation Diagnostics Tool
- **Primary Language**: Unknown
- **License**: BSD-3-Clause
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-09-24
- **Last Updated**: 2026-03-15

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# DiaMaT

An MT diagnostics tool. DiaMaT learns from corpora containing millions of translations but offers explanations at the sentence level.

![screenshot](https://github.com/DFKI-NLP/diamat/blob/master/resources/screenshot.png)

Code accompanying the paper:

```
@inproceedings{Schwarzenberg_tse_2019,
  title = {Train, Sort, Explain: Learning to Diagnose Translation Models},
  booktitle = {Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT): Demonstrations},
  author = {Schwarzenberg, Robert and Harbecke, David and Macketanz, Vivien and Avramidis, Eleftherios and M\"oller, Sebastian},
  location = {Minneapolis, Minnesota, USA},
  year = {2019}
}
```

## Info

A DiaMaT demo is hosted [here](http://diamat.dfki.de). It is optimized for the Firefox browser at a resolution of 1920x1080.

DiaMaT uses the [iNNvestigate](https://github.com/albermax/innvestigate) toolbox.

To facilitate replicating the experiments, cloning this repo downloads 500 MB of data directly from the GitHub LFS server.

## Installation

Download the fastText embeddings:

```
wget https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.de.vec
wget https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
```

Link wiki.de.vec and wiki.en.vec in ./data/input/word_vectors (see ./data/input/config.INI).

Use Python 3.6, e.g.:

```
conda create --name diamat python=3.6
```

Activate the environment:

```
source activate diamat
```

Install the requirements (if a CUDA GPU is available, install tensorflow-gpu==1.7.0):

```
pip install -r requirements.txt
```

Install the spaCy language models:

```bash
python -m spacy download de
python -m spacy download en
```

## Run / Replicate Experiments

Please first validate ./data/input/config.INI and unzip ./data/output.zip.
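For instance, a minimal sketch of the unzip step (assuming the archive lives at ./data/output.zip and contains the output/ folder, and that the fastText vectors were linked as described in the installation section):

```bash
# Unpack the preprocessed data; assumes output.zip contains the output/ folder
unzip ./data/output.zip -d ./data/

# Sanity check: the embedding links from the installation step should resolve
ls -L ./data/input/word_vectors/wiki.de.vec ./data/input/word_vectors/wiki.en.vec
```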
Run train.py and then explain.py; afterwards, link data/output/explain.jsonl into server/static/input, e.g.:

```
CUDA_VISIBLE_DEVICES=0 python3 train.py 2>&1 | tee -a ./data/output/fit_gpu_deeplee_quadro.log && \
CUDA_VISIBLE_DEVICES=0 python3 explain.py 2>&1 | tee -a ./data/output/fit_gpu_deeplee_quadro.log && \
ln -s ../../../data/output/explain.jsonl ./server/static/input/explain.jsonl 2>&1 | tee -a ./data/output/fit_local.log
```

Then start the server:

```
cd server && sh run_server.sh
```

Finally, visit localhost:5000, e.g.:

```
firefox localhost:5000
```

## Data

All preprocessed data needed to replicate the experiments is contained in ./data/output:

- ./data/output/train.jsonl: 1M JSON lines to train the text classifier
- ./data/output/val.jsonl: 100k JSON lines to validate the classifier during training
- ./data/output/xtrain.jsonl: 100k JSON lines to train the explainability method (if needed)
- ./data/output/explain.jsonl: ca. 20k JSON lines to test DiaMaT on (drawn from the official WMT test sets, excluding WMT13)

Note that ./data/output/explain.jsonl already contains the contributions from the experiments; these are overwritten when explain.py is run.

### How to use your own data

- Prepare three parallel text files: source.txt, machine.txt and human.txt.
- Then run:

```
python preprocess.py
```

Afterwards, ./data/output/bundle.jsonl contains the bundled texts, tokens, indices etc. If you would like to preprocess more data, rename bundle.jsonl, replace source.txt, machine.txt and human.txt, and repeat the step.

Update ./data/input/config.INI and run train.py and explain.py again. Update server/static/input/explain.jsonl with the new explanations and run the server again.
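The following is a minimal sketch of one such round with custom data, run from the repository root. The batch name bundle_batch1.jsonl is a hypothetical choice, and it is an assumption that preprocess.py picks up the three text files from locations set in ./data/input/config.INI:

```bash
# Hypothetical round with custom data; bundle_batch1.jsonl is an arbitrary name.
# Assumption: preprocess.py reads source.txt, machine.txt and human.txt from the
# locations configured in ./data/input/config.INI.
python preprocess.py
mv ./data/output/bundle.jsonl ./data/output/bundle_batch1.jsonl

# Swap in the next batch of parallel files and repeat if needed:
# python preprocess.py && mv ./data/output/bundle.jsonl ./data/output/bundle_batch2.jsonl

# After pointing config.INI at the new bundles, retrain and re-explain
python train.py
python explain.py

# Refresh the served explanations (-f replaces an existing link)
ln -sf ../../../data/output/explain.jsonl ./server/static/input/explain.jsonl
```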