Updated on 04/07/2021.
The baseline code for the shared task Triangular MT: Using English to improve Russian-to-Chinese machine translation.
All scripts should be run from the root folder:
bash **.sh
bash scripts/**.sh
python scripts/**.py
Requirements: a Linux machine with a GPU and CUDA >= 10.0 installed.
Install miniconda on your machine.
Run setup_env.sh in interactive mode:
bash -i setup_env.sh
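If you want to verify the prerequisites first, here is a minimal sketch using the standard NVIDIA and conda tools (these commands are not part of this repo):
# Check that a GPU is visible and which CUDA version is installed.
nvidia-smi
nvcc --version
# Check that conda (miniconda) is available.
conda --version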
You can modify environment.yml (lines 3-4) and setup_env.sh (line 9) for better speed.
To participate, please register for the shared task on Codalab.
Link to Codalab website.
We will use the toolkit tensor2tensor
to train a Transformer-based NMT system.
config/run.ru_zh.big.single_gpu.json
lists all the configurations.
{
  "version": "ru_zh.big.single_gpu",
  "processed": "ru_zh",
  "hparams": "transformer_big_single_gpu",
  "model": "transformer",
  "problem": "machine_translation",
  "n_gpu": 1,
  "eval_early_stopping_steps": 14500,
  "eval_steps": 10000,
  "local_eval_frequency": 1000,
  "keep_checkpoint_max": 10,
  "beam_size": [4],
  "alpha": [1.0]
}
The hyperparameter set is transformer_big_single_gpu.
We will use only 1 GPU.
The model will evaluate the dev loss and save a checkpoint every 1000 steps.
If the dev loss doesn't decrease for 14500 steps, training will stop.
When decoding the test set, we will use beam size 4 and an alpha value of 1.0.
The larger the alpha value, the longer the generated translations will be.
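For intuition, here is a minimal sketch of why alpha controls length, assuming tensor2tensor uses the GNMT-style length penalty lp = ((5 + length) / 6) ^ alpha from Wu et al. (2016), where beam scores are log-probability divided by lp:
# A larger alpha makes lp grow faster with length, so the per-word
# penalty on long hypotheses is discounted more and longer outputs win.
awk 'BEGIN { for (len = 10; len <= 40; len += 10)
  printf "len=%d  lp(alpha=0.6)=%.2f  lp(alpha=1.0)=%.2f\n",
         len, ((5 + len) / 6) ^ 0.6, ((5 + len) / 6) ^ 1.0 }'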
processed indicates the version of the processed files. Here is config/processed.ru_zh.json:
{
  "version": "ru_zh",
  "train": "train.ru_zh",
  "dev": "dev.ru_zh",
  "tests": [
    "dev.ru_zh"
  ],
  "bpe": true,
  "vocab_size": 30000
}
It indicates that the training folder is data/raw/train.ru_zh, the dev folder is data/raw/dev.ru_zh, and the test folder is data/raw/dev.ru_zh, i.e. we use the dev set as the test set.
The preprocessing pipeline will use byte-pair encoding (BPE), and the number of merge operations is 30000.
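For intuition, this is what learning and applying BPE with 30000 merge operations looks like with the subword-nmt tool (illustrative only; the repo's preprocessing scripts may use a different BPE implementation, and the file names here are hypothetical):
pip install subword-nmt
# Learn 30000 merge operations on the Russian side of the training data.
subword-nmt learn-bpe -s 30000 < train.ru > bpe.codes.ru
# Apply the learned merges; rare words are split into subword units
# joined by the "@@ " continuation marker.
subword-nmt apply-bpe -c bpe.codes.ru < train.ru > train.bpe.ru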
To train a Russian-to-Chinese NMT system:
conda activate mt_baseline
bash pipeline.sh config/run.ru_zh.big.single_gpu.json 1 4
1 is the start step and 4 is the end step.
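For example, assuming pipeline.sh accepts any sub-range of steps (an assumption; check the script itself), you could repeat only the final step:
# Hypothetical: re-run only step 4 (decoding) after training has finished.
bash pipeline.sh config/run.ru_zh.big.single_gpu.json 4 4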
After step 4, all the decoded results will be in folder data/run/ru_zh.big.single_gpu_tmp/decode:
decode.b4_a1.0.test0.txt: the decoded BPE subwords using beam size 4 and alpha value 1.0.
decode.b4_a1.0.test0.tok: the decoded tokens after merging the BPE subwords into whole words.
decode.b4_a1.0.test0.char: the decoded utf8 characters of decode.b4_a1.0.test0.tok after removing spaces.
bleu.b4_a1.0.test0.tok: the token-level BLEU score.
bleu.b4_a1.0.test0.char: the character-level BLEU score.
The reference files are in folder data/run/ru_zh.big.single_gpu_tmp/decode.
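The .txt-to-.tok conversion is the standard BPE merge; a hypothetical equivalent, assuming the usual "@@ " continuation marker:
# Merge BPE subwords back into whole words by removing "@@ " markers.
sed -r 's/(@@ )|(@@ ?$)//g' decode.b4_a1.0.test0.txt > decode.b4_a1.0.test0.tok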
We have released the dev set on Codalab. You can submit your system outputs on Codalab to get the BLEU score on the released dev set. You can also download the dev set by registering for the competition on Codalab.
Folder eval contains the evaluation scripts to calculate the character-level BLEU score:
cd eval
python bleu.py hyp.txt ref.txt
where hyp.txt and ref.txt can be either normal Chinese (i.e. without spaces between characters) or character-split Chinese.
See example.sh for detailed examples.
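Character-level BLEU amounts to splitting every character apart before scoring; a minimal sketch of that normalization (assuming a UTF-8 locale so sed treats each Chinese character as one ".", and assuming bleu.py applies the same split internally):
# Hypothetical pre-split: insert a space after every character, then score.
sed 's/./& /g' hyp.txt > hyp.char.txt
sed 's/./& /g' ref.txt > ref.char.txt
python bleu.py hyp.char.txt ref.char.txt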