# Knover

Knover is a toolkit for knowledge grounded dialogue generation based on PaddlePaddle. Knover allows researchers and developers to carry out efficient training/inference of large-scale dialogue generation models.

### What's New:

- July 2020: We are open-sourcing [PLATO-2](plato-2/README.md), a large-scale generative model with latent space for open-domain dialogue systems.

## Basic usage:

### Training

Carry out local training with a configuration file. You can select GPUs by setting `export CUDA_VISIBLE_DEVICES=XXX` in `./scripts/local/train.sh`, and other environment variables can be set in the same script.

```
bash ./scripts/local/train.sh ${TRAIN_CONF}
```

An example training configuration file is `./package/dialog_en/plato/24L_train.conf`. It contains three sections: `job`, `task` and `training` (an illustrative sketch of such a file is given at the end of this README).

#### job

This section defines:

- `job_script`: the main script of this task; use `./scripts/distributed/train.sh` for training tasks.

#### task

This section defines:

- `model`: the model class to use.
- `task`: the task name.
- `vocab_path`: path to the vocabulary file.
- tokenizer-related settings: `spm_model_file` for the SentencePiece tokenizer, and so on.
- dataset-related settings: `train_file`, `valid_file`, `data_format` and `file_format`.
- `config_path`: path to the model configuration file.

Choices of `data_format`:

- `raw`: untokenized tsv file, one field per column; example: `./data/train.tsv`.
- `tokenized`: tokenized tsv file; example: `./data/train_tokenized.tsv`, generated by `./tools/pre_tokenized.sh`.
- `numerical`: each line contains numerical data (`token_ids`, `type_ids`, `pos_ids` and, optionally, `role_ids`); example: `./data/train.numerical.tsv`, generated by `./tools/pre_numericalize.sh`.

Files with a `.gz` suffix (compressed with `gzip`) are also supported.

Choices of `file_format`:

- `file`: a single file.
- `filelist`: a list of files, one file path per line; example: `./data/train_filelist`.

#### training

This section defines training related settings:

- `init_params`: parameters to initialize from.
- `init_checkpoint`: checkpoint to initialize from (contains not only the parameters of the model, but also the persistables of the optimizer). If you continue training from, e.g., step 1000, you can also set `train_args="--start_step 1000"` for clearer log output, but this is not necessary.
- `batch_size`, `lr`, `num_epochs` and so on.
- `log_dir`: the output path of training logs, including the log file (`${log_dir}/workerlog.${DEV_ID}`) of each GPU trainer. If `log_dir=""`, all GPU trainers write to standard output.
- `save_path`: the output path of saved parameters.

You can pass additional arguments to the training script through `train_args`, for example:

```
train_args="--max_src_len 384 --max_seq_len 512"
```

## Disclaimer

This project aims to facilitate further research progress in dialogue generation. Baidu is not responsible for content generated by third parties with the pre-trained system.

## Contact information

For help or issues using Knover, please submit a GitHub issue.
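## Appendix: example training configuration (sketch)

To make the `job`, `task` and `training` sections described above concrete, here is a minimal sketch of what a training configuration file might look like. It assumes the `.conf` files are plain shell-style variable assignments sourced by the launch scripts (as the `train_args` snippets above suggest); the key names follow the fields listed in this README, but every value and path below is illustrative, not authoritative. Refer to `./package/dialog_en/plato/24L_train.conf` for the real format.

```
# Illustrative sketch only: key names follow the fields described in this
# README, but all values and paths are hypothetical; see
# ./package/dialog_en/plato/24L_train.conf for the authoritative example.

# job: which launch script to run.
job_script="./scripts/distributed/train.sh"

# task: model, task, tokenizer and dataset settings.
model="Plato"
task="DialogGeneration"
vocab_path="./package/dialog_en/vocab.txt"
spm_model_file="./package/dialog_en/spm.model"
train_file="./data/train.tsv"
valid_file="./data/valid.tsv"
data_format="raw"
file_format="file"
config_path="./package/dialog_en/plato/24L.json"

# training: initialization, optimization and output settings.
init_params="./models/PLATO-2/24L"
batch_size=8192
lr=1e-5
num_epochs=20
log_dir="./log"
save_path="./output"
train_args="--max_src_len 384 --max_seq_len 512"
```

With a file like this saved as, say, `my_train.conf` (a hypothetical name), local training would be started with `bash ./scripts/local/train.sh my_train.conf`, as shown in the Training section above.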