# gpt2_lm_model

**Repository Path**: 18071569361/gpt2_lm_model

## Basic Information

- **Project Name**: gpt2_lm_model
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-06-05
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# GPT2-LM-MODEL

##Description 

**主要功能：**

使用GPT-2训练中文语言模型，目前只支持单GPU。采用ignite框架，代码较清晰，易理解。

    参考代码：

    1）https://github.com/huggingface/transfer-learning-conv-ai.git
    2）https://github.com/Morizeyao/GPT2-Chinese.git

**结构描述：**

	./config/  : 保存tokenizer及模型的配置文件
	./data/	   : 训练数据
	./cache/   : 缓存文件夹
    ./model/   : 保存模型的文件夹
    ./src/     : 源码文件夹
    run_tokenize.sh ： 预编码脚本
    run_train.sh ：训练脚本
##Usage
**源码下载：**

	git clone https://github.com/xinjianlv/gpt2_lm_model.git

**使用：**

1）使用demo数据：

	a)训练对话语料：直接运行run_train.sh脚本即可：
	--dialogue 值为1时，训练对话语料 
	--dataset_path 为数据路径选项，指定了原始数据，则会直接加载原始数据进行处理。
	b)训练小说语料：需要先运行run_tokenize.sh脚本对小说数据进行预处理。之后再运行run_train.sh脚本。
	--tokenized_data_path ：run_tokenize.sh脚本预处理后的文件夹路径。
	
2）模型使用：

	./src/interact.py
	--model_file 模型文件
	--vocab_file tokenizer使用的词表文件