# GPT2

**Repository Path**: 18071569361/GPT2

## Basic Information

- **Project Name**: GPT2
- **Description**: An implementation of training for GPT2, supports TPUs
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-05-16
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# GPT2

**Disclaimer: This is not the official GPT2 implementation! I've done my best to follow the specifications of the original GPT2 model as closely as possible, but be warned that I have not been able to replicate the full performance of the original model with this code. I don't know why this is; I haven't been able to track down any bug that could be causing it.**

An implementation of training for [GPT2](https://openai.com/blog/better-language-models/) that supports both GPUs and TPUs. The dataset scripts are a bit hacky and will probably need to be adapted to your needs.

## Requirements

For GPUs:

`pip3 install tensorflow-gpu regex`

For TPUs:

`pip3 install tensorflow regex google-api-python-client oauth2client`

For downloading the models:

`pip3 install requests tqdm`

For generating the dataset (in addition to TensorFlow):

`pip3 install ftfy tqdm newspaper3k`

## Downloading Pretrained Models

If you want to use my models, I currently have "117M", "PrettyBig" and "1.5B" to offer. 117M was trained on a single v2 TPU for a week (probably less training than the original OpenAI model); PrettyBig is slightly bigger than 345M and was trained on a v2-256 pod for a week. ~~I was originally also planning to release my version of the 1.5B model, but have decided against it. You can read about my reasoning [here](https://medium.com/@NPCollapse/the-hacker-learns-to-trust-62f3c1490f51).~~ Since OpenAI has released their model, I have now also released my (inferior) 1.5B model, which was trained on a v3-512 pod for a week.

`python3 download_model.py PrettyBig`

This will create two directories, one named after the model and another named "encoder". Change the "model_dir" and "encoder_path" parameters in the .json corresponding to your model to point to these paths, respectively.

If you only want the encoder, use:

`python3 download_model.py encoder`

## Generating Text

To predict, you can either pass the prompt directly on the command line or have it read from a file. (This is useful for prompts that include newlines.) Text is output to the console and to the file specified in the "predict_path" parameter. You need a model checkpoint and a copy of the BPE encoder at an accessible location for this to work. (Change the "model_dir" and "encoder_path" parameters in the .json.)

From the command line:

`python3 main.py --model Your-Model.json [--top_k Top-K-Truncation] --predict_text "Hello there! My name is"`

From a file:

`python3 main.py --model Your-Model.json [--top_k Top-K-Truncation] --predict_file input.txt`

The optional top_k parameter causes the model to only consider the top k most likely tokens at each step. Setting it to around 40 tends to produce better results, but with less variety. Prediction on TPUs is not supported.

## Training

To train a model, define its parameters in a .json file (see the examples, and the sketch below) and then simply call

`python3 main.py --model Your-Model.json [--tpu Your-TPU-Name]`

Using a TPU is optional; the code runs fine on GPUs without modification. (Note: evaluation doesn't work on TPU pods and must be commented out.)
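As a rough starting point, a training .json might look like the sketch below. This is not one of the repository's example files; the keys are the parameters documented under "Explanation of Parameters" further down, and the values (a 117M-sized model, hypothetical gs:// paths, batch sizes and step counts) are placeholders you should replace with your own.

```json
{
    "model": "GPT2",
    "model_path": "gs://my-bucket/models/my-117m",
    "data_path": "gs://my-bucket/datasets/openwebtext",
    "encoder_path": "gs://my-bucket/encoder",
    "predict_path": "predictions.txt",

    "n_ctx": 1024,
    "n_vocab": 50257,
    "n_embd": 768,
    "n_layer": 12,
    "n_head": 12,

    "input": "openwebtext",
    "train_batch_size": 8,
    "eval_batch_size": 8,
    "predict_batch_size": 1,

    "lr": 0.00025,
    "warmup_steps": 2000,
    "opt_name": "adam",
    "weight_decay": 0.01,
    "train_steps": 10000,
    "eval_steps": 10,
    "max_steps": 500000,
    "iterations": 500,

    "embed_dropout": 0.1,
    "attn_dropout": 0.1,
    "res_dropout": 0.1
}
```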
Training assumes you have a version of the openwebtext corpus stored in an accessible location. If you don't, see below for how to generate your own version.

## Generating the Dataset

GPT2 is trained on the webtext corpus, which is basically all websites linked to from Reddit with at least 3 karma. Since the dataset is huge and contains a lot of copyrighted material, I can't provide a download here. Instead, I'll describe how I got it. Be aware it cost me around 500€ in cloud compute resources to download and process the whole thing, but I'm not claiming I was optimally efficient.

1. Use the download script from [here](https://github.com/jcpeterson/openwebtext) to download the archives (I used the prefiltered URLs file).
2. Use *datasets/openwebtext/run_newspaper_extract.py* to extract the text.
3. Once you have the raw .txt files, use *datasets/openwebtext/create_tfrecords.py* to encode them into .tfrecords files (requires a copy of the encoder, see Downloading Pretrained Models).
4. Place the .tfrecords files into an accessible folder or Google Storage bucket (placing them in a Google Storage bucket is mandatory if you're using TPUs).
5. Change the "data_path" parameter in your .json to point to where your .tfrecords files are located and, if necessary, adapt the functions in *inputs.py* to open the correct filenames, in case you changed them.

## Using Your Own Data

You can also use your own text files as training data, but you'll need to modify some code by hand.

1. Modify the parameters in *datasets/openwebtext/create_tfrecords.py*:

```python
base_dir = "/home/connor/my_text_dir"  # Path to where your .txt files are located
files_per = 175000  # How many txt files to put in one tfrecord, not too important
name = "my-custom-data"  # Output files will be named name_i.tfrecords, where i is the number of the file
output_dir = "/home/connor/output"  # Where to place the .tfrecords files
log_dir = "logs"  # Some logs will be placed here to support restarting if the encoding is interrupted
files = glob.glob(os.path.join(base_dir, "**/*.txt"))  # This needs to result in a list of paths to all of your txt files
processes = 64  # Number of encoding processes to run
encoder_path = "/home/connor/encoder"  # Path to encoder files
minimum_size = 128  # The minimum length (in BPE tokens) a file is allowed to have, otherwise it is discarded
```

2. Run the script. This will produce a bunch of name_i.tfrecords files. Put these somewhere accessible (they must be in a Google Storage bucket if you're using TPUs).

3. Create a new input function in *inputs.py*. Any input function should have the signature *function_name(params, eval=False)*. The **stitch** value controls how many texts are concatenated so that you never end up with a sample that is too small. It should be **ceil((n_ctx+1) / minimum_size)**; for example, with a minimum size of 128 and an n_ctx of 1024, stitch should be 9 (see the sketch after this list).

```python
def my_input(params, eval=False):
    if not eval:
        numbers = [0, 3, 4, 5, 6, 7, 8, 9]  # A random subset of files for train
    else:
        numbers = [1, 2]  # Random subset for eval
    files = [os.path.join(params["data_path"], "my-custom-data_{}.tfrecords".format(str(i))) for i in numbers]  # Generates the list of files

    return bpe_text(params["batch_size"], files, amount=params["n_ctx"], iterations=params["iterations"],
                    stitch=9, batch=True)
```

4. Register your new input in *main.py*:
```python
inputs = {
    "openwebtext": openwebtext,  # Standard OpenWebtext input
    "openwebtext_longbiased": openwebtext_longbiased,  # OpenWebtext with a bias towards showing more long (>512 tokens) examples
    "openwebtext_long": openwebtext_long,  # OpenWebtext that only shows long examples
    "my_input": my_input,
}
```

5. Set your .json to use the new input:

```python
[...]
"iterations": 500,
"n_embd": 768,
"input": "my_input",
"model": "GPT2",
[...]
```

6. You're done. The input described here should be as close to GPT2 as possible and run perfectly on TPUs.
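If you use a different n_ctx or minimum_size, recompute the stitch value with the **ceil((n_ctx+1) / minimum_size)** rule from step 3. A tiny helper for this (hypothetical, not part of the repository):

```python
import math

def stitch_for(n_ctx, minimum_size):
    # A training sample needs n_ctx + 1 tokens, so concatenate enough
    # minimum-size texts to always cover that length.
    return math.ceil((n_ctx + 1) / minimum_size)

print(stitch_for(1024, 128))  # -> 9, matching the example in step 3
```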
## Explanation of Parameters

Because passing two dozen parameters over the command line would be tedious, you pass all the model parameters in a .json file. Note that any paths also support Google Storage paths and *must* be gs:// paths if you're running on TPUs.

Values you'll definitely want to change:

* **model_path**: Where to save and load checkpoints from
* **data_path**: Where your .tfrecords files are located
* **encoder_path**: Path to the BPE encoder files. To get this, use the download_model.py script to download any model (or just the encoder). You will get a folder called "encoder"; this is what you want this to point to (only required for prediction)

Values you'll probably want to change:

* **train_batch_size**: Batch size during the training phase
* **eval_batch_size**: Batch size during evaluation
* **predict_batch_size**: Batch size during prediction
* **predict_path**: Where to save predictions (point this to a text file to append to)

Model parameters:

* **model**: A string that refers to which model to use. This should always just be "GPT2" (no other models are implemented here)
* **n_ctx**: Number of tokens the model looks at (default: 1024)
* **n_vocab**: Size of the vocabulary (default: 50257)
* **n_embd**: Dimension of the embedding layers
* **n_layer**: Number of layers in the model
* **n_head**: Number of attention heads (default: n_embd / 64)
* **scale_by_depth**: Whether to scale the init by the number of layers (default: true)
* **scale_by_in**: Whether to scale the init by the number of input channels (default: true)

Training parameters:

* **precision**: Whether to use float32 or bfloat16 variables (use "bfloat16" when training very large models) (optional, defaults to float32)
* **input**: Which input function to use (default: "openwebtext")
* **lr**: Learning rate (default: 0.00025)
* **warmup_steps**: Number of warmup steps. If this is set, a linear warmup + cosine decay schedule is used (default: 2000) (optional); see the sketch after this list
* **opt_name**: Name of the optimizer; currently there are "adam" and "adafactor" (default: "adam")
* **weight_decay**: Weight decay parameter; if not present, no weight decay is used (the weight decay fix for Adam is used) (default: 0.01) (optional)
* **beta1**: Adam/Adafactor beta1 parameter (adam default: 0.9, adafactor default: 0.0)
* **beta2**: Adam/Adafactor beta2 parameter (default: 0.98) (optional for adafactor with pow decay type)
* **epsilon**: Adam epsilon parameter (default: 1e-9)
* **decay_type**: Adafactor decay type, either "pow" or "adam" (default: "pow")
* **decay_exponent**: Adafactor pow decay exponent (default: 0.8)
* **train_steps**: Number of training steps to take between evaluations
* **eval_steps**: Number of steps per evaluation
* **max_steps**: The maximum number of training steps (important for the decaying lr)
* **iterations**: Number of iterations to perform on TPUs (default: 100) (only required for TPUs)
* **embed_dropout**: Dropout chance on the word embedding, set to 0 to disable (default: 0.1)
* **attn_dropout**: Dropout chance on attention layers, set to 0 to disable (default: 0.1)
* **res_dropout**: Dropout chance on residual connections, set to 0 to disable (default: 0.1)
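For intuition, the lr, warmup_steps and max_steps values combine into a schedule roughly like the following. This is an illustrative sketch of a linear warmup followed by cosine decay, assuming decay to zero at max_steps; it is not the repository's exact implementation.

```python
import math

def learning_rate(step, lr=0.00025, warmup_steps=2000, max_steps=500000):
    """Sketch of a linear-warmup + cosine-decay schedule (assumed shape, not the repo's exact code)."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the base learning rate
        return lr * step / warmup_steps
    # Cosine decay from the base learning rate down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return lr * 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))
```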