# PRoBERTa

**Repository Path**: guanwen-chen/PRoBERTa

## Basic Information

- **Project Name**: PRoBERTa
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-04-08
- **Last Updated**: 2022-04-12

## README


# PRoBERTa
Ananthan Nambiar, Maeve Heflin, Simon Liu, Sergei Maslov, Mark Hopkins, Anna Ritz

## Notes
- Links to Google Drive folders:

[BPE model](https://drive.google.com/drive/folders/1lJkG4IAWxSs8mGqSk-MjsaBQFV4Y3dhq?usp=sharing)

[pretraining data](https://drive.google.com/drive/folders/1glEeWDS0HoE_kYXqwySqoPpTuYstG_Dw?usp=sharing)

[family data](https://drive.google.com/drive/folders/1iXMEZuzDV1D0tstP6o_MjKzGab56_bOU?usp=sharing)

[conservative ppi data](https://drive.google.com/drive/folders/1XTCaBZOygCwoQmOZIt17dsS1XGVBq2Hu?usp=sharing)

[aggressive data](https://drive.google.com/drive/folders/1f2Z2fddQmwmVC30owNseE0lC8MSDEXBG?usp=sharing)

[pretrained weights](https://drive.google.com/drive/u/2/folders/1TbFjyRfbkLgJ_rlvO1SFB-ZvwQyykvK7)

[protein family finetuned weights](https://drive.google.com/drive/folders/1aro0V5yQgR53kLqKr9wYxjv_tVdbiw_m?usp=sharing)

[ppi conservative finetuned (20%) weights](https://drive.google.com/drive/folders/1D58Bzxm_t-MaRu7QQmb7VRlwfEvxytnD?usp=sharing)

[ppi conservative finetuned (100%) weights](https://drive.google.com/drive/folders/1djwhZE66N6SGT7qh-0yhzplpM4_H6CE-?usp=sharing)

[ppi aggressive finetuned (20%) weights](https://drive.google.com/drive/folders/1n58G7b_2ks_TPKh52FbuRrZW0LUT80t8?usp=sharing)

[ppi aggressive finetuned (100%) weights](https://drive.google.com/drive/folders/1lkzIa2DwYKiyP6TIglE2mAbSM1NnDDrB?usp=sharing)


## Requirements and Installation
Install the sentencepiece tokenizer:
```bash
pip3 install sentencepiece
```
Build fairseq from source using [the linked repository](https://github.com/imonlius/fairseq.git):
```bash
git clone https://github.com/imonlius/fairseq.git
cd fairseq
pip3 install --editable . --no-binary cffi
```
### tokenizer.py
Train a tokenizer and tokenize data for protein family and interaction fine-tuning

#### Example usage:
```bash
python3 tokenizer.py
```

- Variables to change in the script before running it:

| Name | Description |
| ----- | --------------------------------------- |
| path | Path to the protein family data. This should be a .tab file with "Sequence" and "Protein families" as two of the columns |
| int_path | Path to protein interaction data. This should be a json file with 'from', 'to' and 'link' for each interaction |
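
The two input formats described above can be sketched as follows. This is only an illustration of the expected columns and keys, not the code in `tokenizer.py`. The helper names and the sample records are made up, and the assumption that the JSON file is a list of objects is a guess based on the description.

```python
import csv
import io
import json


def read_family_rows(tab_text):
    """Yield (sequence, families) pairs from tab-separated text with a header row."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    for row in reader:
        # Other columns may be present; only these two are read here.
        yield row["Sequence"], row["Protein families"]


def read_interactions(json_text):
    """Yield (from, to, link) triples; assumes a JSON list of objects."""
    for record in json.loads(json_text):
        yield record["from"], record["to"], record["link"]


# Tiny made-up inputs, for illustration only.
family_tab = "Entry\tSequence\tProtein families\nP00001\tMKTAYIAK\tExample family\n"
interactions_json = '[{"from": "MKTAYIAK", "to": "MSLLTEVE", "link": 1}]'

print(list(read_family_rows(family_tab)))
print(list(read_interactions(interactions_json)))
```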

### pRoBERTa_pretrain.sh
Pre-train the RoBERTa model.

#### Example Usage:
```bash
bash pRoBERTa_pretrain.sh pretrain 4 pretrained_model \
        pretraining/split_binarized/ \
        768 5 125000 3125 0.0025 32 64 3
```
- Arguments

| Name | Description | Example |
| ----- | ------------------------------------------ | ------ |
| PREFIX | Prefix for the model output files | pretrain |
| NUM_GPUS | Number of GPUs to be used during pretraining | 4 |
| OUTPUT_DIR | Output directory | [pretrained_model](https://drive.google.com/drive/u/2/folders/1fyb3RklnVWAUwajv20BP5smq9ypDgMl9) |
| DATA_DIR | Binarized input data directory | [pretraining/split_binarized/](https://drive.google.com/drive/u/2/folders/1inKxRuf5f3JBM2YDO1dQc-gTsdMn6VGR) |
| ENCODER_EMBED_DIM | Dimension of embedding generated by the encoders | 768 |
| ENCODER_LAYERS | Number of encoder layers in the model | 5 |
| TOTAL_UPDATES | Total (maximum) number of updates during training | 125000 |
| WARMUP_UPDATES | Total number of LR warm-up updates during training | 3125 |
| PEAK_LEARNING_RATE | Peak learning rate for training | 0.0025 |
| MAX_SENTENCES | Maximum number of sequences in each batch | 32 |
| UPDATE_FREQ | Updates the model every UPDATE_FREQ batches | 64 |
| PATIENCE | Early stop training if valid performance doesn’t improve for PATIENCE consecutive validation runs | 3 |
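
To see how the batch and schedule arguments interact, here is a rough sketch. It assumes the script forwards `MAX_SENTENCES` and `UPDATE_FREQ` to fairseq unchanged, where the batch size is per GPU and gradients are accumulated over `UPDATE_FREQ` batches. The learning-rate shape (linear warm-up, then linear decay to zero) is also an assumption; the script may configure a different scheduler.

```python
def effective_batch_size(max_sentences, update_freq, num_gpus):
    """Sequences contributing to one optimizer update (per-GPU batch x accumulation x GPUs)."""
    return max_sentences * update_freq * num_gpus


def learning_rate(step, peak_lr, warmup_updates, total_updates):
    """Linear warm-up to peak_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_updates:
        return peak_lr * step / warmup_updates
    remaining = max(total_updates - step, 0)
    return peak_lr * remaining / (total_updates - warmup_updates)


# Values from the example command above.
print(effective_batch_size(32, 64, 4))  # 8192
for step in (0, 1000, 3125, 60000, 125000):
    print(step, learning_rate(step, 0.0025, 3125, 125000))
```

With the example values, warm-up covers 3125 / 125000 = 2.5% of the updates.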


### pRoBERTa_finetune_ppi.sh: 
Fine-tune RoBERTa model for Protein Interaction Prediction Task

#### Example Usage:
```bash
bash pRoBERTa_finetune_ppi.sh ppi 4 ppi_prediction \
        ppi_prediction/split_binarized/robustness_minisplits/0.80/ \
        768 5 12500 312 0.0025 32 64 2 3 \
        pretraining/checkpoint_best.pt \
        no
```
- Arguments

| Name | Description | Example |
| ----- | ------------------------------------------ | ------ |
| PREFIX | Prefix for the model output files | ppi |
| NUM_GPUS | Number of GPUs to use for finetuning | 4 |
| OUTPUT_DIR | Model output directory | [ppi_prediction](https://drive.google.com/drive/u/2/folders/1mS34_2YTBh2wZuvn9QF7m0254bnc2LE_) |
| DATA_DIR | Binarized input data directory | [ppi_prediction/split_binarized/robustness_minisplits/1.00](https://drive.google.com/drive/u/2/folders/1kjNnud51AIPu_eeuqdapHHE-GVoaHfZm) |
| ENCODER_EMBED_DIM | Dimension of embedding generated by the encoders | 768 |
| ENCODER_LAYERS | Number of encoder layers in the model | 5 |
| TOTAL_UPDATES | Total (maximum) number of updates during training | 12500 |
| WARMUP_UPDATES | Total number of LR warm-up updates during training | 312 |
| PEAK_LEARNING_RATE | Peak learning rate for training | 0.0025 |
| MAX_SENTENCES | Maximum number of sequences in each batch | 32 |
| UPDATE_FREQ | Updates the model every UPDATE_FREQ batches | 64 |
| NUM_CLASSES | Number of output classes for the classification head (argument name assumed; this is the value passed between UPDATE_FREQ and PATIENCE in the example) | 2 |
| PATIENCE | Early stop training if valid performance doesn’t improve for PATIENCE consecutive validation runs | 3 |
| PRETRAIN_CHECKPOINT | Path to pretrained model checkpoint | [pretraining/checkpoint_best.pt](https://drive.google.com/drive/u/2/folders/1TbFjyRfbkLgJ_rlvO1SFB-ZvwQyykvK7) |
| RESUME_TRAINING | Whether to resume training from previous finetuned model checkpoints | no |

### pRoBERTa_finetune_pfamclass.sh:
Fine-tune RoBERTa model for Family Classification Task

#### Example Usage:
```bash
bash pRoBERTa_finetune_pfamclass.sh family 4 family_classification \
        family_classification/split_binarized/robustness_minisplits/1.00 \
        768 5 12500 312 0.0025 32 64 4083 3 \
        pretraining/checkpoint_best.pt \
        no
```
- Arguments

| Name | Description | Example |
| ----- | ------------------------------------------ | ------ |
| PREFIX | Prefix for the model output files | family |
| NUM_GPUS | Number of GPUs to use for finetuning | 4 |
| OUTPUT_DIR | Model output directory | [family_classification](https://drive.google.com/drive/u/2/folders/1EGvJEAVDfPb1gcxPUsr92Tan9rPgasGm) |
| DATA_DIR | Binarized input data directory | [family_classification/split_binarized/robustness_minisplits/1.00](https://drive.google.com/drive/u/2/folders/1VxNHbwWqVZsnnZwA-6gjFtxkXB55tX3y) |
| ENCODER_EMBED_DIM | Dimension of embedding generated by the encoders | 768 |
| ENCODER_LAYERS | Number of encoder layers in the model | 5 |
| TOTAL_UPDATES | Total (maximum) number of updates during training | 12500 |
| WARMUP_UPDATES | Total number of LR warm-up updates during training | 312 |
| PEAK_LEARNING_RATE | Peak learning rate for training | 0.0025 |
| MAX_SENTENCES | Maximum number of sequences in each batch | 32 |
| UPDATE_FREQ | Updates the model every UPDATE_FREQ batches | 64 |
| NUM_CLASSES | Number of output classes for the classification head (argument name assumed; this is the value passed between UPDATE_FREQ and PATIENCE in the example) | 4083 |
| PATIENCE | Early stop training if valid performance doesn’t improve for PATIENCE consecutive validation runs | 3 |
| PRETRAIN_CHECKPOINT | Path to pretrained model checkpoint | [pretraining/checkpoint_best.pt](https://drive.google.com/drive/u/2/folders/1TbFjyRfbkLgJ_rlvO1SFB-ZvwQyykvK7) |
| RESUME_TRAINING | Whether to resume training from previous finetuned model checkpoints | no |
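
The `PATIENCE` rule can be sketched as below. This is a simplified illustration, not the fairseq implementation. It assumes the validation metric is a loss (lower is better), and the function name is hypothetical.

```python
def stopping_step(valid_losses, patience):
    """Return the index of the validation run at which training stops, or None."""
    best = float("inf")
    runs_without_improvement = 0
    for i, loss in enumerate(valid_losses):
        if loss < best:
            # New best value: reset the counter.
            best = loss
            runs_without_improvement = 0
        else:
            runs_without_improvement += 1
            if runs_without_improvement >= patience:
                return i
    return None


# Made-up validation losses: the best value is at index 1,
# followed by three runs without improvement.
print(stopping_step([1.0, 0.9, 0.95, 0.92, 0.91, 0.8], patience=3))  # 4
```

In this made-up run, training stops before it reaches the later improvement to 0.8, which is the trade-off a small `PATIENCE` makes.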


### Clustering/protein_family_clustering_loop.py
Cluster proteins using k-means and calculate the normalized mutual information (NMI) between the clusters and the protein families. Before running this script, make sure to download roberta.base and the relevant checkpoints.

#### Example Usage:
```bash
python3 protein_family_clustering_loop.py
```
- Variables to change in the script before running it:

| Name | Description |
| ----- | ----------------------------------- |
| tokenized_data_filepath | Input data filepath. This file has to contain tokenized protein sequences in a 'Tokenized Sequence' column, and the family each protein belongs to in a 'Protein families' column. Any other columns in this file will be ignored. |
| roberta_weights | Model weights to load. Choose the pretrained or fine-tuned weights, depending on which model you are evaluating |
| EMBEDDING_SIZE | Should match the PRoBERTa model size |
| USE_NULL_MODEL | Whether to use random cluster prediction instead of k-means clustering |
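
For reference, NMI between family labels and cluster assignments can be computed as in the sketch below. It uses only the standard library and normalizes by the arithmetic mean of the two entropies, which is an assumption; the script may rely on a library routine with a different normalization. The labels are made up, and the last lines mimic what a random-assignment null model would look like.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalization: 2 * I(a; b) / (H(a) + H(b))."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0:
        return 1.0  # both labelings put everything in one group
    return 2 * mutual_information(a, b) / (h_a + h_b)


# Made-up family labels and cluster ids.
families = ["kinase", "kinase", "globin", "globin"]
clusters = [1, 1, 0, 0]
print(nmi(families, clusters))

# Null model: assign each protein to a random cluster instead of using k-means.
rng = random.Random(0)
random_clusters = [rng.randrange(2) for _ in families]
print(nmi(families, random_clusters))
```

NMI does not depend on how the clusters are numbered, so the first call scores a perfect match even though the cluster ids are arbitrary.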


### pRoBERTa_evaluate_family_batch.py: 
Predict families using fine-tuned RoBERTa model

#### Example Usage:
```bash
python3 pRoBERTa_evaluate_family_batch.py family_classification/split_tokenized/full/Finetune_fam_data.split.test.10 \
	family_classification/split_binarized/robustness_minisplits/1.00/ \
	predictions.tsv \
	family_classification/checkpoints/ \
	protein_family_classification 256
```
- Arguments


| Name | Description | Example |
| ----- | ------------------------------------------ | ------ |
| DATA | Path to input examples to predict. This should be formatted as a CSV with the columns, in order: tokenized sequence, true family label | [family_classification/split_tokenized/full/Finetune_fam_data.split.test.10](https://drive.google.com/drive/u/2/folders/1CvZPrtqs_JqxJVG3Fk-2FwUEC5R7NNUU) |
| BINARIZED_DATA | Path to binarized family data | [family_classification/split_binarized/robustness_minisplits/1.00/](https://drive.google.com/drive/u/2/folders/1VxNHbwWqVZsnnZwA-6gjFtxkXB55tX3y) |
| OUTPUT | Path to output file with model predictions | [predictions.tsv](https://drive.google.com/drive/u/2/folders/10gpJUzyjPCT12GfqUexOFcjFoTW9Rcr4) |
| MODEL_FOLDER | Model checkpoints folder. Will use checkpoint_best.pt file in the folder. | [family_classification/checkpoints/](https://drive.google.com/drive/u/2/folders/1JgEfybT6wT8MGzaxgAUKI7dH6W0oWLBn) |
| CLASSIFICATION_HEAD_NAME | Name of the trained classification head | protein_family_classification |
| BATCH_SIZE | Batch size for prediction | 256 |
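
As an illustration of the `DATA` layout and of what `BATCH_SIZE` controls, the sketch below reads two-column CSV rows and groups them into batches. It is not the evaluation script itself; the helper names and the sample rows (including the token strings) are made up.

```python
import csv
import io


def read_examples(csv_text):
    """Return (tokenized_sequence, true_family) pairs from two-column CSV text."""
    return [(row[0], row[1]) for row in csv.reader(io.StringIO(csv_text)) if row]


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up rows: tokenized sequence, true family label.
sample = "MK TAY IAK,Family_A\nMS LLT EVE,Family_B\nMA GGT,Family_A\n"
examples = read_examples(sample)
for batch in batches(examples, batch_size=2):
    print(len(batch), [label for _, label in batch])
```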

### pRoBERTa_evaluate_ppi_batch.py: 
Predict PPI using fine-tuned RoBERTa model

#### Example Usage:
```bash
python3 pRoBERTa_evaluate_ppi_batch.py ppi_prediction/split_tokenized/full/Finetune_interact_tokenized.split.test.10 \
	ppi_prediction/split_binarized/robustness_minisplits/1.00/ \
	predictions.tsv \
	ppi_prediction/checkpoints/ \
	protein_interaction_prediction 256
```
- Arguments:


| Name | Description | Example |
| ----- | ------------------------------------------ | ------ |
| DATA | Path to input examples to predict. This should be formatted as a CSV with the columns, in order: tokenized from sequence, tokenized to sequence, true label | [ppi_prediction/split_tokenized/full/Finetune_interact_tokenized.split.test.10](https://drive.google.com/drive/u/2/folders/1GxGGOqQz5LvlLoTW3EnuEEr7fKwmu8ju) |
| BINARIZED_DATA | Path to binarized PPI data | [ppi_prediction/split_binarized/robustness_minisplits/1.00/](https://drive.google.com/drive/u/2/folders/1kjNnud51AIPu_eeuqdapHHE-GVoaHfZm) |
| OUTPUT | Path to output file with model predictions | [predictions.tsv](https://drive.google.com/drive/u/2/folders/1mS34_2YTBh2wZuvn9QF7m0254bnc2LE_) |
| MODEL_FOLDER | Model checkpoints folder. Will use checkpoint_best.pt file in the folder. | [ppi_prediction/checkpoints/](https://drive.google.com/drive/u/2/folders/1PvcqbJbgjUNMgoYhTNCsZ_a2oEAIBjxV) |
| CLASSIFICATION_HEAD_NAME | Name of the trained classification head | protein_interaction_prediction |
| BATCH_SIZE | Batch size for prediction | 256 |

### shuffle_and_split_pretrain.sh:
Shuffle and split pretraining data file into training, validation, and test data files.

#### Example Usage:
```bash
bash shuffle_and_split_pretrain.sh pretraining/tokenized_seqs_v1.txt \
	pretraining/split_tokenized/ \
	tokenized_seqs_v1
```
- Arguments:


| Name | Description | Example |
| ----- | ------------------------------------------ | ------ |
| INPUT | Input file. Each line should be an example. | [pretraining/tokenized_seqs_v1.txt](https://drive.google.com/drive/u/2/folders/1glEeWDS0HoE_kYXqwySqoPpTuYstG_Dw) |
| OUTPUT | Output directory | [pretraining/split_tokenized/](https://drive.google.com/drive/u/2/folders/10HwZrooDzT3wsY3w7GNRgwA5UFzVJ0Dj) |
| PREFIX | Prefix for output files | tokenized_seqs_v1 |
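
The file names used later in this README (`.split.train.80`, `.split.valid.10`, `.split.test.10`) suggest an 80/10/10 split. A minimal in-memory sketch of that idea follows. The seed, the function name, and the way remainders are assigned are assumptions rather than the behavior of the shell script.

```python
import random


def shuffle_and_split(lines, prefix, seed=0):
    """Shuffle lines and split them 80/10/10, keyed by the output file names."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    return {
        f"{prefix}.split.train.80": shuffled[:n_train],
        f"{prefix}.split.valid.10": shuffled[n_train:n_train + n_valid],
        # Any remainder from integer division ends up in the test split.
        f"{prefix}.split.test.10": shuffled[n_train + n_valid:],
    }


# Ten made-up examples, one per line.
splits = shuffle_and_split([f"seq{i}" for i in range(10)], "tokenized_seqs_v1")
for name, part in splits.items():
    print(name, len(part))
```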

### shuffle_and_split.sh:
Shuffle and split finetuning data file into training, validation, and test data files.

#### Example Usage:
```bash
bash shuffle_and_split.sh family_classification/Finetune_fam_data.csv \
	family_classification/split_tokenized/full/ \
	Finetune_fam_data
```
- Arguments:

| Name | Description | Example |
| ----- | ------------------------------------------ | ------ |
| INPUT | Input file. Each line should be an example. | [family_classification/Finetune_fam_data.csv](https://drive.google.com/drive/u/2/folders/1mcDfv_rHsYltIW5CimMmzu7_BF7rHiN2) |
| OUTPUT | Output directory | [family_classification/split_tokenized/full/](https://drive.google.com/drive/u/2/folders/1CvZPrtqs_JqxJVG3Fk-2FwUEC5R7NNUU) |
| PREFIX | Prefix for output files | Finetune_fam_data |

### percentage_splits.sh
Generate output files with a certain percentage of the input data file

#### Example Usage:
```bash
bash percentage_splits.sh family_classification/split_tokenized/full/Finetune_fam_data.split.train.80 \
	family_classification/split_tokenized/full/robustness_split \
	Finetune_fam_data
```
- Arguments:

| Name | Description | Example |
| ----- | ------------------------------------------ | ------ |
| INPUT | Input file | [family_classification/split_tokenized/full/Finetune_fam_data.split.train.80](https://drive.google.com/drive/u/2/folders/1CvZPrtqs_JqxJVG3Fk-2FwUEC5R7NNUU) |
| OUTPUT | Output directory | [family_classification/split_tokenized/full/robustness_split](https://drive.google.com/drive/u/2/folders/1EVWrfF9GVUb_b9MnNjyEHyFz0SuIapcy) |
| PREFIX | Prefix for output files| Finetune_fam_data |
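
One way to picture the percentage subsets is the sketch below. The list of fractions, the choice of taking the first lines rather than a random sample, and the `0.20`-style keys are all assumptions made for illustration; check the shell script for the actual behavior and output names.

```python
def percentage_subsets(lines, fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Map each fraction (formatted like '0.80') to the first share of the lines."""
    return {f"{frac:.2f}": lines[: round(len(lines) * frac)] for frac in fractions}


# Ten made-up training examples.
subsets = percentage_subsets([f"example{i}" for i in range(10)])
for name, part in subsets.items():
    print(name, len(part))
```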

### Preprocess/binarize pretraining data:
```bash
fairseq-preprocess \
	--only-source \
	--trainpref tokenized_seqs_v1.split.train.80 \
	--validpref tokenized_seqs_v1.split.valid.10 \
	--testpref tokenized_seqs_v1.split.test.10 \
	--destdir pretraining/split_binarized \
	--workers 60
```

### Preprocess/binarize family classification finetuning data:
```bash
# Split data into sequence and family files
mkdir -p family_classification/split_tokenized/sequence family_classification/split_tokenized/family
for f in family_classification/split_tokenized/full/Finetune*; do
	cut -f1 -d',' "$f" > "family_classification/split_tokenized/sequence/$(basename "$f").sequence"
	cut -f2 -d',' "$f" > "family_classification/split_tokenized/family/$(basename "$f").family"
done

# Replace all spaces in family names with underscores
for f in family_classification/split_tokenized/family/*.family; do
	sed -i 's/ /_/g' "$f"
done

# Generate family label dictionary file
awk '{print $0,0}' family_classification/split_tokenized/family/*.family | sort | uniq > \
	family_classification/split_tokenized/family/families.txt

# Binarize sequences
fairseq-preprocess \
	--only-source \
	--trainpref family_classification/split_tokenized/sequence/Finetune_fam_data.split.train.80.sequence \
	--validpref family_classification/split_tokenized/sequence/Finetune_fam_data.split.valid.10.sequence \
	--testpref family_classification/split_tokenized/sequence/Finetune_fam_data.split.test.10.sequence \
	--destdir family_classification/split_binarized/input0 \
	--workers 60 \
	--srcdict pretraining/split_binarized/dict.txt

# Binarize labels
fairseq-preprocess \
	--only-source \
	--trainpref family_classification/split_tokenized/family/Finetune_fam_data.split.train.80.family \
	--validpref family_classification/split_tokenized/family/Finetune_fam_data.split.valid.10.family \
	--testpref family_classification/split_tokenized/family/Finetune_fam_data.split.test.10.family \
	--destdir family_classification/split_binarized/label \
	--workers 60 \
	--srcdict family_classification/split_tokenized/family/families.txt
```
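
The label dictionary step above produces a fairseq-style dictionary, one `<symbol> <count>` pair per line, with every count set to 0. A Python sketch of the same transformation is shown here for clarity. The function name and family names are made up, and Python's sort order may differ from locale-aware `sort`.

```python
def build_label_dictionary(family_labels):
    """Return sorted, de-duplicated dictionary lines of the form '<family> 0'."""
    names = {
        label.strip().replace(" ", "_")  # spaces would split the label into two tokens
        for label in family_labels
        if label.strip()
    }
    return [f"{name} 0" for name in sorted(names)]


# Made-up family labels, one per training example.
labels = ["Globin family", "Protein kinase superfamily", "Globin family"]
for line in build_label_dictionary(labels):
    print(line)
```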

### Preprocess/binarize PPI data:
```bash
# Split data into from sequence, to sequence, and label files
mkdir -p ppi_prediction/split_tokenized/from ppi_prediction/split_tokenized/to ppi_prediction/split_tokenized/label
for f in ppi_prediction/split_tokenized/full/Finetune*; do
	cut -f1 -d',' "$f" > "ppi_prediction/split_tokenized/from/$(basename "$f").from"
	cut -f2 -d',' "$f" > "ppi_prediction/split_tokenized/to/$(basename "$f").to"
	# The label is the third column
	cut -f3 -d',' "$f" > "ppi_prediction/split_tokenized/label/$(basename "$f").label"
done

# Binarize sequences
fairseq-preprocess \
	--only-source \
	--trainpref ppi_prediction/split_tokenized/from/Finetune_interact_tokenized.split.train.80.from \
	--validpref ppi_prediction/split_tokenized/from/Finetune_interact_tokenized.split.valid.10.from \
	--testpref ppi_prediction/split_tokenized/from/Finetune_interact_tokenized.split.test.10.from \
	--destdir ppi_prediction/split_binarized/input0 \
	--workers 60 \
	--srcdict pretraining/split_binarized/dict.txt

fairseq-preprocess \
	--only-source \
	--trainpref ppi_prediction/split_tokenized/to/Finetune_interact_tokenized.split.train.80.to \
	--validpref ppi_prediction/split_tokenized/to/Finetune_interact_tokenized.split.valid.10.to \
	--testpref ppi_prediction/split_tokenized/to/Finetune_interact_tokenized.split.test.10.to \
	--destdir ppi_prediction/split_binarized/input1 \
	--workers 60 \
	--srcdict pretraining/split_binarized/dict.txt

# Binarize labels
fairseq-preprocess \
	--only-source \
	--trainpref ppi_prediction/split_tokenized/label/Finetune_interact_tokenized.split.train.80.label \
	--validpref ppi_prediction/split_tokenized/label/Finetune_interact_tokenized.split.valid.10.label \
	--testpref ppi_prediction/split_tokenized/label/Finetune_interact_tokenized.split.test.10.label \
	--destdir ppi_prediction/split_binarized/label \
	--workers 60
```
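
The PPI files have three comma-separated columns per line (tokenized from sequence, tokenized to sequence, label), which end up in `input0`, `input1`, and `label` respectively. The sketch below shows that mapping on a made-up line; the function name and token strings are hypothetical, and it assumes the tokenized sequences contain no commas.

```python
def split_ppi_line(line):
    """Split one PPI example into (from_sequence, to_sequence, label)."""
    from_seq, to_seq, label = line.rstrip("\n").split(",")
    return from_seq, to_seq, label


# Made-up example line.
from_seq, to_seq, label = split_ppi_line("MK TAY IAK,MS LLT EVE,1\n")
print("input0:", from_seq)
print("input1:", to_seq)
print("label: ", label)
```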