# TALE

**Repository Path**: altriasjy31/TALE

## Basic Information

- **Project Name**: TALE
- **Description**: Imported from GitHub for efficient downloading: https://github.com/Shen-Lab/TALE.git
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-07-26
- **Last Updated**: 2022-05-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# TALE
Transformer-based protein function Annotation with joint sequence-Label Embedding


![TALE Architecture](/ProteinFuncPred.png)


TALE uses a joint feature-label embedding:

* Input feature: sequence data (using a transformer)
* Output label: hierarchical nodes on directed graphs

## Dependencies
* TensorFlow >=1.13
* For TALE+ (TALE+Diamond), please download [Diamond](http://www.diamondsearch.org/index.php) and put the executable file into TALE/diamond/


## For users
### Prediction with TALE+:
To use TALE+ for prediction, prepare your sequence file in FASTA format, go to src/, and run:

`python predict.py --input_seq $path_to_your_fasta_file --ontology on --outputpath $path_to_your_output_file`

where `on` is `mf`, `bp`, or `cc` for MFO, BPO, and CCO, respectively.
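
To run the same FASTA file against more than one ontology, the command above can be assembled programmatically. The following is a minimal sketch: the helper name `build_predict_command` and the file names in the usage note are hypothetical, and only the three flags shown in the command above are used.

```python
# Ontology codes accepted by --ontology, as listed above.
ONTOLOGIES = {"mf": "MFO", "bp": "BPO", "cc": "CCO"}


def build_predict_command(fasta_path, ontology, output_path):
    """Return the argument list for one predict.py run (hypothetical helper)."""
    if ontology not in ONTOLOGIES:
        raise ValueError(
            f"ontology must be one of {sorted(ONTOLOGIES)}, got {ontology!r}"
        )
    return [
        "python", "predict.py",
        "--input_seq", fasta_path,
        "--ontology", ontology,
        "--outputpath", output_path,
    ]


# One command per ontology; the file names are placeholders.
commands = [
    build_predict_command("my_proteins.fasta", code, f"predictions_{code}.txt")
    for code in ONTOLOGIES
]
```

Each list can then be passed to `subprocess.run(cmd, cwd="src", check=True)`, assuming the working directory is the repository root.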

## For developers
### Training and test data:
* Under 'Data/CAFA3' and 'Data/ours'
* train_seq_mf: The training sequence file for MFO 
* train_label_mf: The training label file for MFO
* test_seq_mf: The test sequence file for MFO
* test_label_mf: The test label file for MFO
* mf_go_1.pickle: The ontology file for MFO

### Data formats:
#### Sequence
The sequence file is a list, where each element is a dictionary with the following keys:
* 'ID': The ID of the sequence in Swiss-Prot
* 'ac': The accession number of the sequence in Swiss-Prot
* 'date': The release date of the sequence in Swiss-Prot
* 'seq': The amino acid sequence
* 'GO':  The GO annotations of the sequence
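
As an illustration of this layout, the sketch below builds one toy record and round-trips it through `pickle`. It assumes the sequence files are pickled Python objects (suggested by the `.pickle` extension of the ontology file, but not stated here) and that 'GO' holds a list of GO term IDs; all field values are invented placeholders.

```python
import io
import pickle

# One toy record with the five documented keys; every value is a placeholder.
record = {
    "ID": "EXAMPLE_HUMAN",
    "ac": "P00000",
    "date": "2021-01-01",   # the date format is an assumption
    "seq": "MKTAYIAKQR",
    "GO": ["GO:0030234"],   # assumed to be a list of GO term IDs
}
sequences = [record]  # the sequence file is a list of such records

# Round-trip through an in-memory buffer instead of touching real data files.
buffer = io.BytesIO()
pickle.dump(sequences, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)

# Typical access pattern: iterate over records and read individual fields.
lengths = [len(entry["seq"]) for entry in loaded]
```

For a real file, the buffer would be replaced by a handle opened with `open(path, "rb")`.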
#### Label 
* The label file is a list, where each element is a list containing the indexes of labels (GO terms).
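
A common way to consume such index lists is to expand each one into a multi-hot vector over all GO terms. The sketch below shows that conversion in plain Python; the function name and the toy sizes are hypothetical, and whether TALE performs this step in the same way is not stated here.

```python
def to_multi_hot(label_indices, num_terms):
    """Expand a list of GO-term indexes into a 0/1 vector of length num_terms."""
    vector = [0] * num_terms
    for index in label_indices:
        if not 0 <= index < num_terms:
            raise IndexError(f"label index {index} outside [0, {num_terms})")
        vector[index] = 1
    return vector


# Toy label file: two proteins annotated against an ontology of 5 terms.
labels = [[0, 3], [1]]
matrix = [to_multi_hot(row, num_terms=5) for row in labels]
```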
#### Ontology 
The ontology file is a dictionary, where each key is a GO term (e.g. 'GO:0030234') in the ontology. Each value is also a dictionary containing the information for that key:
* 'name': The name of the GO term
* 'ind':  The index of this GO term
* 'father': The parent GO terms
* 'child': The child GO terms
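
Because each entry records its parents, the full set of ancestors of a term can be collected by following 'father' links upward. The sketch below does this on a toy three-term ontology; it assumes 'father' and 'child' hold lists of GO term IDs, and the term IDs, names, indexes, and the helper `ancestors` are illustrative only.

```python
def ancestors(term, ontology):
    """Return every term reachable from `term` by following 'father' links."""
    seen = set()
    stack = list(ontology[term]["father"])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(ontology[parent]["father"])
    return seen


# Toy ontology in the documented layout; IDs, names, and indexes are placeholders.
ontology = {
    "GO:0000001": {"name": "root term", "ind": 0,
                   "father": [], "child": ["GO:0000002"]},
    "GO:0000002": {"name": "middle term", "ind": 1,
                   "father": ["GO:0000001"], "child": ["GO:0000003"]},
    "GO:0000003": {"name": "leaf term", "ind": 2,
                   "father": ["GO:0000002"], "child": []},
}

leaf_ancestors = ancestors("GO:0000003", ontology)
ancestor_indexes = sorted(ontology[t]["ind"] for t in leaf_ancestors)
```

The `seen` set keeps the traversal correct when a term has several parents that share an ancestor, which is possible in a directed graph such as the one described above.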

### Training:
In order to train the model, under src/, run:

`python train.py --batch_size 32 --epochs 100 --lr 1e-3 --save_path ./log/ --ontology mf --data_path ../data/ --regular_lambda 0`

The example above trains a model on the MFO ontology with a batch size of 32, 100 epochs, a learning rate of 1e-3, and a lambda value (`--regular_lambda`) of 0. It reads the training data from '../data/Gene_Ontology/EXP_Swiss_Prot/' and saves the trained model in './log/'.

### Trained models:
The trained models are in 'trained_models/'. For example, Our_modelk_MFO* is the kth best model on MFO trained on our dataset, and CAFA3_modelk_MFO* is the kth best model on MFO trained on the CAFA3 dataset.


## Citation
```
@article{10.1093/bioinformatics/btab198,
    author = {Cao, Yue and Shen, Yang},
    title = "{TALE: Transformer-based protein function Annotation with joint sequence–Label Embedding}",
    journal = {Bioinformatics},
    year = {2021},
    month = {03},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btab198},
    url = {https://doi.org/10.1093/bioinformatics/btab198},
    note = {btab198},
    eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btab198/36671287/btab198.pdf},
}
```


## Contact:
Yang Shen: yshen@tamu.edu

Yue Cao:  cyppsp@tamu.edu