# TERL

**Repository Path**: dnastories_dengcao/TERL

## Basic Information

- **Project Name**: TERL
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-11-04
- **Last Updated**: 2024-06-16

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Important

We ask you to cite the main publication related to this software whenever you use any part of this software in any scientific publication.

You may use the following .bibtex to cite the main publication of this software:
```
@article {da Cruz2020.03.25.000935,
	author = {da Cruz, Murilo Horacio Pereira and Domingues, Douglas Silva and Saito, Priscila Tiemi Maeda and Paschoal, Alexandre Rossi and Bugatti, Pedro Henrique},
	title = {TERL: Classification of Transposable Elements by Convolutional Neural Networks},
	elocation-id = {2020.03.25.000935},
	year = {2020},
	doi = {10.1101/2020.03.25.000935},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2020/03/26/2020.03.25.000935},
	eprint = {https://www.biorxiv.org/content/early/2020/03/26/2020.03.25.000935.full.pdf},
	journal = {bioRxiv}
}
```

# Instalation

To install TERL you need to clone the repository into your local machine. First you need to have git installed in your local machine. You can follow [these steps](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) to install git.
Once you have git installed, you can clone this repository with the following command:
```
git clone https://github.com/muriloHoracio/TERL
```

After the clone, you have a directory named TERL which contains all codes to train TERL and classify sequences.

Since TERL is made on Python 3.6 and use some libraries, we recomend the use of virtual environmnets to run it. Before installing virtualenv, make sure you have pip3 installed. To install pip3 run the followoing command:
```
sudo apt-get install python3-pip
```
Update pip3 by running the following command:
```
sudo -H pip3 install --upgrade pip
```
Also, you need to install setup-tools by running the following command:
```
sudo apt-get install python3-setuptools
``` 
To create a virtual environment you need to have virtualenv installed, in order to do that you can run the following command:
```
sudo apt-get install virtualenv
```
To create the virtual environment, you need to execute the following command:
```
virtualenv -p python .venv
```
Once the virtual environment is created, you need to active the environment in order to install the dependencies of TERL. To do this, run the following command:
```
. .venv/bin/activate
```
If everything worked well, you will notice that ``(.venv)`` will appear before the user name in the command line.

Now you must install the dependencies needed to run TERL. 

If you are using GPU, you must run the following command:
```
pip3 install -r requirements-gpu.txt
```
Otherwise, you must run the following command:
```
pip3 install -r requirements.txt
```
If everything went well, you have just installed TERL and are ready to train your on model on your sequences or classify some sequences based on a previously trained model.

# TERL

Transposable Elements Representation Learner

The TERL can be used to classify any genomic sequence.
This framework provides tools to train and test models.
Users can opt to deploy the trained network model.
There are vast parameters that can be used to define the network architecture and to set the model's parameters.
All the set of parameters are described here with examples of usage.

## Dataset Organization
In order to use this framework to train and test CNNs models for genomic data, users need to organize the structure of dataset's files
The files should be stored in the following way:

```
Root
└─── Train
|    └─── Class1.fa
|    └─── Class2.fa
|    └─── Class3.fa
|    └─── Class4.fa
└─── Test
     └─── Class1.fa
     └─── Class2.fa
     └─── Class3.fa
     └─── Class4.fa
```
The filenames on Train and Test folders must be identicals and reflect the class that each file represents.

## Train Usage (terl_train.py)
This is an example how to train a model with TERL. The files are stored over Train and Test folders, which are stored on Dataset folder, sotored in the TERL folder.

```
Dataset
└─── Train
|    └─── LTR.fa
|    └─── LINE.fa
|    └─── SINE.fa
└─── Test
     └─── LTR.fa
     └─── LINE.fa
     └─── SINE.fa
```

This example model have the following architecture:

```
       Architecture: conv    pool    conv    pool    fc      fc      
          Functions: relu    avg     relu    avg     relu    relu    
             Widths: 30      20      30      20      1500    500     
            Strides: 1       20      1       20      
       Feature maps: 64      -       32      -       -       -       
```

Example:
```
python3 terl_train.py -r Dataset -l 6 -a conv pool conv pool fc fc -f relu avg relu avg relu relu -w 30 20 30 20 1500 500 -s 1 20 1 20 -fm 64 32 -sg -sr -sm
```

## Train Parameters
This section describes the parameters with its possible values and examples of usage.
### -r, --root
**Required** parameter that defines the Root folder, where Train and Test folders containing sample sequences files are located.

It can be the relative or absolute path to Root

Example:
```
python3 terl_train.py -r ~/TERL/Datasets/DS1
```

### -l, --layers
Parameter that defines the number of layers that will be created. It checks if the number of defined layers 
in the model is correct. The layers can be defined without this parameter, but it is a good practice to use 
it to guarantee that the model have the correct number of layers.

Default value is 8, which is the number of layers of the default model.

Example:
```
python3 terl_train.py -l 6
```

### -a, --architecture
Parameter that defines the architecture of the model. This defines the types of the layers and its order in the model.

**Input and classification layer should not be included**.

The supported values are:
* conv (Convolution layer)
* pool (Pooling layer)
* fc (Fully connected layer)

Default value is: conv pool conv pool conv pool fc fc

Example:
```
python3 terl_train.py -a conv pool conv pool fc fc
```

### -f, --functions
Parameter that defines the functions of each layer. The functions should be entered according to the --architecture parameter, i.e. the first option should be the function of the first layer defined in --architecture, the second option the function of the second layer and so on...

The available activation functions for convolution and fully connected layers are:
* relu
* tanh
* sigmoid
* leaky_relu
* elu

The available funcions for pooling layers are:
* avg
* max

Default value is: -f relu avg relu avg relu avg relu relu

Example:
```
python3 terl_train.py -f relu avg relu avg relu relu
```

### -w, --widths
Parameter that defines the widths of the filters (convolution and pooling) and the number of neurons for fully connected layers. The values should be entered according to the --architecture parameter, i.e. the first value should be the width of the first layer filter, the second value should be the width of the second layer's filter, and so on...

Default parameter is: -w 30 20 30 20 30 10 1500 500

Example:
```
python3 terl_train.py -w 30 20 30 20 1500 500
```

### -s, --strides
Parameter that defines the strides of the layers of the model. The values should be entered according to the parameter --architecture, i.e. the first value should be the stride of the first layer, the second value the stride of the second layer, and so on...

Default value is: -s 1 20 1 20 1 10

Example:
```
python3 terl_train.py -s 1 20 1 20
```

### -fm, --feature-maps
Parameter that defines the amount of feature maps of each convolution layer. It should be entered **n** values for a network with **n** convolution layers.

Default value is: -fm 64 32 16

Example:
```
python3 terl_train.py -fm 64 32
```

### -o, --optimizer
Parameter that defines the optimizer that will be used to train the model and optimize the values of the learnable parameters (i.e. weights) of the model.

The available optimizers are:
* adam
* adadelta
* adagrad
* ftrl
* rmsprop
* grad_desc

Default value is: -o adam

Example:
```
python3 terl_train.py -o adagrad
```

### -lr, --learning-rate
Parameter that defines the learning rate to be used by the optimizer.

Default value is: -lr 0.001

Example:
```
python3 terl_train.py -lr 0.001
```

### -trb, --train-batch-size
Parameter that defines the train batch size. The train batch is the amount of samples that will be presented to the network for each step during training.

Default value is: 32

Example:
```
python3 terl_train.py -trb 64
```

### -tsb, --test-batch-size
Parameter that defines the test batch size. The test batch is the amount of samples that will be presented to the network for each step during testing.

Default value is: 32

Example:
```
python3 terl_train.py -tsb 64
```

### -e, --epochs
Parameter that defines the number of epochs that training will be executed. In each epoch all training samples are presented to the network during training.

Default value is: 30

Example:
```
python3 terl_train.py -e 100
```

### -d, --dropout
Parameter that defines the dropout rate that is used to drop neurons in each convolution and fully connected layer in the model.

Default value is: 0.5

Example:
```
python3 terl_train.py -d 0.3
```

### -sg, --save-graphs
Parameter that sets confusion matrix and learning curve graphs to be saved. The title of the graphs are defined in the --confusion-matrix-title and --learning-curve-title parameters.

By default, graphs are not saved, meaning you need to set it if you really want to save them.

Example:
```
python3 terl_train.py -sg
```

### -cmt, --confusion-matrix-title
Parameter that defines the title of the confusion matrix graph. The title should not contain the character "-".

Default value is: Confusion Matrix

Example:
```
python3 terl_train.py -cmt Confusion Matrix DS1
```

### -lct, --learning-curve-title
Parameter that defines the title of the learning curve graph. The title should not contain the character "-".

Default value is: Learning Curve

Example:
```
python3 terl_train.py -lct Learning Curve DS1
```

### -p, --prefix
Parameter that defines the prefix name to be used to save files, e.g. graphs, models and reports. The name must be one string, i.e. without spaces.

Default value is: RUN_yyyymmdd_HHMMSS

Where yyyy is the 4 digit current year, mm is the 2 digit current month, dd is the 2 digits current day, hh, mm, and ss is the current hour, minute and second respectively.

Example:
```
python3 terl_train.py -p DS1_Tests
```

### -sm, --save-model
Parameter that sets the model to be saved on the directory defined on --model-export-dir.

By default, the model is not saved. Users who want to save their models must set it with this parameter.

Example:
```
python3 terl_train.py -sm
```

### -md, --model-export-dir
Parameter that defines the folder where the model will be exported. The value should be the relative or absolute path to the desired folder. We suggest the use of folder Models created on the folder TERL.

Default value is: Models/Model_yyyymmdd_HHMMSS

Where yyyy is the 4 digit current year, mm is the 2 digit current month, dd is the 2 digits current day, hh, mm, and ss is the current hour, minute and second respectively.

Example:
```
python3 terl_train.py -md Models/DS1_Model
```

### -sr, --save-report
Parameter that sets the reports to be saved on the folder Outputs that is located in the TERL folder.

By default, reports are not saved. Users who want to save it must set it with this parameter.

Example:
```
python3 terl_train.py -sr
```

### -sm, --save-model
Parameter that sets the model to be saved on the directory defined on --model-export-dir.

By default, the model is not saved. Users who want to save their models must set it with this parameter.

Example:
```
python3 terl_train.py -sm
```

### -nv, --no-verbose
Parameter that disables the verbose mode, which provides useful information to the user.

The verbose mode shows the following information:
* OPTIONS (all parameters used)
* FILES (training and testing file)
* CLASSIFICATION INFO (classes, train and test size, longest sequence and vocabulary size)
* Accuracy micro, macro and simple after each epoch
* REPORT (confusion matrix and classification metrics)
* TIME (train and test times)

By default, verbose is on. Users who want to disable it must set it with this parameter.

Example:
```
python3 terl_train.py -nv
```
## Test/Classification Usage (terl_test.py)
This is an example how to test TERL or classify files. You must inform a trained and saved model to perform this operation.

Example:
```
python3 terl_test.py -m Models/TERLModel -f file1.fa file2.fa file3.fa 
```

After classification is done, three files with prefix ``TERL_YYYYmmdd_HHMMSS_`` will be created containing the results of the classification. TERL copies the sequences and changes the header according to the predicted class.

## Test/Classification Parameters
This section describes the parameters with its possible values and examples of usage.

### -m, --model
**Required** parameter that defines the model to be used for classification.

Example:
```
python3 terl_test.py -m Models/TERLModel
```

### -f, --files
Parameter that defines the FASTA files to be classified. After classifying the files, output files are created with a prefix name containing the sequences in the original file and the headers with the predicted classes.

Default value is TERL_YYYYmmdd_HHMMSS_ where YYYY, mm, dd, HH, MM and SS means the current year, month, day, hour, minutes and seconds.

Example:
```
python3 terl_test.py -m Models/TERLModel -f file1.fa file2.fa file3.fa
```

### -b, --batch
Parameter that defines the batch size that will be used to load sequences and classify them.

Default value is 32

Example:
```
python3 terl_test.py -m Models/TERLModel -f file1.fa file2.fa file3.fa -b 64
```

### -p, --prefix
Parameter that defines the prefix to be used when writing the output files.

Default value is TERL_YYYYmmdd_HHMMSS_

Example:
```
python3 terl_test.py -m Models/TERLModel -f file1.fa file2.fa file3.fa -p TERL_exp1_
```

Which will results in the following output files:
```
TERL_exp1_file1.fa
TERL_exp1_file2.fa
TERL_exp1_file3.fa
```

### -q --quiet
Parameter that deactivates verbose mode, which prints a lot of useful information. 

Default value is False, which prints useful information in the terminal screen

Example:
```
python3 terl_test.py -m Models/TERLModel -f file1.fa file2.fa file3.fa -q
```

The above command will log only Tensorflow's logs