# ECNet

**Repository Path**: yang-benben/ECNet

## Basic Information

- **Project Name**: ECNet
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: BSD-3-Clause
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-04-11
- **Last Updated**: 2022-04-11

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ECNet

An evolutionary context-integrated deep learning framework for protein engineering

- [ECNet](#ecnet)
  - [Overview](#overview)
  - [Installation](#installation)
  - [Dependencies](#dependencies)
  - [Quick Example](#quick-example)
  - [Running on your own data](#running-on-your-own-data)
  - [Generate local features using HHblits and CCMPred](#generate-local-features-using-hhblits-and-ccmpred)
  - [Train on dataset A and test on dataset B](#train-on-dataset-a-and-test-on-dataset-b)
  - [Citation](#citation)
  - [Contact](#contact)

## Overview

ECNet (evolutionary context-integrated neural network) is a deep learning model that guides protein engineering by predicting protein fitness from the sequence. It integrates the local evolutionary context from homologous sequences, which explicitly models residue-residue epistasis for the protein of interest, with the global evolutionary context, which encodes rich semantic and structural features from the enormous protein sequence universe. Please see our *Nature Communications* [paper](https://doi.org/10.1038/s41467-021-25976-8) for details.

![ECNet](doc/overview.png)

## Installation

Clone the GitHub repository and add its directory to the Python path:

```bash
git clone https://github.com/luoyunan/ECNet.git
cd ECNet
export PYTHONPATH=$PWD:$PYTHONPATH
```

## Dependencies

This package is tested with `Python 3.7` and `CUDA 10.1` on `Ubuntu 18.04`, with access to an Nvidia GeForce TITAN X GPU (12 GB RAM) and an Intel Xeon E5-2650 v3 CPU (2.30 GHz, 512 GB RAM).
Please see `requirements.txt` for the necessary Python dependencies, all of which can be installed with `pip` or `conda`. Because of an issue installing `pytorch 1.4.0` with `pip`, please install `pytorch` with `conda` first:

```bash
conda install pytorch==1.4.0 cudatoolkit=10.1 -c pytorch
pip install -r requirements.txt
```

## Quick Example

1. Download the example data (~102MB) from Dropbox.

   ```
   wget https://www.dropbox.com/s/nkgubuwfwiyy0ze/data.tar.gz
   tar xf data.tar.gz
   ```

2. Run the example script. The following script trains an ECNet model using the fitness data of DNA methylase HaeIII ([source](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004421)). The script randomly splits the data into 70% training, 10% validation, and 20% test sets.

   ```bash
   CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
       --train data/MTH3_HAEAESTABILIZED_Tawfik2015.tsv \
       --fasta data/MTH3_HAEAESTABILIZED_Tawfik2015.fasta \
       --local_feature data/MTH3_HAEAESTABILIZED_Tawfik2015.braw \
       --output_dir ./output \
       --save_prediction \
       --save_checkpoint
   ```

   This example typically takes no more than 15 minutes to run in our tested environment. The output (printed to stdout) is the correlation between the predicted and ground-truth fitness values.

## Running on your own data

ECNet has two required input files: 1) a FASTA file of the wild-type sequence, and 2) a TSV file that describes the fitness values of variants. Optional input files include the CCMpred output used to extract local features and a separate test TSV file.

1. **Sequence FASTA file** (`--fasta`, required). A regular FASTA file of the wild-type sequence. This file should contain only one sequence.
2. **Fitness TSV file** (`--train`, required). Each line has two tab-separated columns, `mutation` and `score`, describing the fitness value of a variant.
   The `mutation` column is a string in the format `[ref][pos][alt]`, e.g., `S100T`, meaning that the 100th amino acid (1-indexed) is mutated from `S` to `T`. If a variant has multiple mutations, `;` is used to concatenate them. The `score` column is a numerical value that quantifies the variant's fitness. Example:

   ```
   mutation	score
   M1S	1.0
   F12I;L30K	2.0
   G89A	0.06
   ```

   Note: This file is supplied using the `--train` argument. If no separate test data is provided through the `--test` argument, this TSV file is split into three sets (train, valid, and test) using the ratio specified by `--split_ratio` (three float numbers). If a separate test TSV file is provided, this TSV file is split into two sets (train and valid) as specified by `--split_ratio` (two float numbers).
3. **Local features** (`--local_feature`, optional). A binary file generated by CCMpred using the `-b` option. To use the `-b` option, you need to install CCMpred from its latest GitHub branch instead of the release; you may also need to install `libmsgpack-dev`. See the instructions [below](#generate-local-features-using-hhblits-and-ccmpred) for generating this file with HHblits and CCMpred. ECNet extracts local features from this file. If the file is not provided, add the `--no_local_feature` flag when running `run_example.py` (or, equivalently, set `use_local_features=False` for the `ECNet` class) and ECNet will not use local features.
4. **Additional test TSV file** (`--test`, optional). This file has the same format as the `--train` TSV file.

We suggest that users tune the hyperparameters for each new protein. Several hyperparameters are exposed as arguments, e.g., `d_embed`, `d_model`, `d_h`, and `n_layers`.

## Generate local features using HHblits and CCMPred

1. Install [HHsuite](https://github.com/soedinglab/hh-suite) and [CCMpred](https://github.com/soedinglab/CCMpred) following their instructions.
   Note that CCMpred should be installed from the latest branch instead of the release; otherwise the `-b` option is not available. Also, because CCMpred uses `msgpack` to create the binary file, you may need to install `libmsgpack-dev` on your system if it is not available. For example, on Ubuntu, you can run `sudo apt update` and then `sudo apt install libmsgpack-dev`.
2. Prepare a FASTA file `example.fasta` containing the wild-type sequence of the protein of interest.
3. Search for homologous sequences of the wild-type sequence using `hhblits` in HHsuite. (There are multiple ways to search for homologous sequences and format the alignment. Below we describe one that uses hhblits. Other ways are also feasible, e.g., using jackhmmer as described in the [DeepSequence](https://www.nature.com/articles/s41592-018-0138-4) paper.)

   ```bash
   hhblits -i example.fasta \
           -d ${path_to_hhblits_database} \
           -o example.hhr \
           -oa3m example.a3m \
           -n 3 \
           -id 99 \
           -cov 50 \
           -cpu 8
   ```

4. Reformat the a3m output of hhblits to PSICOV format (solution modified from [here](https://github.com/soedinglab/bbcontacts/blob/master/TUTORIAL.md#step-13-reformat-the-output-alignment)). To run CCMpred, the alignment must be in the "PSICOV" format that CCMpred uses. First use the `reformat.pl` script from the `hh-suite/scripts` directory to get an alignment in FASTA format, and then use `convert_alignment.py` from the `CCMpred/scripts` directory to get the PSICOV format:

   ```bash
   ${path_to_hh_suite}/scripts/reformat.pl example.a3m example.fas -r
   python ${path_to_CCMpred}/scripts/convert_alignment.py example.fas fasta example.psc
   ```

5. Run CCMpred.

   ```bash
   ccmpred example.psc example.mat -b example.braw -d 0
   ```

6. Use the argument `--local_feature example.braw` to provide the local features to ECNet.
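
Steps 3 to 5 can also be chained from Python. The following is a minimal sketch and is not part of ECNet: the function names `local_feature_commands` and `run_local_feature_pipeline` and their path arguments are hypothetical, while the command-line flags are copied from the commands in the steps above. It assumes that `hhblits`, `ccmpred`, and `python` are on your `PATH` and that `reformat.pl` is executable.

```python
import subprocess
from pathlib import Path


def local_feature_commands(fasta, hhblits_db, hh_suite_dir, ccmpred_dir, cpu=8):
    """Build the argv lists for steps 3-5 (hypothetical helper, not an ECNet API)."""
    stem = Path(fasta).with_suffix("")  # example.fasta -> example
    a3m, fas, psc = f"{stem}.a3m", f"{stem}.fas", f"{stem}.psc"
    return [
        # Step 3: search for homologous sequences with hhblits.
        ["hhblits", "-i", str(fasta), "-d", str(hhblits_db),
         "-o", f"{stem}.hhr", "-oa3m", a3m,
         "-n", "3", "-id", "99", "-cov", "50", "-cpu", str(cpu)],
        # Step 4: convert a3m -> FASTA alignment -> PSICOV format.
        [str(Path(hh_suite_dir) / "scripts" / "reformat.pl"), a3m, fas, "-r"],
        ["python", str(Path(ccmpred_dir) / "scripts" / "convert_alignment.py"),
         fas, "fasta", psc],
        # Step 5: run CCMpred and write the binary .braw file.
        ["ccmpred", psc, f"{stem}.mat", "-b", f"{stem}.braw", "-d", "0"],
    ]


def run_local_feature_pipeline(fasta, hhblits_db, hh_suite_dir, ccmpred_dir, cpu=8):
    """Run each step in order, stopping at the first failure; return the .braw path."""
    for cmd in local_feature_commands(fasta, hhblits_db, hh_suite_dir, ccmpred_dir, cpu):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
    return f"{Path(fasta).with_suffix('')}.braw"
```

Calling `run_local_feature_pipeline("example.fasta", db, hh_dir, ccmpred_dir)` with your own database and installation paths should leave `example.braw` next to the FASTA file, ready to pass to `--local_feature`. Building the commands separately from running them makes it easy to print and inspect them before launching a long search.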

## Train on dataset A and test on dataset B

The following examples show how to train ECNet on dataset A (passed via `--train`) and test it on another dataset B (passed via `--test`).

- Example 1: train on single-mutant fitness data of RRM ([source](https://rnajournal.cshlp.org/content/19/11/1537.long)) and predict for double mutants.

  ```
  CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
      --train data/RRM_single.tsv \
      --test data/RRM_double.tsv \
      --fasta data/RRM.fasta \
      --local_feature data/RRM.braw \
      --output_dir ./output/RRM \
      --save_checkpoint \
      --n_ensembles 2 \
      --epochs 100
  ```

- Example 2: load the trained model using the `--saved_model_dir` argument and predict for the test dataset.

  ```
  CUDA_VISIBLE_DEVICES=0 python scripts/run_example.py \
      --test data/RRM_double.tsv \
      --fasta data/RRM.fasta \
      --local_feature data/RRM.braw \
      --n_ensembles 2 \
      --output_dir ./output/RRM \
      --saved_model_dir ./output/RRM
  ```

## Citation

> Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. *Nat Commun* **12**, 5743 (2021). https://doi.org/10.1038/s41467-021-25976-8

```
@article{luo2021ecnet,
  doi = {10.1038/s41467-021-25976-8},
  url = {https://doi.org/10.1038/s41467-021-25976-8},
  year = {2021},
  month = sep,
  publisher = {Springer Science and Business Media {LLC}},
  volume = {12},
  number = {1},
  author = {Yunan Luo and Guangde Jiang and Tianhao Yu and Yang Liu and Lam Vo and Hantian Ding and Yufeng Su and Wesley Wei Qian and Huimin Zhao and Jian Peng},
  title = {{ECNet} is an evolutionary context-integrated deep learning framework for protein engineering},
  journal = {Nature Communications}
}
```

## Contact

Please submit GitHub issues or contact Yunan Luo (luoyunan[at]gmail[dot]com) with any questions related to the source code.