# Multi-BioNER

**Repository Path**: a7087485/Multi-BioNER

## Basic Information

- **Project Name**: Multi-BioNER
- **Description**: Cross-type Biomedical Named Entity Recognition with Deep Multi-task Learning (Bioinformatics'19)
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2019-12-29
- **Last Updated**: 2020-12-19

## README

# Cross-type Biomedical Named Entity Recognition with Deep Multi-task Learning

This project provides a neural network based multi-task learning framework for biomedical named entity recognition (BioNER).

The implementation is based on the PyTorch library. Our model collectively trains different biomedical entity types to build a unified model that benefits the training of each single entity type and achieves a significantly better performance compared with the state-of-the-art BioNER systems.

## Links

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Data](#data)
- [Usage](#usage)
- [Benchmarks](#benchmarks)
- [Prediction](#prediction)

## Installation

A GPU is strongly recommended for training. Training on a CPU is supported but can be extremely slow.

#### PyTorch

The code is based on PyTorch. You can find installation instructions [here](http://pytorch.org/). 

#### Dependencies

The code is written in Python 3.6. Its dependencies are summarized in the file ```requirements.txt```. You can install these dependencies like this:
```
pip3 install -r requirements.txt
```

## Quick Start

To reproduce the results in our [paper](https://arxiv.org/abs/1801.09851), first download the corpora and the embedding file **[here](https://drive.google.com/file/d/1JHQJ9DKaEeSGZdA0Nmz9KCdtjUoJKXCb/view?usp=sharing)**, unzip the folder ```data_bioner_5/```, and put it under the main folder ```./```. Then run the model with the following script:
```
./run_lm-lstm-crf5.sh
```

## Data

We use five biomedical corpora collected by Crichton et al. for biomedical NER. The dataset is publicly available and can be downloaded from [here](https://github.com/cambridgeltl/MTL-Bioinformatics-2016). The details of each dataset are listed below:

|Dataset | Entity Type | Dataset Size | 
| ------------- |-------------| -----|
| BC2GM | Gene/Protein | 20,000 sentences |
| BC4CHEMD | Chemical | 10,000 abstracts |
| BC5CDR | Chemical, Disease | 1,500 articles |
| NCBI-disease | Disease | 793 abstracts |
| JNLPBA | Gene/Protein, DNA, Cell Type, Cell Line, RNA | 2,404 abstracts |

#### Note
**In our paper, we merge the original training set and development set to form the new training set, as many teams did in the challenge. Some previous work (e.g., [Luo et al., Bioinformatics 2017](https://github.com/lingluodlut/Att-ChemdNER), [Lu et al., Journal of Cheminformatics 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331694/pdf/1758-2946-7-S1-S4.pdf) and [Leaman and Lu, Bioinformatics 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5018376/pdf/btw343.pdf)) also preprocessed the data in this way. To reproduce our results, please preprocess the data in the same way.**
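
The merge described in this note can be sketched as below. This is a minimal sketch, assuming both files are in the CoNLL-style format; `merge_conll_files` is a hypothetical helper and is not part of this repository.

```python
from pathlib import Path


def merge_conll_files(input_paths, output_path):
    """Concatenate CoNLL-style files, keeping a blank line between them.

    Hypothetical helper for illustration; not part of this repository.
    """
    chunks = []
    for path in input_paths:
        # Drop leading/trailing newlines so files join with exactly one blank line.
        text = Path(path).read_text(encoding="utf-8").strip("\n")
        if text:
            chunks.append(text)
    # A blank line separates sentences, so it also separates the merged files.
    Path(output_path).write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
```

For example, `merge_conll_files(["train.tsv", "devel.tsv"], "train_devel.tsv")` would write the combined training set, where the three file names are placeholders for your own paths.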

#### Format

Users may want to use other datasets. We assume the corpus is formatted in the same way as the CoNLL 2003 NER dataset.

More specifically, **empty lines** are used as separators between sentences, and the separator between documents is a special line as below.
```
-DOCSTART- -X- -X- -X- O
```
Other lines contain words, labels and other fields. **Word** must be the **first** field, and **label** must be the **last**. For example,
```
-DOCSTART- -X- -X- -X- O

Selegiline	S-Chemical
-	O
induced	O
postural	B-Disease
hypotension	E-Disease
in	O
Parkinson	B-Disease
'	I-Disease
s	I-Disease
disease	E-Disease
:	O
a	O
longitudinal	O
study	O
on	O
the	O
effects	O
of	O
drug	O
withdrawal	O
.	O
```
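
To check that a custom corpus follows this layout, a small reader can help. The following is a minimal sketch, assuming whitespace-separated fields; `read_conll` is a hypothetical helper for illustration, not the loader used by this repository.

```python
def read_conll(lines):
    """Yield (words, labels) for each sentence in CoNLL-style lines.

    Hypothetical helper illustrating the format; not the repository's loader.
    """
    words, labels = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # Blank lines and document separators both end the current sentence.
            if words:
                yield words, labels
                words, labels = [], []
            continue
        fields = line.split()
        words.append(fields[0])    # the word is the first field
        labels.append(fields[-1])  # the label is the last field
    if words:
        yield words, labels


sample = """-DOCSTART- -X- -X- -X- O

Selegiline\tS-Chemical
-\tO
induced\tO
postural\tB-Disease
hypotension\tE-Disease
""".splitlines()

for words, labels in read_conll(sample):
    print(list(zip(words, labels)))
```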

#### Embedding
We initialize the word embedding matrix with pre-trained word vectors from Pyysalo et al., 2013. These word vectors are
trained using the skip-gram model on the PubMed abstracts together with all the full-text articles
from PubMed Central (PMC) and a Wikipedia dump. You can download the embedding files [here](http://evexdb.org/pmresources/vec-space-models/).

## Usage

```train_wc.py``` is the training script for our multi-task LSTM-CRF model.
Its usage can be displayed with
```
python3 train_wc.py -h
```

The default running commands are:
```
python3 train_wc.py --train_file [training file 1] [training file 2] ... [training file N] \
                    --dev_file [developing file 1] [developing file 2] ... [developing file N] \
                    --test_file [testing file 1] [testing file 2] ... [testing file N] \
                    --caseless --fine_tune --emb_file [embedding file] --shrink_embedding --word_dim 200
```

Users may incorporate an arbitrary number of corpora into the training process. In each epoch, our model randomly selects one dataset _i_. We use training set _i_ to learn the parameters and development set _i_ to evaluate the performance. If the current model achieves its best performance so far for dataset _i_ on the development set, we then calculate the precision, recall and F1 on testing set _i_.
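
The schedule described above can be sketched as follows. This is only an illustration under stated assumptions: `train_multitask` and the callbacks `train_fn`, `dev_fn` and `test_fn` are hypothetical names, and the actual logic in ```train_wc.py``` may differ in detail.

```python
import random


def train_multitask(num_datasets, num_epochs, train_fn, dev_fn, test_fn, seed=0):
    """Sketch of per-epoch dataset sampling with best-on-dev test reporting.

    train_fn, dev_fn and test_fn are hypothetical callbacks that take a
    dataset index; dev_fn returns a score where higher is better.
    """
    rng = random.Random(seed)
    best_dev = [float("-inf")] * num_datasets
    test_results = [None] * num_datasets
    for _ in range(num_epochs):
        i = rng.randrange(num_datasets)   # randomly select one dataset
        train_fn(i)                       # learn parameters on training set i
        score = dev_fn(i)                 # evaluate on development set i
        if score > best_dev[i]:
            best_dev[i] = score
            test_results[i] = test_fn(i)  # test only when dev score improves
    return best_dev, test_results
```

Tracking the best development score separately for each dataset means that an improvement on one corpus never triggers a test evaluation on another.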

## Benchmarks

Here we compare our model with recent state-of-the-art models on the five datasets mentioned above. We use F1 score as the evaluation metric.

|Model | [BC2GM](https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/BC2GM-IOBES) | [BC4CHEMD](https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/BC4CHEMD-IOBES) | [BC5CDR](https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/BC5CDR-IOBES) | [NCBI-disease](https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/NCBI-disease-IOBES) | [JNLPBA](https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/JNLPBA-IOBES) |
| ------------- |-------------| -----| -----| -----| ---- |
| Dataset Benchmark | - | 88.06 | 86.76 | 82.90 | 72.55 |
| [Crichton et al. 2016](https://github.com/cambridgeltl/MTL-Bioinformatics-2016) | 73.17 | 83.02 | 83.90 | 80.37 | 70.09 |
| [Lample et al. 2016](https://github.com/glample/tagger) | 80.51 | 87.74 | 86.92 | 85.80 | 73.48 |
| [Ma and Hovy 2016](https://github.com/XuezheMax/LasagneNLP) | 78.48 | 86.84 | 86.65 | 82.62 | 72.68 |
| [Liu et al. 2018](https://github.com/LiyuanLucasLiu/LM-LSTM-CRF) | 80.00 | 88.75 | 86.96 | 83.92 | 72.17 |
| Our Model | **80.74** | **89.37** | **88.78** | **86.14** | **73.52** |
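
For readers who want to compute a comparable score on their own predictions, the sketch below derives precision, recall and F1 from IOBES tag sequences. It assumes exact-match, entity-level scoring, which is a common convention for NER; `iobes_spans` and `precision_recall_f1` are hypothetical helpers, and the evaluation code in this repository may differ in detail.

```python
def iobes_spans(tags):
    """Return the set of (start, end, type) entity spans encoded by IOBES tags."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            spans.add((i, i, etype))          # single-token entity
            start, current = None, None
        elif prefix == "B":
            start, current = i, etype         # open a new entity
        elif prefix == "E" and start is not None and etype == current:
            spans.add((start, i, etype))      # close the open entity
            start, current = None, None
        elif prefix == "I" and start is not None and etype == current:
            continue                          # still inside the open entity
        else:
            start, current = None, None       # "O" or an ill-formed sequence
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    """Score predicted tags against gold tags by exact span match."""
    gold, pred = iobes_spans(gold_tags), iobes_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```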


## Prediction
```train_wc.py``` can output the annotation results directly during training when the ```--output_annotation``` option is set, for example:
```
python3 train_wc.py --train_file [training file 1] [training file 2] ... [training file N] \
                    --dev_file [developing file 1] [developing file 2] ... [developing file N] \
                    --test_file [testing file 1] [testing file 2] ... [testing file N] \
                    --caseless --fine_tune --emb_file [embedding file] --shrink_embedding --output_annotation --word_dim 200 --gpu 0
```

If ```--output_annotation``` is not used, the best-performing model found during training is saved in ```./checkpoint/```.

#### Pre-trained Model
**We have released our pre-trained model. You can download the [Arg](https://drive.google.com/file/d/1CxW75H1NwnUCfnBVWQFdZD9TNbuayUAQ/view?usp=sharing) file and the [Model](https://drive.google.com/file/d/1aBoIUDzU6_DcB0c1Y1t0AoKmcVik0YO1/view?usp=sharing) file and put them in ```./checkpoint/```.**

Using the saved model, ```seq_wc.py``` can be applied to annotate raw text. Its usage can be displayed with
```
python3 seq_wc.py -h
```
and a running command example is provided below:
```
python3 seq_wc.py --load_arg checkpoint/cwlm_lstm_crf.json --load_check_point checkpoint/cwlm_lstm_crf.model --input_file test.tsv --output_file annotate/output --gpu 0
```
The annotation results will be in ```./annotate/```.

The input format is similar to CoNLL, but each line must contain exactly one field: the token. For example, an input file could be:
```
The
severe
anemia
(
hemoglobin
1
.
2
g
/
dl
)
appeared
to
be
the
primary
etiologic
factor
.
```
and the corresponding output is:
```
The O
severe O
anemia O
( O
hemoglobin B-GENE
1 I-GENE
. I-GENE
2 I-GENE
g I-GENE
/ I-GENE
dl E-GENE
) O
appeared O
to O
be O
the O
primary O
etiologic O
factor O
. O
```
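
To prepare such an input file from a plain sentence, the text first has to be split into one token per line. The following is a minimal sketch; `to_token_lines` is a hypothetical helper, and its regular expression is an assumption chosen to reproduce the token boundaries of the example above. It is not necessarily the tokenizer used to build the training corpora.

```python
import re


def to_token_lines(text):
    """Split raw text into one token per line.

    The pattern keeps runs of letters, digits and underscores together and
    isolates every other non-space character. This is an assumption made
    for illustration.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return "\n".join(tokens) + "\n"


sentence = (
    "The severe anemia (hemoglobin 1.2 g/dl) appeared to be "
    "the primary etiologic factor."
)
print(to_token_lines(sentence), end="")
```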

## Citation
If you find the implementation useful, please cite the following paper:
```
@article{wang2018cross,
  title={Cross-type biomedical named entity recognition with deep multi-task learning},
  author={Wang, Xuan and Zhang, Yu and Ren, Xiang and Zhang, Yuhao and Zitnik, Marinka and Shang, Jingbo and Langlotz, Curtis and Han, Jiawei},
  journal={Bioinformatics},
  volume={35},
  number={10},
  pages={1745--1752},
  year={2019},
  publisher={Oxford University Press}
}
```