# bert_document_classification

**Repository Path**: panxuefeng235/bert_document_classification

## Basic Information

- **Project Name**: bert_document_classification
- **Description**: long text
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-12-08
- **Last Updated**: 2021-12-22

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# :book: BERT Long Document Classification :book:
An easy-to-use interface to fully trained BERT-based models for multi-class and multi-label long document classification.

Pre-trained models are currently available for two clinical note (EHR) phenotyping tasks: smoker identification and obesity detection.

To sustain future development and improvements, we use [pytorch-transformers](https://github.com/huggingface/pytorch-transformers)
for all language model components of our architectures. Additionally, there is a [blog post](http://andriymulyar.com/blog/bert-document-classification) describing the idea behind the architecture.

*This repository contains an updated implementation that corrects an error found in the original version of the preprint.*

# Installation

Install with pip:

```
pip install bert_document_classification
```

or directly:

```
pip install git+https://github.com/AndriyMulyar/bert_document_classification
```

# Use
The models map text documents of arbitrary length to binary vectors indicating labels.
```python
from bert_document_classification.models import SmokerPhenotypingBert
from bert_document_classification.models import ObesityPhenotypingBert

smoking_classifier = SmokerPhenotypingBert(device='cuda', batch_size=10)  # defaults to GPU prediction

obesity_classifier = ObesityPhenotypingBert(device='cpu', batch_size=10)  # or CPU if you would like

smoking_classifier.predict(["I'm a document! Make me long and the model can still perform well!"])
```
More [examples](/examples).
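
Since predictions are binary vectors indicating labels, it can help to map each vector back to readable label names. The following is a minimal sketch, assuming each prediction can be treated as a flat sequence of 0/1 values (if `predict` returns tensors, converting them with `.tolist()` first may be needed). The helper `decode_labels` and the label names are hypothetical and not part of this package.

```python
from typing import List, Sequence


def decode_labels(
    vector: Sequence[float],
    label_names: Sequence[str],
    threshold: float = 0.5,
) -> List[str]:
    """Return the label names whose entry in the vector is at or above threshold."""
    if len(vector) != len(label_names):
        raise ValueError("vector and label_names must have the same length")
    return [name for name, value in zip(label_names, vector) if value >= threshold]


# Hypothetical label names and prediction vectors, for illustration only.
label_names = ["LABEL_A", "LABEL_B", "LABEL_C"]
predictions = [[0, 1, 0], [1, 0, 1]]

decoded = [decode_labels(row, label_names) for row in predictions]
print(decoded)  # [['LABEL_B'], ['LABEL_A', 'LABEL_C']]
```

The actual label order depends on the model, so check it against the model's documentation or examples before relying on a mapping like this.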


# Replication
See the directory [/examples/ml4health_2019_replication](/examples/ml4health_2019_replication). Its [README](/examples/ml4health_2019_replication/data/README.md)
explains how to insert the data from DBMI in order to replicate the results in the paper.

# Notes
- For training you will need a GPU.
- For bulk inference where speed is not a concern, a machine with plenty of available memory and CPU cores will likely work.
- Model downloads are cached in `~/.cache/torch/bert_document_classification/`. Try clearing this folder if you have issues.
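
The notes above can be sketched as two small helpers: one that picks a `device` string depending on whether a GPU is visible to PyTorch, and one that removes the model cache folder. This is a minimal sketch; the function names `pick_device` and `clear_model_cache` are hypothetical and not part of this package, and the cache path is simply the one listed in the notes.

```python
import shutil
from pathlib import Path

# Cache location as stated in the notes above.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "torch" / "bert_document_classification"


def pick_device() -> str:
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU support to detect.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def clear_model_cache(cache_dir: Path = DEFAULT_CACHE_DIR) -> bool:
    """Delete the cache folder; return True if something was removed."""
    cache_dir = Path(cache_dir).expanduser()
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


# Example usage (commented out so nothing is deleted by accident):
# device = pick_device()
# clear_model_cache()
```

The value returned by `pick_device()` can be passed as the `device` argument of the classifiers. After clearing the cache, the models should be downloaded again on next use.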


# Acknowledgement
If you found this project useful, consider citing our extended abstract.

```
@misc{mulyar2019phenotyping,
    title={Phenotyping of Clinical Notes with Improved Document Classification Models Using Contextualized Neural Language Models},
    author={Andriy Mulyar and Elliot Schumacher and Masoud Rouhizadeh and Mark Dredze},
    year={2019},
    eprint={1910.13664},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

Implementation, development and training in this project were supported by funding from the Mark Dredze Lab at Johns Hopkins University.