# bert_document_classification

**Repository Path**: panxuefeng235/bert_document_classification

## Basic Information

- **Project Name**: bert_document_classification
- **Description**: long text
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-12-08
- **Last Updated**: 2021-12-22

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# :book: BERT Long Document Classification :book:

An easy-to-use interface to fully trained BERT-based models for multi-class and multi-label long document classification. Pre-trained models are currently available for two clinical note (EHR) phenotyping tasks: smoker identification and obesity detection.

To sustain future development and improvements, we interface [pytorch-transformers](https://github.com/huggingface/pytorch-transformers) for all language model components of our architectures. Additionally, there is a [blog post](http://andriymulyar.com/blog/bert-document-classification) describing the idea behind the architecture.

*This repository contains an updated implementation that corrects an error found in the original version of the preprint.*

# Installation

Install with pip:

```
pip install bert_document_classification
```

or directly from GitHub:

```
pip install git+https://github.com/AndriyMulyar/bert_document_classification
```

# Use

The classifiers map text documents of arbitrary length to binary vectors indicating labels.

```python
from bert_document_classification.models import SmokerPhenotypingBert
from bert_document_classification.models import ObesityPhenotypingBert

# Run prediction on the GPU ...
smoking_classifier = SmokerPhenotypingBert(device='cuda', batch_size=10)
# ... or on the CPU if you prefer.
obesity_classifier = ObesityPhenotypingBert(device='cpu', batch_size=10)

# predict() takes a list of documents.
smoking_classifier.predict(["I'm a document! Make me long and the model can still perform well!"])
```

More [examples](/examples).

# Replication

Go to the directory [/examples/ml4health_2019_replication](/examples/ml4health_2019_replication). The [README](/examples/ml4health_2019_replication/data/README.md) in its `data` folder explains how to insert the data from DBMI to replicate the results in the paper.

# Notes

- For training you will need a GPU.
- For bulk inference where speed is not a concern, a machine with plenty of available memory and CPU cores will likely suffice.
- Model downloads are cached in `~/.cache/torch/bert_document_classification/`. Try clearing this folder if you have issues.

# Acknowledgement

If you found this project useful, consider citing our extended abstract.

```
@misc{mulyar2019phenotyping,
    title={Phenotyping of Clinical Notes with Improved Document Classification Models Using Contextualized Neural Language Models},
    author={Andriy Mulyar and Elliot Schumacher and Masoud Rouhizadeh and Mark Dredze},
    year={2019},
    eprint={1910.13664},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

Implementation, development and training in this project were supported by funding from the Mark Dredze Lab at Johns Hopkins University.
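
The Notes suggest clearing the model cache folder when downloads misbehave. A minimal sketch of that step in Python follows. It assumes the cache path given in the Notes, and `clear_model_cache` is a hypothetical helper written for illustration, not a function provided by this package.

```python
import shutil
from pathlib import Path

# Cache location as stated in the Notes section of this README.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "torch" / "bert_document_classification"


def clear_model_cache(cache_dir: Path = DEFAULT_CACHE_DIR) -> bool:
    """Delete the cache directory and report whether anything was removed.

    Hypothetical helper for illustration; it is not part of the package.
    """
    cache_dir = Path(cache_dir).expanduser()
    if not cache_dir.is_dir():
        # Nothing has been cached yet, or the folder was already cleared.
        return False
    shutil.rmtree(cache_dir)  # removes the folder and everything inside it
    return True
```

Calling `clear_model_cache()` with no arguments targets the default folder. The models should then be downloaded again the next time a classifier is constructed, assuming the package re-fetches missing files.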