# pdcner

A Pig Disease Chinese Named Entity Recognition (PDCNER) model.

The model integrates external lexicon knowledge of pig diseases by employing Lexicon-Enhanced BERT and enhances feature representation by incorporating contrastive learning.

https://github.com/tufeifei923/pdcner

# Requirements

* Python 3.8.0
* apex 0.1
* Transformers 3.4.0
* NumPy 1.19.2
* Packaging 23.2
* scikit-learn 0.23.2
* torch 1.6.0+cu101
* torchvision 0.7.0+cu101
* tqdm 4.66.2
* multiprocess 0.70.10
* tensorflow-gpu 2.0.0
* tensorboardX 2.1
* seqeval 1.2.1

# Input Format

CoNLL format (BIOES tag scheme preferred), with one character and its label per line. Sentences are separated by a blank line. Illustrative sketches for reading and decoding this format are given at the end of this README.

```
猪 B-disease
蓝 M-disease
耳 M-disease
病 E-disease
曾 O
称 O
为 O
“ O
神 B-disease
秘 M-disease
猪 M-disease
病 E-disease
” O
、 O
“ O
体 O
温 O
一 O
般 O
正 O
常 O
, O
如 O
有 O
继 B-symptom
发 I-symptom
感 I-symptom
染 E-symptom
则 O
```

# Chinese BERT, Chinese Word Embedding, and Checkpoints

### Chinese BERT

Chinese BERT: https://huggingface.co/bert-base-chinese/tree/main

### Chinese Word Embedding

Word Embedding: https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d200-v0.2.0.tar.gz

# Directory Structure of data

* berts
  * bert
    * config.json
    * vocab.txt, the vocab of the pig disease word embedding table
    * pytorch_model.bin
* dataset, which you can download from [here](https://drive.google.com/file/d/1QUn7ssSah2KbFQkWZEL5LWENpyqT2TM0/view?usp=sharing)
  * NER
    * nky-pig
* vocab
  * tencent_vocab.txt, the vocab of the pre-trained word embedding table, download from [here](https://drive.google.com/file/d/1UmtbCSPVrXBX_y4KcovCknJFu9bXXp12/view?usp=sharing)
* embedding
  * word_embedding.txt
* result
  * NER
    * nky-pig
* log

# Run

1. Split the samples by percentage ratio: `python3 split_txt.py`
2. Convert the .txt files to .json files: `python3 txt_json.py`
3. Run the shell script in single-thread mode: `sh run_nky_pig.sh`
4. Run the shell script in multi-thread mode: `sh run_nky_pig_multi.sh`

# Cite

```
@inproceedings{liu-etal-2021-lexicon,
    title = "Lexicon Enhanced {C}hinese Sequence Labeling Using {BERT} Adapter",
    author = "Liu, Wei and Fu, Xiyan and Zhang, Yue and Xiao, Wenming",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.454",
    doi = "10.18653/v1/2021.acl-long.454",
    pages = "5847--5858"
}
```
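
# Data Loading Example (Illustrative)

For reference, the following is a minimal sketch, not part of this repository, of how files in the CoNLL-style format described under "Input Format" could be read into sentences of (character, label) pairs. The file path and the function name `read_conll` are assumptions made for illustration.

```python
# Hypothetical reader for the CoNLL-style files described under
# "Input Format"; not part of this repository.

def read_conll(path):
    """Return a list of sentences, each a list of (character, label) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                  # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            char, label = line.split()    # one character and its label per line
            current.append((char, label))
    if current:                           # file may not end with a blank line
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    # The path assumes the directory layout shown above; adjust as needed.
    for sent in read_conll("data/dataset/NER/nky-pig/train.txt")[:3]:
        print("".join(char for char, _ in sent))
```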
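
The sample under "Input Format" mixes `M` (BMES-style) and `I` (BIOES-style) as inside tags. The sketch below, also illustrative rather than part of the repository, shows one way to turn such a per-character tag sequence into entity spans, treating `M` and `I` interchangeably.

```python
# Illustrative BIOES/BMES span decoder; not part of this repository.
# Treats M and I both as "inside" tags, matching the sample data.

def decode_spans(chars, tags):
    """Return (entity_text, entity_type, start, end) tuples from per-character tags."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":                              # beginning of a multi-character entity
            start, etype = i, label
        elif prefix == "S":                            # single-character entity
            spans.append((chars[i], label, i, i))
            start, etype = None, None
        elif prefix == "E" and start is not None and label == etype:
            spans.append(("".join(chars[start:i + 1]), etype, start, i))
            start, etype = None, None
        elif prefix in ("M", "I") and start is not None and label == etype:
            continue                                   # still inside the current entity
        else:                                          # "O" or an inconsistent tag resets the state
            start, etype = None, None
    return spans


chars = list("猪蓝耳病曾称为")
tags = ["B-disease", "M-disease", "M-disease", "E-disease", "O", "O", "O"]
print(decode_spans(chars, tags))  # [('猪蓝耳病', 'disease', 0, 3)]
```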