# nCoV_sentence_simi **Repository Path**: linux_man/nCoV_sentence_simi ## Basic Information - **Project Name**: nCoV_sentence_simi - **Description**: nCoV related sentence similarity by BERT - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 1 - **Created**: 2020-03-24 - **Last Updated**: 2020-12-19 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # nCoV-2019 related sentence similarity If useful for you, maybe a star to encourage our work. ### introduce ERNIE , RoBerta based model for sentence similarity For example: ``` 387,支原体肺炎,支原体肺炎的症状及治疗方法是什么,肺炎衣原体与肺炎支原体有什么区别?,0 388,支原体肺炎,支原体肺炎的症状及治疗方法是什么,肺炎支原体培养及药敏的检验单怎么看?,0 389,支原体肺炎,支原体肺炎的症状及治疗方法是什么,小儿支原体与小儿支原体肺炎相同吗?,0 390,支原体肺炎,宝宝支原体肺炎感染的症状有哪些?,宝宝肺炎支原体感染的症状是什么?,1 391,支原体肺炎,宝宝支原体肺炎感染的症状有哪些?,宝宝支原体肺炎感染有什么症状?,1 ``` ### 95.2 acc online (simply choose the 1st fold, 1/6) - ERNIE 1.0 - Nadam with 2.0*1e-5 lr - OHEM CE, with label smoothing - cosine lr scheduler with warmup - clean noise data by an overfitted model ### more tricks maybe - simply change the model - add any 'word2vec' features - split into multipiece data,get N bert,
using multiple feature to train a tree based
model, lightGBM, Xgboost... - for those hard example, maybe add the nearest sentence
(pair with label) for reference info, into bert - pseudo label - more open data(e.g ping an CHIP 2019) - ... ## denpendency - opencv-python - pytorch >= 1.4 - pandas - yacs - sklearn ## prepare - download the ernie (128 length) model from https://github.com/nghuyong/ERNIE-Pytorch - but using the config in this repo at pretrained/ernie/ ## train you maye change the data path, have a look at train.py test.py ``` export PYTHONPATH=./ sh train_pipeline.sh ``` ## ref https://tianchi.aliyun.com/competition/entrance/231776/introduction?spm=5176.12281949.1003.4.21eb2448atCLQk