# Relation-Extraction

**Repository Path**: semantic_kg/Relation-Extraction

## Basic Information

- **Project Name**: Relation-Extraction
- **Description**: Supports relation extraction under schema constraints: given a predefined set of relations, it extracts SPO triples that conform to the relation schema from natural-language text. This open-source project is supported by the "Cloud Computing and Big Data" special program of the National Key R&D Program of China (project no. 2018YFB1004300).
- **Primary Language**: Python
- **License**: MulanPSL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 2
- **Forks**: 1
- **Created**: 2020-11-17
- **Last Updated**: 2021-12-20

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## Relation Extraction Baseline System: InfoExtractor 2.0

### Abstract

InfoExtractor 2.0 is a relation extraction baseline system developed for DuIE 2.0. Unlike [DuIE 1.0](http://lic2019.ccf.org.cn/kg), the 2.0 task leans more toward colloquial language and introduces **complex relations**, in which a single SPO involves multiple objects. For detailed information about the dataset, please refer to the official website of our [competition](https://aistudio.baidu.com/aistudio/competition/detail/31?isFromCcf=true).

InfoExtractor 2.0 is built on the state-of-the-art pre-trained language model [ERNIE](https://arxiv.org/abs/1904.09223) using PaddlePaddle. We design a structured **tagging strategy** to fine-tune ERNIE directly, so that multiple, overlapping SPOs can be extracted in **a single pass**.

### Tagging Strategy

Our tagging strategy is designed to discover multiple, overlapping SPOs in the DuIE 2.0 task. Following the classic "BIO" tagging scheme, we assign a tag (also called a label) to each token to indicate its position in an entity span. The only difference is that a "B" tag here is further distinguished by predicate and by whether the span is a subject or an object. Suppose there are N predicates. A "B" tag then takes the form "B-predicate-subject" or "B-predicate-object", which yields 2*N **mutually exclusive** "B" tags.
After tagging, we treat the task as token-level multi-label classification, with a total of (2*N+2) labels (2 for the “I” and “O” tags). Below is a visual illustration of our tagging strategy:
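
As a rough textual counterpart to the illustration, the sketch below builds the (2*N+2) label set and a multi-hot label matrix for a toy sentence. It is a minimal sketch under stated assumptions, not the repository's preprocessing code: the function names, the toy predicates (`founder`, `headquarters`), the English tokens, and the rule that untagged tokens receive "O" are all illustrative.

```python
from typing import Dict, List, Tuple

# (start, end, predicate, role); end is exclusive, role is "subject" or "object"
Span = Tuple[int, int, str, str]


def build_label_map(predicates: List[str]) -> Dict[str, int]:
    """Return the 2*N + 2 labels: two "B" tags per predicate, plus "I" and "O"."""
    labels: List[str] = []
    for predicate in predicates:
        labels.append(f"B-{predicate}-subject")
        labels.append(f"B-{predicate}-object")
    labels += ["I", "O"]
    return {label: index for index, label in enumerate(labels)}


def tag_tokens(num_tokens: int, spans: List[Span],
               label_map: Dict[str, int]) -> List[List[int]]:
    """Build a multi-hot matrix of shape (num_tokens, num_labels)."""
    matrix = [[0] * len(label_map) for _ in range(num_tokens)]
    for start, end, predicate, role in spans:
        # The first token of a span gets the predicate- and role-specific "B" tag.
        matrix[start][label_map[f"B-{predicate}-{role}"]] = 1
        # Remaining tokens of the span share the single "I" tag.
        for position in range(start + 1, end):
            matrix[position][label_map["I"]] = 1
    # Assumption: tokens that received no tag are labelled "O".
    for row in matrix:
        if not any(row):
            row[label_map["O"]] = 1
    return matrix


# Toy example (hypothetical data): "Acme" is the subject of two SPOs.
tokens = ["Alice", "founded", "Acme", "in", "Paris"]
label_map = build_label_map(["founder", "headquarters"])
spans = [
    (2, 3, "founder", "subject"), (0, 1, "founder", "object"),
    (2, 3, "headquarters", "subject"), (4, 5, "headquarters", "object"),
]
matrix = tag_tokens(len(tokens), spans, label_map)
for token, row in zip(tokens, matrix):
    print(token, row)
```

Because the target is multi-hot rather than one-hot, the token "Acme" can carry both `B-founder-subject` and `B-headquarters-subject` at once, which is what lets overlapping SPOs be represented in one pass.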

For **complex relations** in the DuIE 2.0 task, we simply treat affiliated objects as independent instances (SPOs) that share the same subject. Everything other than the tagging strategy is implemented in the most straightforward way. The model input is: *input text*, and the final hidden states are projected directly into classification probabilities.

### Environments

Python3 + Paddle Fluid 1.5 (please confirm your Python path in the scripts). Dependencies are listed in `./requirements.txt`. The code is tested on a single P40 GPU with CUDA version 10.1 and GPU driver version 418.39.

### Download Dataset

Please download the training data and development data from the [competition website](https://aistudio.baidu.com/aistudio/competition/detail/31?isFromCcf=true), then unzip the files into `./data/` and rename them to `train.json` and `dev.json`.

- MD5 code for train.json: c31b9c53382b29688867ff0cfdc57ec6
- MD5 code for dev.json: 05b4ac0336b0a5ff402115a5d2060331

### Download pre-trained ERNIE model

Download the ERNIE1.0 Base (max-len-512) model and extract it into `./pretrained_model/`:

```
cd ./pretrained_model/
wget --no-check-certificate https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz
tar -zxvf ERNIE_1.0_max-len-512.tar.gz
```

### Training

```
sh ./script/train.sh
```

By default the checkpoints are saved into `./checkpoints/`. The GPU ID can be specified in the script. On P40 devices, the batch size can be set up to 64 with a max-seq-len of 256.

Multi-GPU training is supported once `LD_LIBRARY_PATH` is specified in the script:

```
export LD_LIBRARY_PATH=/your/custom/path:$LD_LIBRARY_PATH
```

**Accuracy** (token-level and example-level) is printed during the training procedure.

### Prediction

Specify your checkpoints directory in the prediction script, and then run:

```
sh ./script/predict.sh
```

This writes the predictions into a JSON file with the same format as the original dataset (required for the final official evaluation).
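
Before predictions can be written out, the per-token tags have to be turned back into SPO records. The sketch below shows one plausible way to do this and is not the repository's decoding code. It assumes that each token comes with a set of predicted tag strings, that subjects and objects of the same predicate are paired by a simple Cartesian product, and that records use hypothetical field names (`text`, `spo_list`, `subject`, `predicate`, `object`); the actual output schema is whatever the dataset files define.

```python
import json
from itertools import product
from typing import Dict, List, Set, Tuple


def extract_spans(tokens: List[str],
                  tags: List[Set[str]]) -> Dict[Tuple[str, str], List[str]]:
    """Collect entity strings per (predicate, role) from per-token tag sets."""
    spans: Dict[Tuple[str, str], List[str]] = {}
    for start, labels in enumerate(tags):
        for label in sorted(labels):
            if not label.startswith("B-"):
                continue
            # "B-<predicate>-<role>": split the role off from the right so
            # that predicates containing hyphens are kept intact.
            head, role = label.rsplit("-", 1)
            predicate = head[2:]
            # Extend the span over the following "I" tokens.
            end = start + 1
            while end < len(tags) and "I" in tags[end]:
                end += 1
            entity = " ".join(tokens[start:end])
            spans.setdefault((predicate, role), []).append(entity)
    return spans


def spans_to_spos(spans: Dict[Tuple[str, str], List[str]]) -> List[dict]:
    """Pair every subject with every object of the same predicate (assumption)."""
    spos = []
    for predicate in sorted({key[0] for key in spans}):
        subjects = spans.get((predicate, "subject"), [])
        objects = spans.get((predicate, "object"), [])
        for subject, obj in product(subjects, objects):
            spos.append({"subject": subject, "predicate": predicate, "object": obj})
    return spos


# Toy prediction (hypothetical data).
tokens = ["Alice", "founded", "Acme", "in", "Paris"]
tags = [
    {"B-founder-object"}, {"O"},
    {"B-founder-subject", "B-headquarters-subject"},
    {"O"}, {"B-headquarters-object"},
]
spos = spans_to_spos(extract_spans(tokens, tags))
record = json.dumps({"text": " ".join(tokens), "spo_list": spos}, ensure_ascii=False)
print(record)
```

Pairing by Cartesian product is the simplest choice and can over-generate when one sentence contains several subjects and objects for the same predicate; a real system may need a stricter pairing rule.
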
The GPU ID and batch size can be specified in the script. The final prediction file is saved into `./data/`.

### Official Evaluation

Zip your prediction JSON file and then run the official evaluation:

```
zip ./data/predict_dev.json.zip ./data/predict_dev.json
python ./script/re_official_evaluation.py --golden_file=./data/dev.json --predict_file=./data/predict_dev.json.zip [--alias_file alias_dict]
```

Precision, recall and F1 scores are the official evaluation metrics used to measure the performance of participating systems. The alias file lists entities that have more than one correct mention. It is not provided for security reasons.
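
For reference, precision, recall and F1 over sets of SPO triples can be computed as in the sketch below. This is a minimal illustration of the metrics only, assuming exact-match comparison of (subject, predicate, object) tuples pooled into two sets; it does not reproduce `re_official_evaluation.py`, which may aggregate and normalize differently and can additionally use the alias file. The function name and the toy triples are hypothetical.

```python
from typing import Set, Tuple

SPO = Tuple[str, str, str]  # (subject, predicate, object)


def precision_recall_f1(golden: Set[SPO],
                        predicted: Set[SPO]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between two sets of SPO triples."""
    correct = len(golden & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(golden) if golden else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (hypothetical data): one of three predictions is correct.
golden = {
    ("Acme", "founder", "Alice"),
    ("Acme", "headquarters", "Paris"),
}
predicted = {
    ("Acme", "founder", "Alice"),
    ("Acme", "founder", "Bob"),
    ("Acme", "headquarters", "London"),
}
precision, recall, f1 = precision_recall_f1(golden, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these toy sets, one correct triple out of three predictions and two golden triples gives a precision of 1/3, a recall of 1/2 and an F1 of 0.4.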