# ace2005parser

**Repository Path**: billy_liu/ace2005parser

## Basic Information

- **Project Name**: ace2005parser
- **Description**: ACE2005事件抽取数据预处理
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 2
- **Forks**: 0
- **Created**: 2021-01-04
- **Last Updated**: 2022-05-17

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

﻿# ACE2005事件抽取数据预处理

* **补充说明: 这份代码只处理了ACE05语料中含有事件的句子，而这在具体的论文实验中是错误的，应当使用全量数据，在此说明。**

ACE2005事件抽取数据预处理工作是指，根据原始的.apf.xml和.sgm文件，提取与事件有关的要素(sentence,trigger,argument及trigger和argument在原文中的offset)，并通过StandfordCoreNLP对sentence进行词性和句法依赖解析，根据.apf.xml文件中的entity、value和timex2对句子进行"BIO"的类型标注，最终将结果以json形式写入。

[2019.08.02更]add_bio.py存在的一个小bug:对于一些entity/time/value都不包含的句子，将会出现没有bio的情况(其他正常的提取没有问题)。这里代码不改了，后续添加吧(例如:un/rec.arts.mystery_20050219.1126)

## 代码说明

### requirements

* xml.etree.cElementTree
* stanfordcorenlp
* tokenize

* standforcoeNLP模型下载地址

https://stanfordnlp.github.io/CoreNLP/history.html


### parse.py

从xml文件中提取事件要素信息，存入./event_json

### std_parse.py

结合./event_json和.apf.xml文件，首先使用NLTK进行分句，然后用StandfordCorNLP进行必要的标注并计算每个token相对全文的offset, 结果写入./anno_event_json


### add_bio.py

给./anno_event_json中的标注结果添加上BIO标注信息(根据.apf.xml中的entity,value和time进行BIO标注), 结果写入./anno_event_json_final/目录下。

注意，部分sentence中不包括LDC所标注的entity/value/time,因此这些sentence的字典中没有'bio'一项，例如
```
{"article_id":"AFP_ENG_20030304.0250",
"event_id":"AFP_ENG_20030304.0250-EV5-1",
"event_type":"Conflict",
"event_subtype":"Attack",
"sentence_id":-7493870223321462845,
"sentence":"There were no reports of injuries in the second blast.",
"sentence_start":862,
"sentence_end":916,
"tokens":["There", "were", "no", "reports", "of", "injuries", "in", "the", "second", "blast", "."],
"pos":[["There", "EX"], ["were", "VBD"], ["no", "DT"], ["reports", "NNS"], ["of", "IN"], ["injuries", "NNS"], ["in", "IN"], ["the", "DT"], ["second", "JJ"], ["blast", "NN"], [".", "."]],
"ner":[["There", "O"], ["were", "O"], ["no", "O"], ["reports", "O"], ["of", "O"], ["injuries", "O"], ["in", "O"], ["the", "O"], ["second", "ORDINAL"], ["blast", "O"], [".", "O"]],
"dependency":[["ROOT", 0, 2], ["expl", 2, 1], ["neg", 4, 3], ["nsubj", 2, 4], ["case", 6, 5], ["nmod", 4, 6], ["case", 10, 7], ["det", 10, 8], ["amod", 10, 9], ["nmod", 2, 10], ["punct", 2, 11]],
"trigger":"blast",
"trigger_start":"910",
"trigger_end":"914",
"arguments":[],
"tokens_offset":[[862, 866], [868, 871], [873, 874], [876, 882], [884, 885], [887, 894], [896, 897], [899, 901], [903, 908], [910, 914], [915, 915]]}
```

### 数据样例

见./example/目录下

## 其他说明

之前有使用[ace-data-prep](https://github.com/mgormley/ace-data-prep/)进行预处理，但根据处理结果观察，得到的是用于关系抽取的预处理结果，数据中不包含事件抽取相关要素信息。处理过程记录见https://blog.csdn.net/carrie_0307/article/details/91128013


## TO DO

1. 列出train/dev/test的数据列表

2. 将trigger identification, argument identification和argument classification的"labels"整理出来


---

以上是数据处理过程，欢迎大家使用。

事件总量: 5349

2019.06.21