# Doc2EDAG

**Repository Path**: ma-lechi/Doc2EDAG

## Basic Information

- **Project Name**: Doc2EDAG
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-12-04
- **Last Updated**: 2021-12-04

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Doc2EDAG

Source code for the paper ["Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction"](https://arxiv.org/abs/1904.07535), EMNLP 2019.

## Overview

**Document-level Event Extraction** (DEE) is urgently demanded in many applications but faces two major challenges:

- **Arguments-scattering**: the arguments of an event record are typically scattered across multiple sentences of a document.
- **Multi-event**: multiple event records with scattered arguments frequently coexist in one document.

Below we show an example to intuitively illustrate these two challenges.

To combat the aforementioned challenges, we propose a truly end-to-end model, **Doc2EDAG**, for DEE, which takes a document as input and directly emits event tables with multiple entries. In general, end-to-end DEE needs to complete the following tasks jointly:

- **Entity Extraction** (easy)
- **Event Triggering** (easy)
- **Event Table Filling** (hard)

How can Doc2EDAG achieve this? This is owing to a novel structure, the **entity-based directed acyclic graph** (EDAG). Instead of directly filling a table, Doc2EDAG generates an EDAG in an auto-regressive manner. In this way, the hard table-filling task is decomposed into several path-expanding sub-tasks that are more tractable. The following figure shows the overall architecture of Doc2EDAG; for more details, please refer to our paper.
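To make the path-expanding idea concrete, here is a toy, purely illustrative sketch (all names and structures here are our own invention, not code from this repository): an event type defines an ordered list of roles, and the graph is grown role by role, where each partial path is extended with a candidate entity or a special NA node. In the real model the extension decision is made by a learned classifier; here a stub accepts everything, so the sketch simply enumerates all combinations.

```python
# Toy illustration of EDAG-style path expansion (hypothetical names, NOT
# from the Doc2EDAG codebase): a hard multi-event table-filling task is
# decomposed into one entity-or-NA decision per role, per partial path.
NA = None
ROLES = ["Pledger", "PledgedShares", "StartDate"]  # ordered event roles

def expand_paths(paths, role, candidates, keep):
    """One path-expanding sub-task: extend every partial path with each
    candidate entity (or NA) that the `keep` predicate accepts."""
    new_paths = []
    for path in paths:
        for ent in candidates + [NA]:
            if keep(path, role, ent):
                new_paths.append(path + [(role, ent)])
    return new_paths

# `keep` stands in for a learned per-role classifier; this stub accepts
# every extension, so the expansion enumerates all combinations.
paths = [[]]
for role in ROLES:
    paths = expand_paths(paths, role, ["EntityA", "EntityB"],
                         lambda p, r, e: True)

# Each complete root-to-leaf path corresponds to one event record (one
# table row): 3 choices (EntityA, EntityB, NA) per role over 3 roles.
print(len(paths))  # -> 27
```

Because a classifier prunes most extensions in practice, the surviving paths stay few, and multiple records sharing a prefix of arguments naturally share a prefix of the graph.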

## Dataset

We utilize financial announcements of listed companies in China from 2008 to 2018 and build a large-scale dataset for DEE via distant supervision.

To unpack the data, run

```bash
unzip Data.zip
```

## Usage

### Setup

Please use `Python 3(.6)` as well as the following packages:

```text
torch >= 1.0.0
pytorch-pretrained-bert == 0.4.0
tensorboardX
numpy
tqdm
```

### Training

For a machine with 8 GPUs, run

```bash
./train_multi.sh 8 --task_name [TASK_NAME]
```

If you want to use only 4 GPUs (IDs 0, 3, 5, 7), run

```bash
CUDA_VISIBLE_DEVICES=0,3,5,7 ./train_multi.sh 4 --task_name [TASK_NAME] --gradient_accumulation_steps 16
```

Please note that

- By setting a large gradient-accumulation step size, we can achieve large-batch training with a few common GPUs. Specifically, for a Titan X (12GB memory), you should maintain `B/(N*G) == 1`, where `B`, `N`, and `G` denote the batch size, the number of GPUs, and the gradient-accumulation step size, respectively.
- If you want to use BERT, just set `--use_bert True`, but note that using BERT requires much larger GPU memory (at least 24GB; the more the better).

### Evaluation

To get evaluation results, run

```bash
./eval.sh --task_name [TASK_NAME] --eval_model_names DCFEE-O,DCFEE-M,GreedyDec,Doc2EDAG
```

You can run this evaluation script at any time after training starts, and it will report the latest results.

### Reproducing Experiments

To reproduce all experiments reported in our paper, just run

```bash
./reprod_all_exps.sh
```

Please note that we assume you have 8 GPUs, each with 12GB memory, and that the total runtime can be very long.

## Citation

If you find our work interesting, you can cite the paper as

```text
@inproceedings{zheng2019doc2edag,
  title={{Doc2EDAG}: An End-to-End Document-level Framework for Chinese Financial Event Extraction},
  author={Zheng, Shun and Cao, Wei and Xu, Wei and Bian, Jiang},
  booktitle={EMNLP},
  year={2019}
}
```
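As a quick sanity check of the batch-size relation `B/(N*G) == 1` from the Training section, the following helper (hypothetical, not part of this repository) computes the accumulation step size `G` for a given total batch size and GPU count:

```python
def grad_accum_steps(batch_size: int, num_gpus: int) -> int:
    """Return the gradient-accumulation step size G so that
    B / (N * G) == 1, i.e. each GPU holds one example per forward pass
    (the regime suggested above for 12GB Titan X cards)."""
    if batch_size % num_gpus != 0:
        raise ValueError("batch size must be divisible by the GPU count")
    return batch_size // num_gpus

# Example: a total batch of 64 on 4 GPUs needs 16 accumulation steps,
# matching the `--gradient_accumulation_steps 16` flag shown above.
print(grad_accum_steps(64, 4))  # -> 16
print(grad_accum_steps(64, 8))  # -> 8
```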