# GenIE

**Repository Path**: xmadog/GenIE

## Basic Information

- **Project Name**: GenIE
- **Description**: No description available
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-10-31
- **Last Updated**: 2023-10-31

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# GenIE: Generative Information Extraction

This repository contains a PyTorch implementation of the autoregressive information extraction system GenIE, proposed in the paper [GenIE: Generative Information Extraction](https://arxiv.org/abs/2112.08340). We extend these ideas in our follow-up work on SynthIE; visit [this](https://github.com/epfl-dlab/SynthIE) link for details.

```
@inproceedings{josifoski-etal-2022-genie,
    title = "{G}en{IE}: Generative Information Extraction",
    author = "Josifoski, Martin and De Cao, Nicola and Peyrard, Maxime and Petroni, Fabio and West, Robert",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.342",
    doi = "10.18653/v1/2022.naacl-main.342",
    pages = "4626--4643",
}
```

**Please consider citing our work if you found the provided resources useful.**

---

## GenIE in a Nutshell

GenIE uses a sequence-to-sequence model that takes unstructured text as input and autoregressively generates a structured semantic representation of the information expressed in it, in the form of (subject, relation, object) triplets, as output.
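To make the triplet-set output concrete, here is a minimal sketch of how such triplets can be serialized into a single target string and parsed back. Note that the boundary markers (`<sub>`, `<rel>`, `<obj>`, `<et>`) and the function names below are illustrative assumptions, not necessarily GenIE's exact linearization; see the demo notebook for the format the released models were actually trained with.

```python
# Illustrative triplet linearization sketch. The markers <sub>, <rel>,
# <obj>, <et> are ASSUMPTIONS for demonstration purposes only.

def linearize(triplets):
    """Serialize a list of (subject, relation, object) triplets into
    one flat target string, one triplet at a time."""
    return " ".join(
        f"<sub> {subj} <rel> {rel} <obj> {obj} <et>"
        for subj, rel, obj in triplets
    )

def delinearize(text):
    """Invert linearize(): recover the triplets from the flat string."""
    triplets = []
    for chunk in text.split("<et>"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip trailing empty fragment after the last <et>
        subj, rest = chunk.split("<rel>")
        rel, obj = rest.split("<obj>")
        triplets.append((subj.replace("<sub>", "").strip(),
                         rel.strip(),
                         obj.strip()))
    return triplets
```

A round trip such as `delinearize(linearize(triplets))` recovers the original triplets, which is what allows the generated string to be scored as a set of facts rather than as raw text.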
GenIE employs constrained beam search with: (i) a high-level, structural constraint which asserts that the output corresponds to a set of triplets; and (ii) lower-level, validity constraints which use prefix tries to force the model to only generate valid entity or relation identifiers (from a predefined schema).

Here is an illustration of the generation process for a given example:

![](docs/genie_animation.gif)

Our experiments show that GenIE achieves state-of-the-art performance on the task of closed information extraction, generalizes from fewer training data points than the baselines, and scales to a previously unmanageable number of entities and relations.

## Dependencies

To install the dependencies needed to execute the code in this repository, run:

```bash
bash setup.sh
```

## Usage Instructions & Examples

The [demo notebook](notebooks/Demo.ipynb) provides a full review of how to download and use **GenIE**'s functionalities, as well as the additional data resources.

## Training & Evaluation

#### Training

Each of the provided models (see the [demo](notebooks/Demo.ipynb)) is associated with a Hydra configuration file that reproduces its training. For instance, to run the training for the `genie_r` model, run:

```
MODEL_NAME=genie_r
python run.py experiment=$MODEL_NAME
```

#### Evaluation

[Hydra](https://hydra.cc/docs/intro/) provides a clean interface for evaluation. You just need to specify the checkpoint to be evaluated, the dataset to evaluate it on, and the constraints to be enforced (or not) during generation:

```
PATH_TO_CKPT=

# The name of the dataset ("rebel", "fewrel", "wiki_nre" or "geo_nre")
DATASET_NAME=rebel

# The constraints to be applied ("null" -> unconstrained, "small" or "large"; see the paper or the demo for details)
CONSTRAINTS=large

python run.py run_name=genie_r_rebel +evaluation=checkpoint_$CONSTRAINTS datamodule=$DATASET_NAME model.checkpoint_path=$PATH_TO_CKPT
```

To run the evaluation in a distributed fashion (e.g. with 4 GPUs on a single machine), add the option `trainer=ddp trainer.gpus=4` to the call.

From here, to generate the plots and the bootstrapped results reported in the paper, run `python run.py +evaluation=results_full`. See the [configuration file](configs/evaluation/results_full.yaml) for details.

---

### License

This project is licensed under the terms of the MIT license.