# text_to_table

**Repository Path**: tangentor/text_to_table

## Basic Information

- **Project Name**: text_to_table
- **Description**: This repository is for our ACL2022 paper: Text-to-Table: A New Way of Information Extraction.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-02-07
- **Last Updated**: 2023-02-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

**Update 01/12/2023**: for BERTScore evaluation, using `transformers>4.17.0` will lead to results different from those reported in the paper. If you have difficulty replicating the BERTScore results, please try downgrading to `transformers<4.17.0`. In my experiments I used `transformers==3.1.0`. For details, please refer to https://github.com/shirley-wu/text_to_table/issues/7

---

## Introduction

This repository is for our ACL2022 paper: [Text-to-Table: A New Way of Information Extraction](https://arxiv.org/abs/2109.02707).

## Requirements

Training requires `fairseq==v0.10.2`, and evaluation requires `sacrebleu==v2.0.0 bert_score==v0.3.11`. Alternatively, you can install everything directly with `pip install -r requirements.txt`.

Note: to avoid potential incompatibility, your fairseq version should be **exactly v0.10.2**, and your Python version should be **<3.9**.

## Dataset

You can download the four datasets from [Google Drive](https://drive.google.com/file/d/1zTfDFCl1nf_giX7IniY5WbXi9tAuEHDn/view?usp=sharing). If you are interested in preprocessing the original table-to-text datasets into our text-to-table datasets, please check [data_preprocessing](data_preprocessing/).

For preprocessing, we use `fairseq` for BPE and binarization. You need to first download a BART model [here](https://github.com/pytorch/fairseq/tree/main/examples/bart), and then use `scripts/preprocess.sh` to preprocess the data. The script takes two arguments: the first is the data path and the second is the BART model path, e.g.,

```bash
bash scripts/preprocess.sh data/rotowire/ bart.base/
```

Afterwards, you'll have BPE-ed files under `data/rotowire` and binary files under `data/rotowire/bins`.

## Training

For each dataset, use `scripts/dataset-name/train_vanilla.sh` to train a vanilla seq2seq model, and use `scripts/dataset-name/train_had.sh` to train a HAD model. The training scripts take two arguments: the first is the data path (NOTE: it's not the path to the binary files) and the second is the BART model path, e.g.,

```bash
bash scripts/rotowire/train_had.sh data/rotowire/ bart.base/
```

Additionally, the Rotowire and WikiTableText datasets are very small, so we run experiments with 5 seeds (1, 10, 20, 30, 40) and report the average numbers. Scripts under `scripts/rotowire` and `scripts/wikitabletext` take the seed as the third argument.

Rotowire and WikiBio experiments are run on 8 GPUs; E2E and WikiTableText experiments are run on 1 GPU. You'll need GPUs that support `--fp16` (such as V100). If not, please remove the `--fp16` option from the scripts.

## Inference and Evaluation

For each dataset, use `scripts/dataset-name/test_vanilla.sh` to test with vanilla decoding, and use `scripts/dataset-name/test_constraint.sh` to test with the table constraint. The test scripts take two arguments: the first is the data path and the second is the checkpoint path (by default, the directory where training saves its checkpoints), e.g.,

```bash
bash scripts/rotowire/test_constraint.sh data/rotowire/
```
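If your checkpoint was saved somewhere other than the default location, you can pass its path explicitly as the second argument. The directory name below is only a placeholder for illustration, not a path defined by this repository; substitute wherever your training run actually wrote its checkpoints:

```bash
# Hypothetical checkpoint directory -- replace with the directory where
# your training run saved its checkpoints.
bash scripts/rotowire/test_constraint.sh data/rotowire/ checkpoints/rotowire_had/
```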
As with training, you'll need GPUs that support `--fp16`. If not, please remove the `--fp16` option from the script.
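One quick way to strip the flag, assuming GNU `sed` and that `--fp16` appears literally in the script, is an in-place substitution (editing the script by hand works just as well):

```bash
# Remove the literal --fp16 flag in place.
# Assumes GNU sed; on macOS/BSD, use `sed -i ''` instead of `sed -i`.
sed -i 's/ --fp16//g' scripts/rotowire/test_constraint.sh
```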