# TaBERT

**Repository Path**: hanchu711/TaBERT

## Basic Information

- **Project Name**: TaBERT
- **Description**: https://github.com/facebookresearch/TaBERT.git
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-02-08
- **Last Updated**: 2024-10-14

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables

This repository contains source code for the [`TaBERT` model](https://arxiv.org/abs/2005.08314), a pre-trained language model for learning joint representations of natural language utterances and (semi-)structured tables for semantic parsing. `TaBERT` is pre-trained on a massive corpus of 26M Web tables and their associated natural language context, and can be used as a drop-in replacement for a semantic parser's original encoder to compute representations of utterances and table schemas (columns).

## Installation

First, install the conda environment `tabert` with supporting libraries:

```bash
bash scripts/setup_env.sh
```

Once the conda environment is created, install `TaBERT` using the following command:

```bash
conda activate tabert
pip install --editable .
```

**Integration with HuggingFace's pytorch-transformers Library** is still WIP. While all the pre-trained models were developed based on the older library, `pytorch-pretrained-bert`, they are compatible with the latest version, `transformers`. The conda environment installs both versions of the transformers library, and `TaBERT` uses `pytorch-pretrained-bert` by default. You could uninstall the `pytorch-pretrained-bert` library if you prefer using `TaBERT` with the latest version of `transformers`.

## Pre-trained Models

Pre-trained models can be downloaded from this [Google Drive shared folder](https://drive.google.com/drive/folders/1fDW9rLssgDAv19OMcFGgFJ5iyd9p7flg?usp=sharing). Please uncompress the tarball files before usage.
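For example, a downloaded archive can be unpacked with standard `tar` (the file name below is a placeholder for illustration; the actual names in the shared folder may differ):

```bash
# Example only: substitute the actual file name of the downloaded archive.
mkdir -p models
tar -xvf tabert_base_k1.tar.gz -C models/
```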
Pre-trained models can also be downloaded from the command line as follows:

```shell script
pip install gdown

# TaBERT_Base_(K=1)
gdown 'https://drive.google.com/uc?id=1-pdtksj9RzC4yEqdrJQaZu4-dIEXZbM9'

# TaBERT_Base_(K=3)
gdown 'https://drive.google.com/uc?id=1NPxbGhwJF1uU9EC18YFsEZYE-IQR7ZLj'

# TaBERT_Large_(K=1)
gdown 'https://drive.google.com/uc?id=1eLJFUWnrJRo6QpROYWKXlbSOjRDDZ3yZ'

# TaBERT_Large_(K=3)
gdown 'https://drive.google.com/uc?id=17NTNIqxqYexAzaH_TgEfK42-KmjIRC-g'
```

## Using a Pre-trained Model

To load a pre-trained model from a checkpoint file:

```python
from table_bert import TableBertModel

model = TableBertModel.from_pretrained(
    'path/to/pretrained/model/checkpoint.bin',
)
```

To produce representations of natural language text and its associated table:

```python
from table_bert import Table, Column

table = Table(
    id='List of countries by GDP (PPP)',
    header=[
        Column('Nation', 'text', sample_value='United States'),
        Column('Gross Domestic Product', 'real', sample_value='21,439,453')
    ],
    data=[
        ['United States', '21,439,453'],
        ['China', '27,308,857'],
        ['European Union', '22,774,165'],
    ]
).tokenize(model.tokenizer)

# To visualize the table in an IPython notebook:
# display(table.to_data_frame(), detokenize=True)

context = 'show me countries ranked by GDP'

# The model takes batched, tokenized inputs.
context_encoding, column_encoding, info_dict = model.encode(
    contexts=[model.tokenizer.tokenize(context)],
    tables=[table]
)
```

For the returned tuple, `context_encoding` and `column_encoding` are PyTorch tensors representing utterances and table columns, respectively. `info_dict` contains useful meta information (e.g., context/table masks, the original input tensors to BERT) for downstream applications.

```python
context_encoding.shape
>>> torch.Size([1, 7, 768])

column_encoding.shape
>>> torch.Size([1, 2, 768])
```

**Use Vanilla BERT**

To initialize a TaBERT model from the parameters of BERT:

```python
from table_bert import TableBertModel

model = TableBertModel.from_pretrained('bert-base-uncased')
```

## Example Applications

TaBERT can be used as a general-purpose representation learning layer for semantic parsing tasks over database tables. Example applications can be found under the `examples` folder.
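As a toy illustration of how these encodings feed a downstream model (a minimal sketch, not one of the shipped examples; it simply continues from the `model.encode(...)` snippet above), the column vectors can be scored against a pooled utterance vector:

```python
import torch

# Sketch only: reuses `context_encoding` and `column_encoding` from the example above.
# Score each column against the mean-pooled utterance representation.
question_vec = context_encoding.mean(dim=1)                          # [batch, hidden]
scores = torch.einsum('bd,bcd->bc', question_vec, column_encoding)   # [batch, num_columns]
column_probs = torch.softmax(scores, dim=-1)                         # e.g., which column the utterance refers to
```

A real parser would attach task-specific layers instead (and use the masks in `info_dict` to ignore padding), but the interface is the same: consume the two encoding tensors returned by `model.encode`.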
## Extract/Preprocess Table Corpora from CommonCrawl and Wikipedia

### Prerequisite

The following libraries are used for data extraction:

* [`jnius`](https://pyjnius.readthedocs.io/en/stable/)
* [`info.bliki.wiki`](https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Mediawiki2HTML)
* wikitextparser
* Beautiful Soup 4
* Java Wikipedia code located at `contrib/wiki_extractor`
  * It compiles to a `.jar` file using Maven, which is also included in the folder
* `jdk` 12+

### Installation

First, you need to install the Java JDK. Then use the following commands to install the necessary Python libraries:

```
pip install -r preprocess/requirements.txt
python -m spacy download en_core_web_sm
```

### Training Table Corpora Extraction

#### CommonCrawl WDC Web Table Corpus 2015

Details of the dataset can be found [here](http://webdatacommons.org/webtables/2015/downloadInstructions.html). We used the English relational tables split, which can be downloaded [here](http://data.dws.informatik.uni-mannheim.de/webtables/2015-07/englishCorpus/compressed/).

The script to preprocess the data is at `scripts/preprocess_commoncrawl_tables.sh`. The following command pre-processes [a sample](http://data.dws.informatik.uni-mannheim.de/webtables/2015-07/sample.gz) of the whole WDC dataset. To preprocess the whole dataset, simply replace the `input_file` with the root folder of the downloaded tarball files.

```shell script
mkdir -p data/datasets
wget http://data.dws.informatik.uni-mannheim.de/webtables/2015-07/sample.gz -P data/datasets
gzip -d < data/datasets/sample.gz > data/datasets/commoncrawl.sample.jsonl

python \
    -m preprocess.common_crawl \
    --worker_num 12 \
    --input_file data/datasets/commoncrawl.sample.jsonl \
    --output_file data/preprocessed_data/common_crawl.preprocessed.jsonl
```

#### Wikipedia Tables

The script to extract Wiki tables is at `scripts/extract_wiki_tables.sh`. It demonstrates extracting tables from a sampled Wikipedia dump. Again, you may need the full Wikipedia dump to perform data extraction.

### Notes for Table Extraction

**Extract Tables from Scraped HTML Pages**

Most code in `preprocess.extract_wiki_data` is for extracting the surrounding natural language sentences around tables. If you are only interested in extracting tables (e.g., from scraped Wiki Web pages), you could just use the `extract_table_from_html` function. See the comments for more details.

## Training Data Generation

This section documents how to generate training data for masked language modeling from extracted and preprocessed tables. The scripts to generate training data for our vanilla `TaBERT(K=1)` and vertical attention `TaBERT(K=3)` models are `utils/generate_vanilla_tabert_training_data.py` and `utils/generate_vertical_tabert_training_data.py`, respectively. They are heavily optimized for generating data in parallel in a distributed compute environment, but can still be used locally.

The following script assumes you have concatenated the `.jsonl` files obtained from running the data extraction scripts on the Wikipedia and CommonCrawl corpora and saved the result to `data/preprocessed_data/tables.jsonl`:

```shell script
cd data/preprocessed_data
cat common_crawl.preprocessed.jsonl wiki_tables.jsonl > tables.jsonl
```

The following script generates training data for a vanilla `TaBERT(K=1)` model:

```shell script
output_dir=data/train_data/vanilla_tabert
mkdir -p ${output_dir}

python -m utils.generate_vanilla_tabert_training_data \
    --output_dir ${output_dir} \
    --train_corpus data/preprocessed_data/tables.jsonl \
    --base_model_name bert-base-uncased \
    --do_lower_case \
    --epochs_to_generate 15 \
    --max_context_len 128 \
    --table_mask_strategy column \
    --context_sample_strategy concate_and_enumerate \
    --masked_column_prob 0.2 \
    --masked_context_prob 0.15 \
    --max_predictions_per_seq 200 \
    --cell_input_template 'column|type|value' \
    --column_delimiter "[SEP]"
```

The following script generates training data for a `TaBERT(K=3)` model with vertical self-attention:

```shell script
output_dir=data/train_data/vertical_tabert
mkdir -p ${output_dir}

python -m utils.generate_vertical_tabert_training_data \
    --output_dir ${output_dir} \
    --train_corpus data/preprocessed_data/tables.jsonl \
    --base_model_name bert-base-uncased \
    --do_lower_case \
    --epochs_to_generate 15 \
    --max_context_len 128 \
    --table_mask_strategy column \
    --context_sample_strategy concate_and_enumerate \
    --masked_column_prob 0.2 \
    --masked_context_prob 0.15 \
    --max_predictions_per_seq 200 \
    --cell_input_template 'column|type|value' \
    --column_delimiter "[SEP]"
```

**Parallel Data Generation**

The scripts have two additional arguments, `--global_rank` and `--world_size`. To generate training data in parallel using `N` processes, just fire up `N` processes with the same set of arguments and `--world_size=N`. The argument `--global_rank` is set to `0, 1, ..., N-1`, one value per process.
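A minimal launcher sketch (not a script from the repository), assuming four local workers and reusing the arguments from the vanilla example above, with most flags omitted for brevity:

```shell script
# Sketch only: launch 4 local workers; pass the same flags as in the example above.
world_size=4
for rank in $(seq 0 $((world_size - 1))); do
    python -m utils.generate_vanilla_tabert_training_data \
        --output_dir data/train_data/vanilla_tabert \
        --train_corpus data/preprocessed_data/tables.jsonl \
        --base_model_name bert-base-uncased \
        --do_lower_case \
        --global_rank ${rank} \
        --world_size ${world_size} &
done
wait
```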
## Model Training

Our models are trained on a cluster of 32GB Tesla V100 GPUs. The following script demonstrates training a vanilla `TaBERT(K=1)` model using a single GPU with gradient accumulation:

```shell script
mkdir -p data/runs/vanilla_tabert

python train.py \
    --task vanilla \
    --data-dir data/train_data/vanilla_tabert \
    --output-dir data/runs/vanilla_tabert \
    --table-bert-extra-config '{}' \
    --train-batch-size 8 \
    --gradient-accumulation-steps 32 \
    --learning-rate 2e-5 \
    --max-epoch 10 \
    --adam-eps 1e-08 \
    --weight-decay 0.0 \
    --fp16 \
    --clip-norm 1.0 \
    --empty-cache-freq 128
```

The following script shows training a `TaBERT(K=3)` model with vertical self-attention:

```shell script
mkdir -p data/runs/vertical_tabert

python train.py \
    --task vertical_attention \
    --data-dir data/train_data/vertical_tabert \
    --output-dir data/runs/vertical_tabert \
    --table-bert-extra-config '{"base_model_name": "bert-base-uncased", "num_vertical_attention_heads": 6, "num_vertical_layers": 3, "predict_cell_tokens": true}' \
    --train-batch-size 8 \
    --gradient-accumulation-steps 64 \
    --learning-rate 4e-5 \
    --max-epoch 10 \
    --adam-eps 1e-08 \
    --weight-decay 0.01 \
    --fp16 \
    --clip-norm 1.0 \
    --empty-cache-freq 128
```

Distributed training with multiple GPUs uses a setup similar to [XLM](https://github.com/facebookresearch/XLM).

## Reference

If you plan to use `TaBERT` in your project, please consider citing [our paper](https://arxiv.org/abs/2005.08314):

```
@inproceedings{yin20acl,
    title = {Ta{BERT}: Pretraining for Joint Understanding of Textual and Tabular Data},
    author = {Pengcheng Yin and Graham Neubig and Wen-tau Yih and Sebastian Riedel},
    booktitle = {Annual Conference of the Association for Computational Linguistics (ACL)},
    month = {July},
    year = {2020}
}
```

## License

TaBERT is CC-BY-NC 4.0 licensed as of now.