# intermediate-training-using-clustering **Repository Path**: mirrors_ibm/intermediate-training-using-clustering ## Basic Information - **Project Name**: intermediate-training-using-clustering - **Description**: code for the paper "Cluster & Tune: Boost Cold Start Performance in Text Classification" for ACL2022 - **Primary Language**: Unknown - **License**: Apache-2.0 - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2022-03-11 - **Last Updated**: 2025-11-23 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Intermediate Training Using Clustering Code to reproduce the BERT intermediate training experiments from [Shnarch et al. (2022)](#reference). Using this repository you can: (1) Download the datasets used in the paper; (2) Run intermediate training that relies on pseudo-labels from the results of the [sIB](https://github.com/IBM/sib) clustering algorithm; (3) Fine-tune a BERT classifier starting from the default pretrained model (bert-base-uncased) and from the model after intermediate training; (4) Compare the the BERT classification performance with and without the intermediate training stage. **Table of contents** [Installation](#installation) [Running an experiment](#running-an-experiment) [Plotting the results](#plotting-the-results) [Reference](#reference) [License](#license) ## Installation The framework requires Python 3.8 1. Clone the repository locally: `git clone https://github.com/IBM/intermediate-training-using-clustering` 2. Go to the cloned directory `cd intermediate-training-using-clustering` 4. Install the project dependencies: `pip install -r requirements.txt` Windows users may also need to download the latest [Microsoft Visual C++ Redistributable for Visual Studio](https://support.microsoft.com/en-us/help/2977003/the-latest-supported-visual-c-downloads) in order to support tensorflow 3. Run the python script `python download_and_process_datasets.py`. This script downloads and processes 8 datasets used in the paper. ## Running an experiment The experiment script `run_experiment.py` requires 6 arguments: - `train_file`: path to the train data (e.g. datasets/isear/train.csv). - `eval_file`: path to the evaluation data (e.g. datasets/isear/test.csv). - `num_clusters`: number of clusters used to generate the task pseudo labels. Defaults to 50 (as used in the paper) - `labeling_budget`: number of examples from the train data used for BERT fine-tuning (in the paper we tested the following budgets: 64, 128, 192, 256, 384, 512, 768, 1024) - `random_seed`: used for sampling the train data and for model training - `inter_training_epochs`: number of epochs for the intermediate task. Defaults to 1 (as used in the paper) - `finetuning_epochs`: number of epochs for fine-tuning BERT over `labeling_budget` examples. Defaults to 10 (as used in the paper) For example: ```python run_experiment.py --train_file datasets/yahoo_answers/train.csv --eval_file datasets/yahoo_answers/test.csv --num_clusters 50 --labeling_budget 64 --finetuning_epochs 10 --inter_training_epochs 1 --random_seed 0``` The results of the experimental run (accuracy for BERT with and without the intermediate task over the `eval_file`) are written both to the screen, and to `output/results.csv`. Multiple experiments can safely write in parallel to the same `output/results.csv` file - each new result is appended to the file. In addition, for every new result, an aggregation of all the results so far is written to `output/aggregated_results.csv`. This aggregation reflects the mean of all runs for each experimental setting (i.e. with/without intermediate training) for a particular eval_file and labeling budget. ## Plotting the results In order to show the effect of the intermediate task in different labeling budgets, run `python plot.py`. This script generates plots under `output/plots` for each dataset. For example: ![Alt text](example_plot.png?raw=true "Output image of plot.py after running 5 seeds over 8 labeling budgets for dbpedia") ## Reference Eyal Shnarch, Ariel Gera, Alon Halfon, Lena Dankin, Leshem Choshen, Ranit Aharonov and Noam Slonim (2022). [Cluster & Tune: Boost Cold Start Performance in Text Classification](https://aclanthology.org/2022.acl-long.526/). ACL 2022 Please cite: ``` @inproceedings{shnarch-etal-2022-cluster, title = "Cluster & Tune: Boost Cold Start Performance in Text Classification", author = "Shnarch, Eyal and Gera, Ariel and Halfon, Alon and Dankin, Lena and Choshen, Leshem and Aharonov, Ranit and Slonim, Noam", booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", month = may, year = "2022", address = "Dublin, Ireland", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.acl-long.526", pages = "7639--7653", } ``` ## License This work is released under the Apache 2.0 license. The full text of the license can be found in [LICENSE](LICENSE).