# LexMAE **Repository Path**: Lauren_123/LexMAE ## Basic Information - **Project Name**: LexMAE - **Description**: No description available - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2023-11-20 - **Last Updated**: 2023-11-20 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # LexMAE Pre-training and Fine-tuning - This is a python implementation with PyTorch for our paper [**LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval**](https://openreview.net/forum?id=SHD0Dc1M5r) - This is an early release version and would have some bugs caused by code reorganizations. ## TODO - [] Fine-tuned Models - [] Pre-trained Models ## Env ### Python Please use conda env. ``` conda create -n lexmae python=3.7 conda activate lexmae conda install pytorch=1.9.1 torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia pip install transformers==4.11.3 pip install gpustat ipython jupyter datasets accelerate sklearn tensorboard nltk pytrec_eval conda install -c conda-forge faiss-gpu pip install prettytable gradio pip install setuptools==59.5.0 ``` To use mixed precision, please `export MIXED_PRECISION=fp16` ### Anserini Install Anserini for Sparse Lexcon-based Retrieval. ``` # I personally recommend to install java and maven in Anaconda for Anserini conda install -c conda-forge openjdk=11 maven # Please install anserini-0.14.0 As we only tested on ths version. Just follow https://github.com/castorini/anserini/tree/anserini-0.14.0#getting-started ``` We will involke aserini from python script for lexicon-based retrieval. Suppose we have env var $ANSERINI_PATH to `.../anserini` ## Data Preparation ### Overview of Data Dir ``` $DATA_DIR/ --msmarco/ ----collection.tsv ----collection.tsv.title.tsv (titles, copied from https://github.com/texttron/tevatron) ----passage_ranking/ ------train.query.txt [502939 lines] ------qrels.train.tsv [532761 lines] ------train.negatives.tsv [400782 lines] (BM25 negatives, copied from tevatron) ------dev.query.txt [6980 lines] ------qrels.dev.tsv [7437 lines] ------top1000.dev [6668967 lines] ------test2019.query.txt [200 lines] ------qrels.test2019.tsv [9260 lines] ------top1000.test2019 [189877 lines] ``` If not specified, please download the file from official, https://microsoft.github.io/msmarco/Datasets, and then rename it accordingly. ## Pre-training ``` LEXMAE_DIR="/TO-DIR-PATH-SAVING-LEXMAE" DATA_DIR="/TO-DIR-PATH-WITH-DATA-FILES" DATA_NAME="msmarco/fullpsgs.tsv" export MIXED_PRECISION=fp16 INIT_MODEL_DIR="bert-base-uncased --encoder bert"; NUM_EPOCH="20"; BATCH="2048"; LR="3e-4"; SEED="42"; DATA_P="0.30"; DEC_P="0.50"; PARAMS=""; NAME=""; python3 -m torch.distributed.run --nproc_per_node=8 \ -m script_lexmae \ --do_train \ --model_name_or_path ${INIT_MODEL_DIR} \ --warmup_proportion 0.05 --weight_decay 0.01 --max_grad_norm 0. \ --data_dir $DATA_DIR \ --data_rel_paths msmarco/fullpsgs.tsv --max_length 144 \ --output_dir ${LEXMAE_DIR} \ --logging_steps 100 --eval_batch_size 48 \ --eval_steps -1 --dev_key_metric none \ --data_load_type disk --num_proc 4 --seed ${SEED} \ --gradient_accumulation_steps 1 --learning_rate ${LR} --num_train_epochs ${NUM_EPOCH} --train_batch_size ${BATCH} \ --data_mlm_prob ${DATA_P} --enc_mlm_prob 0.00 \ --dec_mlm_prob ${DEC_P} ${PARAMS} --tensorboard_steps 100 ``` Note that, an extra step is required to get `msmarco/fullpsgs.tsv`. That is combining `msmarco/collection.tsv.title.tsv` and `msmarco/collection.tsv` with columns of PID, TITLE and PASSAGE. ## Fine-tuning First, set env variables: ``` export MIXED_PRECISION=fp16 export CUDA_VISIBLE_DEVICES=0 ``` ### Stage 1 #### Fine-tuning ``` STG1_MODEL_DIR="/SET-A-DIR-NAME" python3 \ -m proj_sparse.train_splade_retriever \ --do_train \ --encoder ${ENCODER_TYPE} --model_name_or_path ${LEXMAE_DIR} \ --max_length 144 ${MODEL_PARAMS} \ --warmup_proportion 0.05 --weight_decay 0.01 --max_grad_norm 1. \ --data_dir $DATA_DIR --overwrite_output_dir \ --output_dir $STG1_MODEL_DIR \ --logging_steps 100 --num_dev 6980 --eval_batch_size 48 \ --data_load_type memory --num_proc 3 --gradient_accumulation_steps 1 \ --eval_steps 10000 --seed ${SEED} --lambda_d 0.${LMD} --lambda_ratio ${LMD_R} \ --learning_rate ${LR} --num_train_epochs ${NUM_EPOCH} --train_batch_size ${BATCH} \ --negs_sources official --num_negs_per_system ${NUM_SYS} --num_negatives ${NUM_NEG} \ --tensorboard_steps 100 --do_xentropy ENCODER_TYPE="bert"; MODEL_PARAMS=""; LMD="0020"; LMD_R="0.75"; NUM_SYS="1000"; NUM_NEG="15"; NUM_EPOCH="3"; BATCH="24"; LR="2e-5"; SEED="42"; ``` #### Eval ``` python3 \ -m proj_sparse.train_splade_retriever \ --do_prediction \ --model_name_or_path $STG1_MODEL_DIR \ --seed 42 --anserini_path $ANSERINI_PATH \ --data_dir $DATA_DIR --overwrite_output_dir \ --output_dir ${STG1_MODEL_DIR}-EVAL \ --data_load_type disk --num_proc 3 --max_length 144 --eval_batch_size 160 \ --hits_num 1000 --encoder bert ``` #### Hard Negative Mining ``` python3 \ -m proj_sparse.train_splade_retriever \ --do_hn_gen \ --model_name_or_path $STG1_MODEL_DIR \ --seed 42 --anserini_path $ANSERINI_PATH \ --data_dir $DATA_DIR --overwrite_output_dir \ --output_dir ${STG1_MODEL_DIR}-EVAL \ --data_load_type disk --num_proc 3 --max_length 144 --eval_batch_size 160 \ --hn_gen_num 512 --encoder bert ``` ### Stage 2 #### Fine-tuning ``` STG2_MODEL_DIR="/SET-A-DIR-NAME" STAGE1_HN="${STG1_MODEL_DIR}-EVAL/sparse_retrieval/qid2negatives.pkl"; MODEL_PARAMS=""; LMD="0080"; LMD_R="0.75"; NUM_SYS="200"; NUM_NEG="15"; NUM_EPOCH="3"; BATCH="24"; LR="2e-5"; SEED="42"; python3 \ -m proj_sparse.train_splade_retriever \ --do_train \ --encoder ${ENCODER_TYPE} --model_name_or_path ${LEXMAE_DIR} \ --max_length 144 ${MODEL_PARAMS} \ --warmup_proportion 0.05 --weight_decay 0.01 --max_grad_norm 1. \ --data_dir $DATA_DIR --overwrite_output_dir \ --output_dir ${STG2_MODEL_DIR} \ --logging_steps 100 --num_dev 6980 --eval_batch_size 48 \ --data_load_type memory --num_proc 3 --gradient_accumulation_steps 1 \ --eval_steps 10000 --seed ${SEED} --lambda_d 0.${LMD} --lambda_ratio ${LMD_R} \ --learning_rate ${LR} --num_train_epochs ${NUM_EPOCH} --train_batch_size ${BATCH} \ --negs_sources custom --negs_source_paths ${STAGE1_HN} --num_negs_per_system ${NUM_SYS} --num_negatives ${NUM_NEG} \ --tensorboard_steps 100 --do_xentropy ``` #### Eval ``` python3 \ -m proj_sparse.train_splade_retriever \ --do_prediction \ --model_name_or_path $STG2_MODEL_DIR \ --seed 42 --anserini_path $ANSERINI_PATH \ --data_dir $DATA_DIR --overwrite_output_dir \ --output_dir ${STG2_MODEL_DIR}-EVAL \ --data_load_type disk --num_proc 3 --max_length 144 --eval_batch_size 160 \ --hits_num 1000 --encoder bert ``` #### Hard Negative Mining ``` python3 \ -m proj_sparse.train_splade_retriever \ --do_hn_gen \ --model_name_or_path $STG2_MODEL_DIR \ --seed 42 --anserini_path $ANSERINI_PATH \ --data_dir $DATA_DIR --overwrite_output_dir \ --output_dir ${STG2_MODEL_DIR}-EVAL \ --data_load_type disk --num_proc 3 --max_length 144 --eval_batch_size 160 \ --hn_gen_num 512 --encoder bert ``` ### Stage 3 #### Fine-tuning ``` STG3_MODEL_DIR="/SET-A-DIR-NAME" RANKER_DIR="/PATH-TO-DIR" STAGE2_HN="${STG2_MODEL_DIR}-EVAL/sparse_retrieval/qid2negatives.pkl"; MODEL_PARAMS="--distill_reranker ${RANKER_DIR} --xentropy_sparse_loss_weight 0.2"; LMD="0080"; LMD_R="0.75"; NUM_SYS="1000"; NUM_NEG="23"; NUM_EPOCH="3"; BATCH="16"; LR="2e-5"; SEED="42"; python3 \ -m proj_sparse.train_splade_retriever \ --do_train \ --encoder ${ENCODER_TYPE} --model_name_or_path ${INIT_MODEL_DIR} \ --max_length 144 ${MODEL_PARAMS} \ --warmup_proportion 0.05 --weight_decay 0.01 --max_grad_norm 1. \ --data_dir $DATA_DIR --overwrite_output_dir \ --output_dir ${STG3_MODEL_DIR} \ --logging_steps 100 --num_dev 6980 --eval_batch_size 48 \ --data_load_type memory --num_proc 3 --gradient_accumulation_steps 1 \ --eval_steps 10000 --seed ${SEED} --lambda_d 0.${LMD} --lambda_ratio ${LMD_R} \ --learning_rate ${LR} --num_train_epochs ${NUM_EPOCH} --train_batch_size ${BATCH} \ --negs_sources custom --negs_source_paths ${STAGE2_HN} --num_negs_per_system ${NUM_SYS} --num_negatives ${NUM_NEG} \ --tensorboard_steps 100 --do_xentropy ``` #### Eval ``` python3 \ -m proj_sparse.train_splade_retriever \ --do_prediction \ --model_name_or_path $STG3_MODEL_DIR \ --seed 42 --anserini_path $ANSERINI_PATH \ --data_dir $DATA_DIR --overwrite_output_dir \ --output_dir ${STG3_MODEL_DIR}-EVAL \ --data_load_type disk --num_proc 3 --max_length 144 --eval_batch_size 160 \ --hits_num 1000 --encoder bert ```