# 🚀 Source Code for Disentangled Counterfactual Learning for Physical Commonsense Reasoning (NeurIPS 2023) 🧠

Welcome to the official repository for our NeurIPS 2023 paper! This README will guide you through the setup and execution of our experiments.

## 📄 Paper

- **Main Paper:** [Disentangled Counterfactual Learning for Physical Commonsense Reasoning](https://arxiv.org/pdf/2310.19559)

## 🆕 Updates

- **July 24, 2025:** We have added inference code for Qwen2.5-VL, as well as code for fine-tuning Qwen2.5-VL on the PACS dataset. Additionally, we provide training and inference code for a new baseline using Qwen2.5-VL's visual encoder, and we have included more distance metrics for measuring physical knowledge correlation.
- **February 17, 2025:** We are excited to announce the release of our extended version, **Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning (RDCL)**, now available on arXiv. In this version, we explore scenarios involving missing modalities and introduce a new dataset based on VLM descriptions of the visual information for each object.
  - **Extended Paper:** [Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning](https://arxiv.org/pdf/2502.12425)
  - **Dataset:** [Baidu Netdisk](https://pan.baidu.com/s/1Ei76NNkb1CFt8FJkDJDFMg) (Extraction Code: `v458`)

---

## 📥 Downloading Model Weights

To get started, download the pretrained weights for CLIP and AudioCLIP.

### CLIP

```bash
wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
```

### AudioCLIP

Download `AudioCLIP-Partial-Training.pt` and `bpe_simple_vocab_16e6.txt.gz` from the [AudioCLIP Releases](https://github.com/AndreyGuzhov/AudioCLIP/releases).

After downloading, place all of the files in the `assets` folder. (A quick way to sanity-check the downloads is sketched in the appendix at the end of this README.)

---

## 🛠️ Requirements

We recommend the following environment:

- **Python:** 3.8.10
- **PyTorch:** 1.11.0
- **CUDA:** 11.3

---

## 🏋️‍♂️ Training

### For the NeurIPS 2023 Paper

#### 1. PACS Dataset

```bash
conda activate PACS
python3 train_1.py
```

#### 2. Material Classification Dataset

```bash
python3 train_classify.py
```

### For the arXiv 2025 Paper

#### 1. PACS Dataset (with missing modality)

```bash
conda activate PACS
python3 train_1.py --miss_modal audio
```

#### 2. Material Classification Dataset (with missing modality)

```bash
python3 train_classify.py --miss_modal audio
```

---

## 🔮 Prediction

After training, you can generate predictions on the test set:

```bash
python3 predict.py -model_path PATH_TO_MODEL_WEIGHTS -split test
```

---

## 🧩 Qwen2.5-VL Integration

### 1. Constructing Qwen Inference Data from PACS

```bash
python3 PACS_data/scripts/processing_qwen_inference_multi_video_data.py
```

### 2. Inference on the PACS Test Set with Qwen2.5-VL

```bash
python PACS_inference.py --model_dir Qwen/Qwen2.5-VL-3B-Instruct --tokenizer_dir Qwen/Qwen2.5-VL-3B-Instruct --split test --data_type data
```

### 3. Constructing Qwen Fine-tuning Data from PACS

```bash
python3 PACS_data/scripts/processing_qwen_finetune_data.py
```

### 4. Fine-tuning Qwen2.5-VL on the PACS Train Set

```bash
cd qwen-vl-finetune
sh scripts/sft_PACS.sh
```

> We use 4 V100 GPUs for training. For parameter adjustments, please refer to the [official Qwen2.5-VL codebase](https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-finetune).

---

## 🆕 Baseline Construction with the Qwen2.5-VL Visual Encoder

Since Qwen2.5-VL inference is slow, we first extract and save the video-frame and single-image features from PACS.

### Extract Video Frame Features

```bash
python3 extract_feature.py --model_size 3B
python3 extract_feature.py --model_size 7B
python3 extract_feature.py --model_size 32B
```

### Extract Single Frame Features

```bash
python3 extract_feature_single_image.py --model_size 3B
python3 extract_feature_single_image.py --model_size 7B
python3 extract_feature_single_image.py --model_size 32B
```

### Training with Qwen2.5-VL as the Baseline

```bash
python3 train_qwen_baseline.py --Qwen2_5_Size 3B
```

### Experiments with Different Distance Metrics

```bash
python3 train_qwen_baseline.py --sim_type euclidean
python3 train_qwen_baseline.py --sim_type manhattan
```
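To make the `--sim_type` options concrete, here is a minimal sketch of the two distances applied to a pair of feature batches. The function and tensor names are illustrative, not taken from `train_qwen_baseline.py`; only the metric names `euclidean` and `manhattan` come from the commands above.

```python
import torch

def feature_distance(a: torch.Tensor, b: torch.Tensor, sim_type: str = "euclidean") -> torch.Tensor:
    """Illustrative pairwise distance between two batches of feature vectors.

    a, b: tensors of shape (batch, dim), e.g. cached Qwen2.5-VL visual features.
    sim_type: 'euclidean' or 'manhattan', mirroring the --sim_type flag above.
    """
    if sim_type == "euclidean":
        # L2 distance: sqrt of the sum over dim of (a - b)^2
        return torch.norm(a - b, p=2, dim=-1)
    if sim_type == "manhattan":
        # L1 distance: sum over dim of |a - b|
        return torch.norm(a - b, p=1, dim=-1)
    raise ValueError(f"Unknown sim_type: {sim_type}")

# Example: score how closely two feature batches align under each metric.
x, y = torch.randn(4, 512), torch.randn(4, 512)
print(feature_distance(x, y, "euclidean"))  # shape: (4,)
print(feature_distance(x, y, "manhattan"))  # shape: (4,)
```

Under either metric, smaller values indicate more closely correlated features.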
---

## 🙏 Acknowledgements

This code is adapted from:

- [AudioCLIP](https://github.com/AndreyGuzhov/AudioCLIP)
- [PACS](https://github.com/samuelyu2002/PACS)

We would like to express our gratitude to Andrey Guzhov, Samuel Yu, and all the other contributors to these repositories for their invaluable work. 🙌

---

Feel free to reach out if you have any questions or need further assistance! 🚀

---
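## 🧪 Appendix: Sanity-Checking the Downloaded Weights

As referenced in the download section above, the sketch below loads both checkpoints with plain PyTorch purely to confirm the files are intact. It assumes the files sit in `assets/` and is independent of the repository's own loading code.

```python
import torch

# The OpenAI CLIP checkpoint is distributed as a TorchScript archive,
# so torch.jit.load is enough to verify the download.
clip_model = torch.jit.load("assets/ViT-B-16.pt", map_location="cpu")
print("CLIP checkpoint loaded:", type(clip_model).__name__)

# Assumption: the AudioCLIP release is a plain torch.save checkpoint;
# if the repository loads it differently, treat this only as a file check.
audioclip_ckpt = torch.load("assets/AudioCLIP-Partial-Training.pt", map_location="cpu")
print("AudioCLIP checkpoint loaded:", type(audioclip_ckpt).__name__)
```

If both prints succeed, the files downloaded correctly and you can proceed to training.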