# ClearCLIP

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Mengcheng Lan<sup>1</sup>, Chaofeng Chen<sup>1</sup>, Yiping Ke<sup>2</sup>, Xinjiang Wang<sup>3</sup>, Litong Feng<sup>3</sup>, Wayne Zhang<sup>3</sup>

<sup>1</sup>S-Lab, Nanyang Technological University  <sup>2</sup>CCDS, Nanyang Technological University  <sup>3</sup>SenseTime Research
Accepted to ECCV 2024

[arXiv]

## Abstract

> *Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP and identify residual connections as the primary source of noise that degrades segmentation quality. With a comparative analysis of statistical properties in the residual connection and the attention output across different pretrained models, we discover that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. We introduce three simple modifications to the final layer: removing the residual connection, implementing the self-self attention, and discarding the feed-forward network. ClearCLIP consistently generates clearer and more accurate segmentation maps and outperforms existing approaches across multiple benchmarks, affirming the significance of our discoveries.*

## Dependencies and Installation

```
# git clone this repository
git clone https://github.com/mc-lan/ClearCLIP.git
cd ClearCLIP

# create new anaconda env
conda create -n ClearCLIP python=3.10
conda activate ClearCLIP

# install torch and dependencies
pip install -r requirements.txt
```

## Datasets

We include the following dataset configurations in this repo:

1) `With background class`: PASCAL VOC, PASCAL Context, Cityscapes, ADE20k, and COCO-Stuff164k;
2) `Without background class`: VOC20, Context59 (i.e., PASCAL VOC and PASCAL Context without the background category), and COCO-Object.

Please follow the [MMSeg data preparation document](https://github.com/open-mmlab/mmsegmentation/blob/main/docs/en/user_guides/2_dataset_prepare.md) to download and pre-process the datasets.

The COCO-Object dataset can be converted from COCO-Stuff164k by executing the following command:

```
python datasets/cvt_coco_object.py PATH_TO_COCO_STUFF164K -o PATH_TO_COCO164K
```

## Quick Inference

```
python demo.py
```

## Model evaluation

Single-GPU:

```
python eval.py --config ./config/cfg_DATASET.py --workdir YOUR_WORK_DIR
```

Multi-GPU:

```
bash ./dist_test.sh ./config/cfg_DATASET.py
```

Evaluation on all datasets:

```
python eval_all.py
```

Results will be saved in `results.xlsx`. We also provide comparison results on the five datasets without a background class, obtained with our implementation (see the appendix of the paper).
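If you only want to evaluate a subset of datasets, the single-GPU command above can also be wrapped in a small driver script. The sketch below is not the repository's `eval_all.py`; the dataset names and the `work_dirs/` layout are assumptions and should be adjusted to the `cfg_*.py` files actually present in `./config`.

```
# Hedged sketch of a per-dataset evaluation driver (not the repository's eval_all.py).
# It simply re-runs the single-GPU command shown above for each config file.
import subprocess
from pathlib import Path

datasets = ["voc20", "context59", "coco_object", "ade20k", "cityscapes"]  # assumed names

for name in datasets:
    cfg = Path("config") / f"cfg_{name}.py"   # follows the cfg_DATASET.py pattern above
    if not cfg.exists():
        print(f"Skipping {cfg}: config not found")
        continue
    # Same invocation as the "Single-GPU" command in the Model evaluation section.
    subprocess.run(
        ["python", "eval.py", "--config", str(cfg), "--workdir", f"work_dirs/{name}"],
        check=True,
    )
```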
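For readers who want to connect the numbers back to the method, the three final-layer modifications described in the abstract (removing the residual connection, self-self attention, and discarding the feed-forward network) can be sketched in a few lines of PyTorch. The block below is an illustration on a randomly initialized attention layer, not the repository's implementation; `clip_proj` and `text_emb` are placeholder tensors standing in for CLIP's visual projection and text class embeddings, and query-query attention is used as one simple instance of self-self attention.

```
# Minimal sketch (not the repository's code) of the ClearCLIP final-layer decomposition.
import torch
import torch.nn.functional as F

B, N, D, H = 1, 197, 768, 12            # batch, tokens (CLS + 196 patches), width, heads
x = torch.randn(B, N, D)                # tokens entering the *last* transformer block
attn = torch.nn.MultiheadAttention(D, H, batch_first=True)  # stand-in for that block's attention

def split_heads(t):                     # (B, N, D) -> (B, H, N, D // H)
    return t.view(B, N, H, D // H).transpose(1, 2)

with torch.no_grad():
    # Re-use the (pretrained) projections of the last attention layer.
    w_q, w_k, w_v = attn.in_proj_weight.chunk(3)   # keys are not needed for q-q attention
    b_q, b_k, b_v = attn.in_proj_bias.chunk(3)
    q = split_heads(F.linear(x, w_q, b_q))
    v = split_heads(F.linear(x, w_v, b_v))

    # Modifications 1 and 2: self-self (query-query) attention, no residual connection added back.
    scale = (D // H) ** -0.5
    attn_map = torch.softmax(q @ q.transpose(-2, -1) * scale, dim=-1)
    out = (attn_map @ v).transpose(1, 2).reshape(B, N, D)
    out = attn.out_proj(out)            # Modification 3: the block's feed-forward network is skipped.

    # Per-patch logits against text embeddings (placeholders, not CLIP's actual weights).
    clip_proj = torch.randn(D, 512)                            # visual projection (assumed)
    text_emb = F.normalize(torch.randn(20, 512), dim=-1)       # 20 class-name embeddings (assumed)
    patch_emb = F.normalize(out[:, 1:] @ clip_proj, dim=-1)    # drop CLS, keep 196 patch tokens
    logits = patch_emb @ text_emb.t()                          # (1, 196, 20); reshape to 14x14 for a map

print(logits.shape)
```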
## Citation

```
@inproceedings{lan2024clearclip,
  title={Clearclip: Decomposing clip representations for dense vision-language inference},
  author={Lan, Mengcheng and Chen, Chaofeng and Ke, Yiping and Wang, Xinjiang and Feng, Litong and Zhang, Wayne},
  booktitle={European Conference on Computer Vision},
  pages={143--160},
  year={2024},
  organization={Springer}
}
```

## License

This project is licensed under NTU S-Lab License 1.0. Redistribution and use should follow this license.

## Acknowledgement

This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

This implementation is based on [OpenCLIP](https://github.com/mlfoundations/open_clip) and [SCLIP](https://github.com/wangf3014/SCLIP). Thanks for the awesome work.

## Contact

If you have any questions, please feel free to reach out at `lanm0002@e.ntu.edu.sg`.