# segmenter

**Repository Path**: xhzhu-robotic/segmenter

## Basic Information

- **Project Name**: segmenter
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-11-10
- **Last Updated**: 2023-11-10

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Segmenter: Transformer for Semantic Segmentation

![Figure 1 from paper](./overview.png)

[Segmenter: Transformer for Semantic Segmentation](https://arxiv.org/abs/2105.05633) by Robin Strudel*, Ricardo Garcia*, Ivan Laptev and Cordelia Schmid, ICCV 2021.

*Equal Contribution

🔥 **Segmenter is now available on [MMSegmentation](https://github.com/open-mmlab/mmsegmentation/tree/master/configs/segmenter).**

## Installation

Define OS environment variables pointing to your checkpoint and dataset directories and put them in your `.bashrc`:

```sh
export DATASET=/path/to/dataset/dir
```

Install [PyTorch 1.9](https://pytorch.org/), then run `pip install .` at the root of this repository.

To download ADE20K, use the following command:

```sh
python -m segm.scripts.prepare_ade20k $DATASET
```

## Model Zoo

We release models with a Vision Transformer backbone initialized from the [improved ViT](https://arxiv.org/abs/2106.10270) models.

### ADE20K

Segmenter models with ViT backbone:
| Name | mIoU (SS/MS) | # params | Resolution | FPS | Download |
| --- | --- | --- | --- | --- | --- |
| Seg-T-Mask/16 | 38.1 / 38.8 | 7M | 512x512 | 52.4 | model config log |
| Seg-S-Mask/16 | 45.3 / 46.9 | 27M | 512x512 | 34.8 | model config log |
| Seg-B-Mask/16 | 48.5 / 50.0 | 106M | 512x512 | 24.1 | model config log |
| Seg-B/8 | 49.5 / 50.5 | 89M | 512x512 | 4.2 | model config log |
| Seg-L-Mask/16 | 51.8 / 53.6 | 334M | 640x640 | - | model config log |
Segmenter models with DeiT backbone:
| Name | mIoU (SS/MS) | # params | Resolution | FPS | Download |
| --- | --- | --- | --- | --- | --- |
| Seg-B/16 | 47.1 / 48.1 | 87M | 512x512 | 27.3 | model config log |
| Seg-B-Mask/16 | 48.7 / 50.1 | 106M | 512x512 | 24.1 | model config log |
### Pascal Context
| Name | mIoU (SS/MS) | # params | Resolution | FPS | Download |
| --- | --- | --- | --- | --- | --- |
| Seg-L-Mask/16 | 58.1 / 59.0 | 334M | 480x480 | - | model config log |
### Cityscapes
| Name | mIoU (SS/MS) | # params | Resolution | FPS | Download |
| --- | --- | --- | --- | --- | --- |
| Seg-L-Mask/16 | 79.1 / 81.3 | 322M | 768x768 | - | model config log |
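
The released checkpoints are regular PyTorch `.pth` files. As a quick sanity check after downloading, you can load one on CPU and inspect its contents before running inference. This is only an illustrative sketch: the top-level key layout inside the checkpoint (e.g. a `"model"` entry holding the state dict) is an assumption, not a documented format.

```python
# Illustrative only: inspect a downloaded checkpoint on CPU.
# The key names ("model", "optimizer", ...) are assumptions about the layout.
import torch

ckpt = torch.load("seg_tiny_mask/checkpoint.pth", map_location="cpu")

if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
    state_dict = ckpt.get("model", ckpt)   # fall back to the dict itself
else:
    state_dict = ckpt

# Count parameters stored as tensors (skips any non-tensor entries).
n_params = sum(p.numel() for p in state_dict.values() if hasattr(p, "numel"))
print(f"tensors: {len(state_dict)}, parameters: {n_params / 1e6:.1f}M")
```

The parameter count should roughly match the "# params" column of the tables above for the model you downloaded.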
## Inference

Download one checkpoint with its configuration into a common folder, for example `seg_tiny_mask`. You can generate segmentation maps from your own data with:

```sh
python -m segm.inference --model-path seg_tiny_mask/checkpoint.pth -i images/ -o segmaps/
```

To evaluate on ADE20K, run:

```sh
# single-scale evaluation:
python -m segm.eval.miou seg_tiny_mask/checkpoint.pth ade20k --singlescale
# multi-scale evaluation:
python -m segm.eval.miou seg_tiny_mask/checkpoint.pth ade20k --multiscale
```

## Train

Train `Seg-T-Mask/16` on ADE20K on a single GPU:

```sh
python -m segm.train --log-dir seg_tiny_mask --dataset ade20k \
  --backbone vit_tiny_patch16_384 --decoder mask_transformer
```

To train `Seg-B-Mask/16`, simply set `vit_base_patch16_384` as the backbone and launch the above command with a minimum of 4 V100 GPUs (~12 minutes per epoch) and up to 8 V100 GPUs (~7 minutes per epoch). The code uses [SLURM](https://slurm.schedmd.com/documentation.html) environment variables.

## Logs

To plot the logs of your experiments, you can use:

```sh
python -m segm.utils.logs logs.yml
```

where `logs.yml` is located in `utils/` and lists the paths to your experiment logs:

```yaml
root: /path/to/checkpoints/
logs:
  seg-t: seg_tiny_mask/log.txt
  seg-b: seg_base_mask/log.txt
```

## Attention Maps

To visualize the attention maps of the `Seg-T-Mask/16` encoder at layer 0 for patch `(0, 21)`, you can use:

```sh
python -m segm.scripts.show_attn_map seg_tiny_mask/checkpoint.pth \
  images/im0.jpg output_dir/ --layer-id 0 --x-patch 0 --y-patch 21 --enc
```

Different options are provided to select the generated attention maps:

* `--enc` or `--dec`: select encoder or decoder attention maps, respectively.
* `--patch` or `--cls`: `--patch` generates attention maps for the patch with coordinates `(x_patch, y_patch)`. `--cls` combined with `--enc` generates attention maps for the CLS token of the encoder; `--cls` combined with `--dec` generates maps for each class embedding of the decoder.
* `--x-patch` and `--y-patch`: coordinates of the patch to draw attention maps from. These flags are ignored when `--cls` is used.
* `--layer-id`: the layer for which the attention maps are generated.

For example, to generate attention maps for the decoder class embeddings, you can use:

```sh
python -m segm.scripts.show_attn_map seg_tiny_mask/checkpoint.pth \
  images/im0.jpg output_dir/ --layer-id 0 --dec --cls
```

Attention maps for patch `(0, 21)` in `Seg-L-Mask/16` encoder layers 1, 4, 8, 12 and 16:

![Attention maps of patch x=8 and y=21 and encoder layers 1, 4, 8, 12 and 16](./attn_maps_enc.png)

Attention maps for the class embeddings in `Seg-L-Mask/16` decoder layer 0:

![Attention maps of cls tokens 7, 15, 18, 22, 36 and 57 and Mask decoder layer 0](./attn_maps_dec.png)

## Video Segmentation

Zero-shot video segmentation on the [DAVIS](https://davischallenge.org/) video dataset with a Seg-B-Mask/16 model trained on [ADE20K](https://groups.csail.mit.edu/vision/datasets/ADE20K/).
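
Zero-shot video results like these amount to running the image model independently on every frame. A minimal sketch of this, assuming the DAVIS frames have already been extracted into one sub-folder of JPEGs per sequence (that directory layout is an assumption, not part of this repository), simply loops the `segm.inference` entry point over the sequences:

```python
# Sketch: per-frame "video" segmentation by looping the image inference
# command over pre-extracted frame folders. The frames/ and segmaps/
# directory layout is assumed for illustration.
import subprocess
import sys
from pathlib import Path

model_path = "seg_base_mask/checkpoint.pth"   # any trained checkpoint
frames_root = Path("davis/frames")            # one sub-folder per sequence
out_root = Path("davis/segmaps")

for seq_dir in sorted(p for p in frames_root.iterdir() if p.is_dir()):
    out_dir = out_root / seq_dir.name
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [sys.executable, "-m", "segm.inference",
         "--model-path", model_path,
         "-i", str(seq_dir), "-o", str(out_dir)],
        check=True,
    )
```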

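To eyeball the maps written by `segm.inference`, one option is to blend each map over its source image. This is a generic visualization sketch, not part of the repository: it assumes the maps are saved as color images that share their filename with the corresponding input.

```python
# Generic visualization sketch (not part of the repository): overlays each
# predicted segmentation map on its input image. Assumes segmaps/ holds
# color maps named like the files in images/.
from pathlib import Path
from PIL import Image

img_dir, map_dir, out_dir = Path("images"), Path("segmaps"), Path("overlays")
out_dir.mkdir(exist_ok=True)

for img_path in sorted(img_dir.glob("*.jpg")):
    map_path = map_dir / img_path.name
    if not map_path.exists():
        map_path = map_path.with_suffix(".png")  # naming convention is assumed
    if not map_path.exists():
        continue
    image = Image.open(img_path).convert("RGB")
    segmap = Image.open(map_path).convert("RGB").resize(image.size)
    Image.blend(image, segmap, 0.5).save(out_dir / img_path.name)
```
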
## BibTex

```
@article{strudel2021,
  title={Segmenter: Transformer for Semantic Segmentation},
  author={Strudel, Robin and Garcia, Ricardo and Laptev, Ivan and Schmid, Cordelia},
  journal={arXiv preprint arXiv:2105.05633},
  year={2021}
}
```

## Acknowledgements

The Vision Transformer code is based on the [timm](https://github.com/rwightman/pytorch-image-models) library, and the semantic segmentation training and evaluation pipeline uses [mmsegmentation](https://github.com/open-mmlab/mmsegmentation).