# Awesome-Visual-Transformer

**Repository Path**: pprp/Awesome-Visual-Transformer

## Basic Information

- **Project Name**: Awesome-Visual-Transformer
- **Description**: Collect some papers about transformer with vision. Awesome Transformer with Computer Vision (CV)
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2021-11-22
- **Last Updated**: 2021-11-22

## Categories & Tags

- **Categories**: Uncategorized
- **Tags**: None

## README

# Awesome Visual-Transformer

[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

A collection of Transformer papers for Computer Vision (CV). If you find any overlooked papers, please open an issue or a pull request.

## Papers

### Transformer original paper

- [Attention is All You Need](https://arxiv.org/abs/1706.03762) (NIPS 2017) (a minimal sketch of its scaled dot-product attention follows the Survey section below)

### Technical blog

- [Chinese Blog] A 30,000-character long-form introduction to vision transformers [[Link](https://zhuanlan.zhihu.com/p/308301901)]
- [Chinese Blog] Vision Transformer explained in detail (theory analysis + code walkthrough) [[Link](https://zhuanlan.zhihu.com/p/348593638)]

### Survey

- Transformers in Vision: A Survey [[paper](https://arxiv.org/abs/2101.01169)] - 2021.02.22
- A Survey on Visual Transformer [[paper](https://arxiv.org/abs/2012.12556)] - 2020.1.30
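For quick reference, the operation shared by essentially every paper in this list is the scaled dot-product attention introduced in "Attention is All You Need". Below is a minimal, illustrative PyTorch sketch; the function name, shapes, and toy values are our own assumptions and are not taken from any repository linked here.

```python
# Minimal sketch (illustrative only): scaled dot-product attention as described
# in "Attention is All You Need". Names and shapes are assumptions, not code
# from any repository linked in this list.
import math
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, head_dim); mask broadcastable to the score matrix."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # hide disallowed positions
    weights = scores.softmax(dim=-1)  # attention weights over the keys
    return weights @ v                # weighted sum of the values


if __name__ == "__main__":
    q = k = v = torch.randn(2, 8, 16, 64)  # toy batch: 2 samples, 8 heads, 16 tokens
    print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 8, 16, 64])
```

Multi-head attention, as used by the vision transformers below, runs this operation in parallel over several learned projections of the input and concatenates the results.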
### arXiv papers

- **[NesT]** Aggregating Nested Transformers [[paper](https://arxiv.org/abs/2105.12723)]
- **[TAPG]** Temporal Action Proposal Generation with Transformers [[paper](https://arxiv.org/abs/2105.12043)]
- Boosting Crowd Counting with Transformers [[paper](https://arxiv.org/abs/2105.10926)]
- COTR: Convolution in Transformer Network for End to End Polyp Detection [[paper](https://arxiv.org/abs/2105.10925)]
- **[TransVOD]** End-to-End Video Object Detection with Spatial-Temporal Transformers [[paper](https://arxiv.org/abs/2105.10920)] [[code](https://github.com/SJTU-LuHe/TransVOD)]
- Intriguing Properties of Vision Transformers [[paper](https://arxiv.org/abs/2105.10497)] [[code](https://git.io/Js15X)]
- Combining Transformer Generators with Convolutional Discriminators [[paper](https://arxiv.org/abs/2105.10189)]
- Rethinking the Design Principles of Robust Vision Transformer [[paper](https://arxiv.org/abs/2105.07926)]
- Vision Transformers are Robust Learners [[paper](https://arxiv.org/abs/2105.07581)] [[code](https://git.io/J3VO0)]
- Manipulation Detection in Satellite Images Using Vision Transformer [[paper](https://arxiv.org/abs/2105.06373)]
- **[Segmenter]** Segmenter: Transformer for Semantic Segmentation [[paper](https://arxiv.org/abs/2105.05633)] [[code](https://github.com/rstrudel/segmenter)]
- **[Swin-Unet]** Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [[paper](https://arxiv.org/abs/2105.05537)] [[code](https://github.com/HuCaoFighting/Swin-Unet)]
- Self-Supervised Learning with Swin Transformers [[paper](https://arxiv.org/abs/2105.04553)] [[code](https://github.com/SwinTransformer/Transformer-SSL)]
- **[SCTN]** SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation [[paper](https://arxiv.org/abs/2105.04447)]
- **[RelationTrack]** RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [[paper](https://arxiv.org/abs/2105.04322)]
- **[VGTR]** Visual Grounding with Transformers [[paper](https://arxiv.org/abs/2105.04281)]
- **[PST]** Visual Composite Set Detection Using Part-and-Sum Transformers [[paper](https://arxiv.org/abs/2105.02170)]
- **[TrTr]** TrTr: Visual Tracking with Transformer [[paper](https://arxiv.org/abs/2105.03817)] [[code](https://github.com/tongtybj/TrTr)]
- **[MOTR]** MOTR: End-to-End Multiple-Object Tracking with TRansformer [[paper](https://arxiv.org/abs/2105.03247)] [[code](https://github.com/megvii-model/MOTR)]
- Attention for Image Registration (AiR): an unsupervised Transformer approach [[paper](https://arxiv.org/abs/2105.02282)]
- **[TransHash]** TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval [[paper](https://arxiv.org/abs/2105.01823)]
- **[ISTR]** ISTR: End-to-End Instance Segmentation with Transformers [[paper](https://arxiv.org/abs/2105.00637)] [[code](https://github.com/hujiecpp/ISTR)]
- **[CAT]** CAT: Cross-Attention Transformer for One-Shot Object Detection [[paper](https://arxiv.org/abs/2104.14984)]
- **[CoSformer]** CoSformer: Detecting Co-Salient Object with Transformers [[paper](https://arxiv.org/abs/2104.14729)]
- End-to-End Attention-based Image Captioning [[paper](https://arxiv.org/abs/2104.14721)]
- **[PMTrans]** Pyramid Medical Transformer for Medical Image Segmentation [[paper](https://arxiv.org/abs/2104.14702)]
- **[HandsFormer]** HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction [[paper](https://arxiv.org/abs/2104.14639)]
- **[GasHis-Transformer]** GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification [[paper](https://arxiv.org/abs/2104.14528)]
- Emerging Properties in Self-Supervised Vision Transformers [[paper](https://arxiv.org/abs/2104.14294)]
- **[InTra]** Inpainting Transformer for Anomaly Detection [[paper](https://arxiv.org/abs/2104.13897)]
- **[Twins]** Twins: Revisiting Spatial Attention Design in Vision Transformers [[paper](https://arxiv.org/abs/2104.13840)] [[code](https://github.com/Meituan-AutoML/Twins)]
- **[MLMSPT]** Point Cloud Learning with Transformer [[paper](https://arxiv.org/abs/2104.13636)]
- Medical Transformer: Universal Brain Encoder for 3D MRI Analysis [[paper](https://arxiv.org/abs/2104.13633)]
- **[ConTNet]** ConTNet: Why not use convolution and transformer at the same time? [[paper](https://arxiv.org/abs/2104.13497)] [[code](https://github.com/yan-hao-tian/ConTNet)]
- **[DTNet]** Dual Transformer for Point Cloud Analysis [[paper](https://arxiv.org/abs/2104.13044)]
- Improve Vision Transformers Training by Suppressing Over-smoothing [[paper](https://arxiv.org/abs/2104.12753)] [[code](https://github.com/ChengyueGongR/PatchVisionTransformer)]
- **[Visformer]** Visformer: The Vision-friendly Transformer [[paper](https://arxiv.org/abs/2104.12533)] [[code](https://github.com/danczs/Visformer)]
- Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images [[paper](https://arxiv.org/abs/2104.12137)]
- **[VST]** Visual Saliency Transformer [[paper](https://arxiv.org/abs/2104.12099)]
- **[M3DeTR]** M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers [[paper](https://arxiv.org/abs/2104.11896)] [[code](https://github.com/rayguan97/M3DeTR)]
- **[VidTr]** VidTr: Video Transformer Without Convolutions [[paper](https://arxiv.org/abs/2104.11746)]
- **[Skeletor]** Skeletor: Skeletal Transformers for Robust Body-Pose Estimation [[paper](https://arxiv.org/abs/2104.11712)]
- **[FaceT]** Learning to Cluster Faces via Transformer [[paper](https://arxiv.org/abs/2104.11502)]
- **[MViT]** Multiscale Vision Transformers [[paper](https://arxiv.org/abs/2104.11227)] [[code](https://github.com/facebookresearch/SlowFast)]
- **[VATT]** VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [[paper](https://arxiv.org/abs/2104.11178)]
- **[So-ViT]** So-ViT: Mind Visual Tokens for Vision Transformer [[paper](https://arxiv.org/abs/2104.10935)] [[code](https://github.com/jiangtaoxie/So-ViT)]
- Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [[paper](https://arxiv.org/abs/2104.10858)] [[code](https://github.com/zihangJiang/TokenLabeling)]
- **[TransRPPG]** TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection [[paper](https://arxiv.org/abs/2104.07419)]
- **[VideoGPT]** VideoGPT: Video Generation using VQ-VAE and Transformers [[paper](https://arxiv.org/abs/2104.10157)]
- **[M2TR]** M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [[paper](https://arxiv.org/abs/2104.09770)]
- Transformer Transforms Salient Object Detection and Camouflaged Object Detection [[paper](https://arxiv.org/abs/2104.10127)]
- **[TransCrowd]** TransCrowd: Weakly-Supervised Crowd Counting with Transformer [[paper](https://arxiv.org/abs/2104.09116)] [[code](https://github.com/dk-liang/TransCrowd)]
- **[TransVG]** TransVG: End-to-End Visual Grounding with Transformers [[paper](https://arxiv.org/abs/2104.08541)]
- Visual Transformer Pruning [[paper](https://arxiv.org/abs/2104.08500)]
- Self-supervised Video Retrieval Transformer Network [[paper](https://arxiv.org/abs/2104.07993)]
- Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification [[paper](https://arxiv.org/abs/2104.07235)]
- **[TransGAN]** TransGAN: Two Transformers Can Make One Strong GAN [[paper](https://arxiv.org/abs/2102.07074)] [[code](https://github.com/VITA-Group/TransGAN)]
- Geometry-Free View Synthesis: Transformers and no 3D Priors [[paper](https://arxiv.org/abs/2104.07652)] [[code](https://git.io/JOnwn)]
- **[CoaT]** Co-Scale Conv-Attentional Image Transformers [[paper](https://arxiv.org/abs/2104.06399)] [[code](https://github.com/mlpc-ucsd/CoaT)]
- **[LocalViT]** LocalViT: Bringing Locality to Vision Transformers [[paper](https://arxiv.org/abs/2104.05707)] [[code](https://github.com/ofsoundof/LocalViT)]
- **[ACTOR]** Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [[paper](https://arxiv.org/abs/2104.05670)]
- **[CIT]** Cloth Interactive Transformer for Virtual Try-On [[paper](https://arxiv.org/abs/2104.05519)]
- Handwriting Transformers [[paper](https://arxiv.org/abs/2104.03964)]
- **[SiT]** SiT: Self-supervised vIsion Transformer [[paper](https://arxiv.org/abs/2104.03602)] [[code](https://github.com/Sara-Ahmed/SiT)]
- On the Robustness of Vision Transformers to Adversarial Examples [[paper](https://arxiv.org/abs/2104.02610)]
- An Empirical Study of Training Self-Supervised Visual Transformers [[paper](https://arxiv.org/abs/2104.02057)]
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [[paper](https://arxiv.org/abs/2104.01745)]
- **[AOT-GAN]** Aggregated Contextual Transformations for High-Resolution Image Inpainting [[paper](https://arxiv.org/abs/2104.01431)] [[code](https://github.com/researchmm/AOT-GAN-for-Inpainting)]
- Deepfake Detection Scheme Based on Vision Transformer and Distillation [[paper](https://arxiv.org/abs/2104.01353)]
- **[ATAG]** Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [[paper](https://arxiv.org/pdf/2103.16024)]
- **[LeViT]** LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [[paper](https://arxiv.org/abs/2104.01136)]
- **[TubeR]** TubeR: Tube-Transformer for Action Detection [[paper](https://arxiv.org/abs/2104.00969)]
- **[AAformer]** AAformer: Auto-Aligned Transformer for Person Re-Identification [[paper](https://arxiv.org/abs/2104.00921)]
- **[TFill]** TFill: Image Completion via a Transformer-Based Architecture [[paper](https://arxiv.org/abs/2104.00845)]
- Group-Free 3D Object Detection via Transformers [[paper](https://arxiv.org/abs/2104.00678)] [[code](https://github.com/zeliu98/Group-Free-3D)]
- **[STGT]** Spatial-Temporal Graph Transformer for Multiple Object Tracking [[paper](https://arxiv.org/abs/2104.00194)]
- **[YOGO]** You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module [[paper](https://arxiv.org/abs/2103.09975)] [[code](https://github.com/chenfengxu714/YOGO.git)]
- Going deeper with Image Transformers [[paper](https://arxiv.org/abs/2103.17239)]
- **[Stark]** Learning Spatio-Temporal Transformer for Visual Tracking [[paper](https://arxiv.org/abs/2103.17154)] [[code](https://github.com/researchmm/Stark)]
- **[Meta-DETR]** Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning [[paper](https://arxiv.org/abs/2103.11731)] [[code](https://github.com/ZhangGongjie/Meta-DETR)]
- **[DA-DETR]** DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention [[paper](https://arxiv.org/abs/2103.17084)]
- Robust Facial Expression Recognition with Convolutional Visual Transformers [[paper](https://arxiv.org/abs/2103.16854)]
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [[paper](https://arxiv.org/abs/2103.16553)]
- Spatiotemporal Transformer for Video-based Person Re-identification [[paper](https://arxiv.org/abs/2103.16469)]
- **[PiT]** Rethinking Spatial Dimensions of Vision Transformers [[paper](https://arxiv.org/abs/2103.16302)] [[code](https://github.com/naver-ai/pit)]
- **[TransUNet]** TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [[paper](https://arxiv.org/abs/2102.04306)] [[code](https://github.com/Beckschen/TransUNet)]
- **[CvT]** CvT: Introducing Convolutions to Vision Transformers [[paper](https://arxiv.org/abs/2103.15808)] [[code](https://github.com/leoxiaobin/CvT)]
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [[paper](https://arxiv.org/abs/2103.15358)]
- **[TFPose]** TFPose: Direct Human Pose Estimation with Transformers [[paper](https://arxiv.org/abs/2103.15320)]
- **[TransCenter]** TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [[paper](https://arxiv.org/abs/2103.15145)]
- **[ViViT]** ViViT: A Video Vision Transformer [[paper](https://arxiv.org/abs/2103.15691)]
- **[CrossViT]** CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [[paper](https://arxiv.org/abs/2103.14899)]
- **[TS-CAM]** TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [[paper](https://arxiv.org/abs/2103.14862)]
- Face Transformer for Recognition [[paper](https://arxiv.org/abs/2103.14803)]
- On the Adversarial Robustness of Visual Transformers [[paper](https://arxiv.org/abs/2103.15670)]
- Understanding Robustness of Transformers for Image Classification [[paper](https://arxiv.org/abs/2103.14586)]
- Lifting Transformer for 3D Human Pose Estimation in Video [[paper](https://arxiv.org/abs/2103.14304)]
- **[GSA-Net]** Global Self-Attention Networks for Image Recognition [[paper](https://arxiv.org/abs/2010.03019)]
- High-Fidelity Pluralistic Image Completion with Transformers [[paper](https://arxiv.org/abs/2103.14031)] [[code](http://raywzy.com/ICT)]
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [[paper](https://arxiv.org/abs/2103.14030)] [[code](https://github.com/microsoft/Swin-Transformer)]
- **[DPT]** Vision Transformers for Dense Prediction [[paper](https://arxiv.org/abs/2103.13413)] [[code](https://github.com/intel-isl/DPT)]
- **[TransFG]** TransFG: A Transformer Architecture for Fine-grained Recognition [[paper](https://arxiv.org/abs/2103.07976)]
- **[TimeSformer]** Is Space-Time Attention All You Need for Video Understanding? [[paper](https://arxiv.org/abs/2102.05095)]
- Multi-view 3D Reconstruction with Transformer [[paper](https://arxiv.org/abs/2103.12957)]
- Can Vision Transformers Learn without Natural Images? [[paper](https://arxiv.org/abs/2103.13023)] [[code](https://hirokatsukataoka16.github.io/Vision-Transformers-without-Natural-Images/)]
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [[paper](https://arxiv.org/abs/2103.12091)] [[code](https://github.com/ygjwd12345/TransDepth)]
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [[paper](https://arxiv.org/abs/2103.12115)]
- Instance-level Image Retrieval using Reranking Transformers [[paper](https://arxiv.org/abs/2103.12236)]
- **[BossNAS]** BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search [[paper](https://arxiv.org/abs/2103.12424)] [[code](https://github.com/changlin31/BossNAS)]
- **[CeiT]** Incorporating Convolution Designs into Visual Transformers [[paper](https://arxiv.org/abs/2103.11816)]
- **[DeepViT]** DeepViT: Towards Deeper Vision Transformer [[paper](https://arxiv.org/abs/2103.11886)]
- **[TNT]** Transformer in Transformer [[paper](https://arxiv.org/abs/2103.00112)] [[code](https://github.com/huawei-noah/noah-research/tree/master/TNT)]
- Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training [[paper](https://arxiv.org/abs/2103.10043)]
- 3D Human Pose Estimation with Spatial and Temporal Transformers [[paper](https://arxiv.org/abs/2103.10455)] [[code](https://github.com/zczcwh/PoseFormer)]
- **[SUNETR]** SUNETR: Transformers for 3D Medical Image Segmentation [[paper](https://arxiv.org/abs/2103.10504)]
- Scalable Visual Transformers with Hierarchical Pooling [[paper](https://arxiv.org/abs/2103.10619)]
- **[ConViT]** ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [[paper](https://arxiv.org/abs/2103.10697)]
- **[TransMed]** TransMed: Transformers Advance Multi-modal Medical Image Classification [[paper](https://arxiv.org/abs/2103.05940)]
- **[U-Transformer]** U-Net Transformer: Self and Cross Attention for Medical Image Segmentation [[paper](https://arxiv.org/abs/2103.06104)]
- **[SpecTr]** SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation [[paper](https://arxiv.org/abs/2103.03604)] [[code](https://github.com/hfut-xc-yun/SpecTr)]
- **[TransBTS]** TransBTS: Multimodal Brain Tumor Segmentation Using Transformer [[paper](https://arxiv.org/abs/2103.04430)] [[code](https://github.com/Wenxuan-1119/TransBTS)]
- **[SSTN]** SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving [[paper](https://arxiv.org/abs/2103.03150)]
- **[GANsformer]** Generative Adversarial Transformers [[paper](https://arxiv.org/abs/2103.01209)] [[code](https://github.com/dorarad/gansformer)]
- **[PVT]** Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [[paper](https://arxiv.org/abs/2102.12122)] [[code](https://github.com/whai362/PVT)]
- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [[paper](https://arxiv.org/abs/2102.10772)] [[code](https://mmf.sh/)]
- **[MedT]** Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [[paper](https://arxiv.org/abs/2102.10662)] [[code](https://github.com/jeya-maria-jose/Medical-Transformer)]
- **[CPVT]** Do We Really Need Explicit Position Encodings for Vision Transformers? [[paper](https://arxiv.org/abs/2102.10882)] [[code](https://github.com/Meituan-AutoML/CPVT)]
- Deepfake Video Detection Using Convolutional Vision Transformer [[paper](https://arxiv.org/abs/2102.11126)]
- Training Vision Transformers for Image Retrieval [[paper](https://arxiv.org/abs/2102.05644)]
- **[TransReID]** TransReID: Transformer-based Object Re-Identification [[paper](https://arxiv.org/abs/2102.04378)]
- **[VTN]** Video Transformer Network [[paper](https://arxiv.org/abs/2102.00719)]
- **[T2T-ViT]** Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [[paper](https://arxiv.org/abs/2101.11986)] [[code](https://github.com/yitu-opensource/T2T-ViT)]
- **[BoTNet]** Bottleneck Transformers for Visual Recognition [[paper](https://arxiv.org/abs/2101.11605)]
- **[CPTR]** CPTR: Full Transformer Network for Image Captioning [[paper](https://arxiv.org/abs/2101.10804)]
- Learn to Dance with AIST++: Music Conditioned 3D Dance Generation [[paper](https://arxiv.org/abs/2101.08779)] [[code](https://google.github.io/aichoreographer/)]
- **[Trans2Seg]** Segmenting Transparent Object in the Wild with Transformer [[paper](https://arxiv.org/abs/2101.08461)] [[code](https://github.com/xieenze/Trans2Seg)]
- **[SMCA]** Fast Convergence of DETR with Spatially Modulated Co-Attention [[paper](https://arxiv.org/abs/2101.07448)]
- Investigating the Vision Transformer Model for Image Retrieval Tasks [[paper](https://arxiv.org/abs/2101.03771)]
- **[Trear]** Trear: Transformer-based RGB-D Egocentric Action Recognition [[paper](https://arxiv.org/abs/2101.03904)]
- **[VisualSparta]** VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [[paper](https://arxiv.org/abs/2101.00265)]
- **[TrackFormer]** TrackFormer: Multi-Object Tracking with Transformers [[paper](https://arxiv.org/abs/2101.02702)]
- **[LETR]** Line Segment Detection Using Transformers without Edges [[paper](https://arxiv.org/abs/2101.01909)]
- **[TAPE]** Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry [[paper](https://arxiv.org/abs/2101.02143)]
- **[TRIQ]** Transformer for Image Quality Assessment [[paper](https://arxiv.org/abs/2101.01097)] [[code](https://github.com/junyongyou/triq)]
- **[TransTrack]** TransTrack: Multiple-Object Tracking with Transformer [[paper](https://arxiv.org/abs/2012.15460)] [[code](https://github.com/PeizeSun/TransTrack)]
- **[TransPose]** TransPose: Towards Explainable Human Pose Estimation by Transformer [[paper](https://arxiv.org/abs/2012.14214)]
- **[DeiT]** Training data-efficient image transformers & distillation through attention [[paper](https://arxiv.org/abs/2012.12877)] [[code](https://github.com/facebookresearch/deit)]
- **[Pointformer]** 3D Object Detection with Pointformer [[paper](https://arxiv.org/abs/2012.11409)]
- **[ViT-FRCNN]** Toward Transformer-Based Object Detection [[paper](https://arxiv.org/abs/2012.09958)]
- **[Taming-transformers]** Taming Transformers for High-Resolution Image Synthesis [[paper](https://arxiv.org/abs/2012.09841)] [[code](https://compvis.github.io/taming-transformers/)]
- **[SceneFormer]** SceneFormer: Indoor Scene Generation with Transformers [[paper](https://arxiv.org/abs/2012.09793)]
- **[PCT]** PCT: Point Cloud Transformer [[paper](https://arxiv.org/abs/2012.09688)]
- **[METRO]** End-to-End Human Pose and Mesh Reconstruction with Transformers [[paper](https://arxiv.org/abs/2012.09760)]
- **[PointTransformer]** Point Transformer [[paper](https://arxiv.org/abs/2012.09164)]
- **[PED]** DETR for Pedestrian Detection [[paper](https://arxiv.org/abs/2012.06785)]
- **[C-Tran]** General Multi-label Image Classification with Transformers [[paper](https://arxiv.org/abs/2011.14027)]
- **[TSP-FCOS]** Rethinking Transformer-based Set Prediction for Object Detection [[paper](https://arxiv.org/abs/2011.10881)]
- **[ACT]** End-to-End Object Detection with Adaptive Clustering Transformer [[paper](https://arxiv.org/abs/2011.09315)]
- **[STTR]** Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers [[paper](https://arxiv.org/abs/2011.02910v2)] [[code](https://github.com/mli0603/stereo-transformer)]
- **[VTs]** Visual Transformers: Token-based Image Representation and Processing for Computer Vision [[paper](https://arxiv.org/abs/2006.03677)]
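Most of the image backbones in the list above (for example DeiT, T2T-ViT, PVT, and Swin Transformer) share the same front end popularized by ViT (listed under 2021 below): the image is split into fixed-size patches, each patch is linearly projected to a token, a positional embedding is added, and the token sequence is fed to a standard transformer encoder. The following is a minimal, hedged PyTorch sketch of that recipe; the class name, hyperparameters, and shapes are our own assumptions and not taken from any repository linked here.

```python
# Minimal sketch (illustrative only) of the ViT-style patch-embedding recipe
# shared by many of the backbones above. Hyperparameters and names are assumptions.
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    def __init__(self, img_size=32, patch_size=8, dim=64, depth=2, heads=4, num_classes=10):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patchify + linear projection in a single strided convolution.
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                        # x: (B, 3, H, W)
        tokens = self.to_tokens(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)           # one [CLS] token per sample
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                           # classify from the [CLS] token


if __name__ == "__main__":
    logits = TinyViT()(torch.randn(2, 3, 32, 32))                # toy batch of 2 images
    print(logits.shape)                                          # torch.Size([2, 10])
```

The papers below mostly vary this template: how patches are formed (T2T-ViT, Swin), how attention is localized or made hierarchical (PVT, Twins), or how convolutions are mixed back in (CvT, LeViT).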
### 2021

- **[NDT-Transformer]** NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation (**ICRA**) [[paper](https://arxiv.org/abs/2103.12292)]
- VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization (**ISIE**) [[paper](https://arxiv.org/abs/2104.10036)]
- Medical Image Segmentation using Squeeze-and-Expansion Transformers (**IJCAI**) [[paper](https://arxiv.org/abs/2105.09511)]
- Vision Transformer for Fast and Efficient Scene Text Recognition (**ICDAR**) [[paper](https://arxiv.org/abs/2105.08582)]
- **[HOTR]** HOTR: End-to-End Human-Object Interaction Detection with Transformers (**CVPR oral**) [[paper](https://arxiv.org/abs/2104.13682)]
- High-Resolution Complex Scene Synthesis with Transformers (**CVPRW**) [[paper](https://arxiv.org/abs/2105.06458)]
- **[TransFuser]** Multi-Modal Fusion Transformer for End-to-End Autonomous Driving (**CVPR**) [[paper](https://arxiv.org/abs/2104.09224)] [[code](https://github.com/autonomousvision/transfuser)]
- Pose Recognition with Cascade Transformers (**CVPR**) [[paper](https://arxiv.org/abs/2104.06976)]
- Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning (**CVPR**) [[paper](https://arxiv.org/abs/2104.03135)]
- **[LoFTR]** LoFTR: Detector-Free Local Feature Matching with Transformers (**CVPR**) [[paper](https://arxiv.org/abs/2104.00680)] [[code](https://zju3dv.github.io/loftr/)]
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers (**CVPR**) [[paper](https://arxiv.org/abs/2103.16553)]
- **[SETR]** Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers (**CVPR**) [[paper](https://arxiv.org/abs/2012.15840)] [[code](https://fudan-zvg.github.io/SETR/)]
- **[TransT]** Transformer Tracking (**CVPR**) [[paper](https://arxiv.org/abs/2103.15436)] [[code](https://github.com/chenxin-dlut/TransT)]
- Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking (**CVPR oral**) [[paper](https://arxiv.org/abs/2103.11681)]
- **[VisTR]** End-to-End Video Instance Segmentation with Transformers (**CVPR**) [[paper](https://arxiv.org/abs/2011.14503)]
- Transformer Interpretability Beyond Attention Visualization (**CVPR**) [[paper](https://arxiv.org/abs/2012.09838)] [[code](https://github.com/hila-chefer/Transformer-Explainability)]
- **[IPT]** Pre-Trained Image Processing Transformer (**CVPR**) [[paper](https://arxiv.org/abs/2012.00364)]
- **[UP-DETR]** UP-DETR: Unsupervised Pre-training for Object Detection with Transformers (**CVPR**) [[paper](https://arxiv.org/abs/2011.09094)]
- **[VTNet]** VTNet: Visual Transformer Network for Object Goal Navigation (**ICLR**) [[paper](https://arxiv.org/abs/2105.09447)]
- **[Vision Transformer]** An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (**ICLR**) [[paper](https://arxiv.org/abs/2010.11929)] [[code](https://github.com/google-research/vision_transformer)]
- **[Deformable DETR]** Deformable DETR: Deformable Transformers for End-to-End Object Detection (**ICLR**) [[paper](https://arxiv.org/abs/2010.04159)] [[code](https://github.com/fundamentalvision/Deformable-DETR)]
- **[LAMBDANETWORKS]** Modeling Long-Range Interactions Without Attention (**ICLR**) [[paper](https://openreview.net/pdf?id=xTJEN-ggl1b)] [[code](https://github.com/lucidrains/lambda-networks)]
- **[LSTR]** End-to-end Lane Shape Prediction with Transformers (**WACV**) [[paper](https://arxiv.org/abs/2011.04233)] [[code](https://github.com/liuruijin17/LSTR)]

### 2020

- **[DETR]** End-to-End Object Detection with Transformers (**ECCV**) [[paper](https://arxiv.org/abs/2005.12872)] [[code](https://github.com/facebookresearch/detr)]
- **[FPT]** Feature Pyramid Transformer (**ECCV**) [[paper](https://arxiv.org/abs/2007.09451)] [[code](https://github.com/ZHANGDONG-NJUST/FPT)]
- **[TTSR]** Learning Texture Transformer Network for Image Super-Resolution (**CVPR**) [[paper](https://arxiv.org/abs/2006.04139)] [[code](https://github.com/researchmm/TTSR)]
- **[STTN]** Learning Joint Spatial-Temporal Transformations for Video Inpainting (**ECCV**) [[paper](https://arxiv.org/abs/2007.10247)] [[code](https://github.com/researchmm/STTN)]

### Acknowledgement

Thanks to [Awesome-Crowd-Counting](https://github.com/gjy3035/Awesome-Crowd-Counting) for the template.