# Awesome-Visual-Transformer

**Repository Path**: pprp/Awesome-Visual-Transformer

## Basic Information

- **Project Name**: Awesome-Visual-Transformer
- **Description**: Collect some papers about transformer with vision. Awesome Transformer with Computer Vision (CV)
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2021-11-22
- **Last Updated**: 2021-11-22

## Categories & Tags

- **Categories**: Uncategorized
- **Tags**: None

## README

# Awesome Visual-Transformer

[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

A collection of Transformer papers for Computer Vision (CV). If you find any overlooked papers, please open an issue or a pull request.

## Papers

### Transformer original paper

- [Attention is All You Need](https://arxiv.org/abs/1706.03762) (NIPS 2017) (a minimal sketch of its scaled dot-product attention follows the Survey section below)

### Technical blog

- [Chinese Blog] A 30,000-character long-form introduction to vision transformers [[Link](https://zhuanlan.zhihu.com/p/308301901)]
- [Chinese Blog] Vision Transformer explained in detail (theory analysis + code walkthrough) [[Link](https://zhuanlan.zhihu.com/p/348593638)]

### Survey

- Transformers in Vision: A Survey [[paper](https://arxiv.org/abs/2101.01169)] - 2021.02.22
- A Survey on Visual Transformer [[paper](https://arxiv.org/abs/2012.12556)] - 2020.1.30
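For quick reference, the operation shared by essentially every paper in this list is the scaled dot-product attention introduced in "Attention is All You Need". Below is a minimal, illustrative PyTorch sketch; the function name, shapes, and toy values are our own assumptions and are not taken from any repository linked here.

```python
# Minimal sketch (illustrative only): scaled dot-product attention as described
# in "Attention is All You Need". Names and shapes are assumptions, not code
# from any repository linked in this list.
import math
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, head_dim); mask broadcastable to the score matrix."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # hide disallowed positions
    weights = scores.softmax(dim=-1)  # attention weights over the keys
    return weights @ v                # weighted sum of the values


if __name__ == "__main__":
    q = k = v = torch.randn(2, 8, 16, 64)  # toy batch: 2 samples, 8 heads, 16 tokens
    print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 8, 16, 64])
```

Multi-head attention, as used by the vision transformers below, runs this operation in parallel over several learned projections of the input and concatenates the results.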
### arXiv papers

- **[NesT]** Aggregating Nested Transformers [[paper](https://arxiv.org/abs/2105.12723)]
- **[TAPG]** Temporal Action Proposal Generation with Transformers [[paper](https://arxiv.org/abs/2105.12043)]
- Boosting Crowd Counting with Transformers [[paper](https://arxiv.org/abs/2105.10926)]
- COTR: Convolution in Transformer Network for End to End Polyp Detection [[paper](https://arxiv.org/abs/2105.10925)]
- **[TransVOD]** End-to-End Video Object Detection with Spatial-Temporal Transformers [[paper](https://arxiv.org/abs/2105.10920)] [[code](https://github.com/SJTU-LuHe/TransVOD)]
- Intriguing Properties of Vision Transformers [[paper](https://arxiv.org/abs/2105.10497)] [[code](https://git.io/Js15X)]
- Combining Transformer Generators with Convolutional Discriminators [[paper](https://arxiv.org/abs/2105.10189)]
- Rethinking the Design Principles of Robust Vision Transformer [[paper](https://arxiv.org/abs/2105.07926)]
- Vision Transformers are Robust Learners [[paper](https://arxiv.org/abs/2105.07581)] [[code](https://git.io/J3VO0)]
- Manipulation Detection in Satellite Images Using Vision Transformer [[paper](https://arxiv.org/abs/2105.06373)]
- **[Segmenter]** Segmenter: Transformer for Semantic Segmentation [[paper](https://arxiv.org/abs/2105.05633)] [[code](https://github.com/rstrudel/segmenter)]
- **[Swin-Unet]** Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [[paper](https://arxiv.org/abs/2105.05537)] [[code](https://github.com/HuCaoFighting/Swin-Unet)]
- Self-Supervised Learning with Swin Transformers [[paper](https://arxiv.org/abs/2105.04553)] [[code](https://github.com/SwinTransformer/Transformer-SSL)]
- **[SCTN]** SCTN: Sparse Convolution-Transformer Network for Scene Flow Estimation [[paper](https://arxiv.org/abs/2105.04447)]
- **[RelationTrack]** RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation [[paper](https://arxiv.org/abs/2105.04322)]
- **[VGTR]** Visual Grounding with Transformers [[paper](https://arxiv.org/abs/2105.04281)]
- **[PST]** Visual Composite Set Detection Using Part-and-Sum Transformers [[paper](https://arxiv.org/abs/2105.02170)]
- **[TrTr]** TrTr: Visual Tracking with Transformer [[paper](https://arxiv.org/abs/2105.03817)] [[code](https://github.com/tongtybj/TrTr)]
- **[MOTR]** MOTR: End-to-End Multiple-Object Tracking with TRansformer [[paper](https://arxiv.org/abs/2105.03247)] [[code](https://github.com/megvii-model/MOTR)]
- Attention for Image Registration (AiR): an unsupervised Transformer approach [[paper](https://arxiv.org/abs/2105.02282)]
- **[TransHash]** TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval [[paper](https://arxiv.org/abs/2105.01823)]
- **[ISTR]** ISTR: End-to-End Instance Segmentation with Transformers [[paper](https://arxiv.org/abs/2105.00637)] [[code](https://github.com/hujiecpp/ISTR)]
- **[CAT]** CAT: Cross-Attention Transformer for One-Shot Object Detection [[paper](https://arxiv.org/abs/2104.14984)]
- **[CoSformer]** CoSformer: Detecting Co-Salient Object with Transformers [[paper](https://arxiv.org/abs/2104.14729)]
- End-to-End Attention-based Image Captioning [[paper](https://arxiv.org/abs/2104.14721)]
- **[PMTrans]** Pyramid Medical Transformer for Medical Image Segmentation [[paper](https://arxiv.org/abs/2104.14702)]
- **[HandsFormer]** HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation of Hands and Object in Interaction [[paper](https://arxiv.org/abs/2104.14639)]
- **[GasHis-Transformer]** GasHis-Transformer: A Multi-scale Visual Transformer Approach for Gastric Histopathology Image Classification [[paper](https://arxiv.org/abs/2104.14528)]
- Emerging Properties in Self-Supervised Vision Transformers [[paper](https://arxiv.org/abs/2104.14294)]
- **[InTra]** Inpainting Transformer for Anomaly Detection [[paper](https://arxiv.org/abs/2104.13897)]
- **[Twins]** Twins: Revisiting Spatial Attention Design in Vision Transformers [[paper](https://arxiv.org/abs/2104.13840)] [[code](https://github.com/Meituan-AutoML/Twins)]
- **[MLMSPT]** Point Cloud Learning with Transformer [[paper](https://arxiv.org/abs/2104.13636)]
- Medical Transformer: Universal Brain Encoder for 3D MRI Analysis [[paper](https://arxiv.org/abs/2104.13633)]
- **[ConTNet]** ConTNet: Why not use convolution and transformer at the same time? [[paper](https://arxiv.org/abs/2104.13497)] [[code](https://github.com/yan-hao-tian/ConTNet)]
- **[DTNet]** Dual Transformer for Point Cloud Analysis [[paper](https://arxiv.org/abs/2104.13044)]
- Improve Vision Transformers Training by Suppressing Over-smoothing [[paper](https://arxiv.org/abs/2104.12753)] [[code](https://github.com/ChengyueGongR/PatchVisionTransformer)]
- **[Visformer]** Visformer: The Vision-friendly Transformer [[paper](https://arxiv.org/abs/2104.12533)] [[code](https://github.com/danczs/Visformer)]
- Transformer Meets DCFAM: A Novel Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images [[paper](https://arxiv.org/abs/2104.12137)]
- **[VST]** Visual Saliency Transformer [[paper](https://arxiv.org/abs/2104.12099)]
- **[M3DeTR]** M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers [[paper](https://arxiv.org/abs/2104.11896)] [[code](https://github.com/rayguan97/M3DeTR)]
- **[VidTr]** VidTr: Video Transformer Without Convolutions [[paper](https://arxiv.org/abs/2104.11746)]
- **[Skeletor]** Skeletor: Skeletal Transformers for Robust Body-Pose Estimation [[paper](https://arxiv.org/abs/2104.11712)]
- **[FaceT]** Learning to Cluster Faces via Transformer [[paper](https://arxiv.org/abs/2104.11502)]
- **[MViT]** Multiscale Vision Transformers [[paper](https://arxiv.org/abs/2104.11227)] [[code](https://github.com/facebookresearch/SlowFast)]
- **[VATT]** VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [[paper](https://arxiv.org/abs/2104.11178)]
- **[So-ViT]** So-ViT: Mind Visual Tokens for Vision Transformer [[paper](https://arxiv.org/abs/2104.10935)] [[code](https://github.com/jiangtaoxie/So-ViT)]
- Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [[paper](https://arxiv.org/abs/2104.10858)] [[code](https://github.com/zihangJiang/TokenLabeling)]
- **[TransRPPG]** TransRPPG: Remote Photoplethysmography Transformer for 3D Mask Face Presentation Attack Detection [[paper](https://arxiv.org/abs/2104.07419)]
- **[VideoGPT]** VideoGPT: Video Generation using VQ-VAE and Transformers [[paper](https://arxiv.org/abs/2104.10157)]
- **[M2TR]** M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection [[paper](https://arxiv.org/abs/2104.09770)]
- Transformer Transforms Salient Object Detection and Camouflaged Object Detection [[paper](https://arxiv.org/abs/2104.10127)]
- **[TransCrowd]** TransCrowd: Weakly-Supervised Crowd Counting with Transformer [[paper](https://arxiv.org/abs/2104.09116)] [[code](https://github.com/dk-liang/TransCrowd)]
- **[TransVG]** TransVG: End-to-End Visual Grounding with Transformers [[paper](https://arxiv.org/abs/2104.08541)]
- Visual Transformer Pruning [[paper](https://arxiv.org/abs/2104.08500)]
- Self-supervised Video Retrieval Transformer Network [[paper](https://arxiv.org/abs/2104.07993)]
- Vision Transformer using Low-level Chest X-ray Feature Corpus for COVID-19 Diagnosis and Severity Quantification [[paper](https://arxiv.org/abs/2104.07235)]
- **[TransGAN]** TransGAN: Two Transformers Can Make One Strong GAN [[paper](https://arxiv.org/abs/2102.07074)] [[code](https://github.com/VITA-Group/TransGAN)]
- Geometry-Free View Synthesis: Transformers and no 3D Priors [[paper](https://arxiv.org/abs/2104.07652)] [[code](https://git.io/JOnwn)]
- **[CoaT]** Co-Scale Conv-Attentional Image Transformers [[paper](https://arxiv.org/abs/2104.06399)] [[code](https://github.com/mlpc-ucsd/CoaT)]
- **[LocalViT]** LocalViT: Bringing Locality to Vision Transformers [[paper](https://arxiv.org/abs/2104.05707)] [[code](https://github.com/ofsoundof/LocalViT)]
- **[ACTOR]** Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [[paper](https://arxiv.org/abs/2104.05670)]
- **[CIT]** Cloth Interactive Transformer for Virtual Try-On [[paper](https://arxiv.org/abs/2104.05519)]
- Handwriting Transformers [[paper](https://arxiv.org/abs/2104.03964)]
- **[SiT]** SiT: Self-supervised vIsion Transformer [[paper](https://arxiv.org/abs/2104.03602)] [[code](https://github.com/Sara-Ahmed/SiT)]
- On the Robustness of Vision Transformers to Adversarial Examples [[paper](https://arxiv.org/abs/2104.02610)]
- An Empirical Study of Training Self-Supervised Visual Transformers [[paper](https://arxiv.org/abs/2104.02057)]
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [[paper](https://arxiv.org/abs/2104.01745)]
- **[AOT-GAN]** Aggregated Contextual Transformations for High-Resolution Image Inpainting [[paper](https://arxiv.org/abs/2104.01431)] [[code](https://github.com/researchmm/AOT-GAN-for-Inpainting)]
- Deepfake Detection Scheme Based on Vision Transformer and Distillation [[paper](https://arxiv.org/abs/2104.01353)]
- **[ATAG]** Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [[paper](https://arxiv.org/pdf/2103.16024)]
- **[LeViT]** LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference [[paper](https://arxiv.org/abs/2104.01136)]
- **[TubeR]** TubeR: Tube-Transformer for Action Detection [[paper](https://arxiv.org/abs/2104.00969)]
- **[AAformer]** AAformer: Auto-Aligned Transformer for Person Re-Identification [[paper](https://arxiv.org/abs/2104.00921)]
- **[TFill]** TFill: Image Completion via a Transformer-Based Architecture [[paper](https://arxiv.org/abs/2104.00845)]
- Group-Free 3D Object Detection via Transformers [[paper](https://arxiv.org/abs/2104.00678)] [[code](https://github.com/zeliu98/Group-Free-3D)]
- **[STGT]** Spatial-Temporal Graph Transformer for Multiple Object Tracking [[paper](https://arxiv.org/abs/2104.00194)]
- **[YOGO]** You Only Group Once: Efficient Point-Cloud Processing with Token Representation and Relation Inference Module [[paper](https://arxiv.org/abs/2103.09975)] [[code](https://github.com/chenfengxu714/YOGO.git)]
- Going deeper with Image Transformers [[paper](https://arxiv.org/abs/2103.17239)]
- **[Stark]** Learning Spatio-Temporal Transformer for Visual Tracking [[paper](https://arxiv.org/abs/2103.17154)] [[code](https://github.com/researchmm/Stark)]
- **[Meta-DETR]** Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning [[paper](https://arxiv.org/abs/2103.11731)] [[code](https://github.com/ZhangGongjie/Meta-DETR)]
- **[DA-DETR]** DA-DETR: Domain Adaptive Detection Transformer by Hybrid Attention [[paper](https://arxiv.org/abs/2103.17084)]
- Robust Facial Expression Recognition with Convolutional Visual Transformers [[paper](https://arxiv.org/abs/2103.16854)]
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [[paper](https://arxiv.org/abs/2103.16553)]
- Spatiotemporal Transformer for Video-based Person Re-identification [[paper](https://arxiv.org/abs/2103.16469)]
- **[PiT]** Rethinking Spatial Dimensions of Vision Transformers [[paper](https://arxiv.org/abs/2103.16302)] [[code](https://github.com/naver-ai/pit)]
- **[TransUNet]** TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [[paper](https://arxiv.org/abs/2102.04306)] [[code](https://github.com/Beckschen/TransUNet)]
- **[CvT]** CvT: Introducing Convolutions to Vision Transformers [[paper](https://arxiv.org/abs/2103.15808)] [[code](https://github.com/leoxiaobin/CvT)]
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [[paper](https://arxiv.org/abs/2103.15358)]
- **[TFPose]** TFPose: Direct Human Pose Estimation with Transformers [[paper](https://arxiv.org/abs/2103.15320)]
- **[TransCenter]** TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [[paper](https://arxiv.org/abs/2103.15145)]
- **[ViViT]** ViViT: A Video Vision Transformer [[paper](https://arxiv.org/abs/2103.15691)]
- **[CrossViT]** CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [[paper](https://arxiv.org/abs/2103.14899)]
- **[TS-CAM]** TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [[paper](https://arxiv.org/abs/2103.14862)]
- Face Transformer for Recognition [[paper](https://arxiv.org/abs/2103.14803)]
- On the Adversarial Robustness of Visual Transformers [[paper](https://arxiv.org/abs/2103.15670)]
- Understanding Robustness of Transformers for Image Classification [[paper](https://arxiv.org/abs/2103.14586)]
- Lifting Transformer for 3D Human Pose Estimation in Video [[paper](https://arxiv.org/abs/2103.14304)]
- **[GSA-Net]** Global Self-Attention Networks for Image Recognition [[paper](https://arxiv.org/abs/2010.03019)]
- High-Fidelity Pluralistic Image Completion with Transformers [[paper](https://arxiv.org/abs/2103.14031)] [[code](http://raywzy.com/ICT)]
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [[paper](https://arxiv.org/abs/2103.14030)] [[code](https://github.com/microsoft/Swin-Transformer)]
- **[DPT]** Vision Transformers for Dense Prediction [[paper](https://arxiv.org/abs/2103.13413)] [[code](https://github.com/intel-isl/DPT)]
- **[TransFG]** TransFG: A Transformer Architecture for Fine-grained Recognition [[paper](https://arxiv.org/abs/2103.07976)]
- **[TimeSformer]** Is Space-Time Attention All You Need for Video Understanding? [[paper](https://arxiv.org/abs/2102.05095)]
- Multi-view 3D Reconstruction with Transformer [[paper](https://arxiv.org/abs/2103.12957)]
- Can Vision Transformers Learn without Natural Images? [[paper](https://arxiv.org/abs/2103.13023)] [[code](https://hirokatsukataoka16.github.io/Vision-Transformers-without-Natural-Images/)]
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [[paper](https://arxiv.org/abs/2103.12091)] [[code](https://github.com/ygjwd12345/TransDepth)]
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [[paper](https://arxiv.org/abs/2103.12115)]
- Instance-level Image Retrieval using Reranking Transformers [[paper](https://arxiv.org/abs/2103.12236)]
- **[BossNAS]** BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search [[paper](https://arxiv.org/abs/2103.12424)] [[code](https://github.com/changlin31/BossNAS)]
- **[CeiT]** Incorporating Convolution Designs into Visual Transformers [[paper](https://arxiv.org/abs/2103.11816)]
- **[DeepViT]** DeepViT: Towards Deeper Vision Transformer [[paper](https://arxiv.org/abs/2103.11886)]
- **[TNT]** Transformer in Transformer [[paper](https://arxiv.org/abs/2103.00112)] [[code](https://github.com/huawei-noah/noah-research/tree/master/TNT)]
- Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training [[paper](https://arxiv.org/abs/2103.10043)]
- 3D Human Pose Estimation with Spatial and Temporal Transformers [[paper](https://arxiv.org/abs/2103.10455)] [[code](https://github.com/zczcwh/PoseFormer)]
- **[SUNETR]** SUNETR: Transformers for 3D Medical Image Segmentation [[paper](https://arxiv.org/abs/2103.10504)]
- Scalable Visual Transformers with Hierarchical Pooling [[paper](https://arxiv.org/abs/2103.10619)]
- **[ConViT]** ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [[paper](https://arxiv.org/abs/2103.10697)]
- **[TransMed]** TransMed: Transformers Advance Multi-modal Medical Image Classification [[paper](https://arxiv.org/abs/2103.05940)]
- **[U-Transformer]** U-Net Transformer: Self and Cross Attention for Medical Image Segmentation [[paper](https://arxiv.org/abs/2103.06104)]
- **[SpecTr]** SpecTr: Spectral Transformer for Hyperspectral Pathology Image Segmentation [[paper](https://arxiv.org/abs/2103.03604)] [[code](https://github.com/hfut-xc-yun/SpecTr)]
- **[TransBTS]** TransBTS: Multimodal Brain Tumor Segmentation Using Transformer [[paper](https://arxiv.org/abs/2103.04430)] [[code](https://github.com/Wenxuan-1119/TransBTS)]
- **[SSTN]** SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving [[paper](https://arxiv.org/abs/2103.03150)]
- **[GANsformer]** Generative Adversarial Transformers [[paper](https://arxiv.org/abs/2103.01209)] [[code](https://github.com/dorarad/gansformer)]
- **[PVT]** Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions [[paper](https://arxiv.org/abs/2102.12122)] [[code](https://github.com/whai362/PVT)]
- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [[paper](https://arxiv.org/abs/2102.10772)] [[code](https://mmf.sh/)]
- **[MedT]** Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [[paper](https://arxiv.org/abs/2102.10662)] [[code](https://github.com/jeya-maria-jose/Medical-Transformer)]
- **[CPVT]** Do We Really Need Explicit Position Encodings for Vision Transformers? [[paper](https://arxiv.org/abs/2102.10882)] [[code](https://github.com/Meituan-AutoML/CPVT)]
- Deepfake Video Detection Using Convolutional Vision Transformer [[paper](https://arxiv.org/abs/2102.11126)]
- Training Vision Transformers for Image Retrieval [[paper](https://arxiv.org/abs/2102.05644)]
- **[TransReID]** TransReID: Transformer-based Object Re-Identification [[paper](https://arxiv.org/abs/2102.04378)]
- **[VTN]** Video Transformer Network [[paper](https://arxiv.org/abs/2102.00719)]
- **[T2T-ViT]** Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [[paper](https://arxiv.org/abs/2101.11986)] [[code](https://github.com/yitu-opensource/T2T-ViT)]
- **[BoTNet]** Bottleneck Transformers for Visual Recognition [[paper](https://arxiv.org/abs/2101.11605)]
- **[CPTR]** CPTR: Full Transformer Network for Image Captioning [[paper](https://arxiv.org/abs/2101.10804)]
- Learn to Dance with AIST++: Music Conditioned 3D Dance Generation [[paper](https://arxiv.org/abs/2101.08779)] [[code](https://google.github.io/aichoreographer/)]
- **[Trans2Seg]** Segmenting Transparent Object in the Wild with Transformer [[paper](https://arxiv.org/abs/2101.08461)] [[code](https://github.com/xieenze/Trans2Seg)]
- **[SMCA]** Fast Convergence of DETR with Spatially Modulated Co-Attention [[paper](https://arxiv.org/abs/2101.07448)]
- Investigating the Vision Transformer Model for Image Retrieval Tasks [[paper](https://arxiv.org/abs/2101.03771)]
- **[Trear]** Trear: Transformer-based RGB-D Egocentric Action Recognition [[paper](https://arxiv.org/abs/2101.03904)]
- **[VisualSparta]** VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [[paper](https://arxiv.org/abs/2101.00265)]
- **[TrackFormer]** TrackFormer: Multi-Object Tracking with Transformers [[paper](https://arxiv.org/abs/2101.02702)]
- **[LETR]** Line Segment Detection Using Transformers without Edges [[paper](https://arxiv.org/abs/2101.01909)]
- **[TAPE]** Transformer Guided Geometry Model for Flow-Based Unsupervised Visual Odometry [[paper](https://arxiv.org/abs/2101.02143)]
- **[TRIQ]** Transformer for Image Quality Assessment [[paper](https://arxiv.org/abs/2101.01097)] [[code](https://github.com/junyongyou/triq)]
- **[TransTrack]** TransTrack: Multiple-Object Tracking with Transformer [[paper](https://arxiv.org/abs/2012.15460)] [[code](https://github.com/PeizeSun/TransTrack)]
- **[TransPose]** TransPose: Towards Explainable Human Pose Estimation by Transformer [[paper](https://arxiv.org/abs/2012.14214)]
- **[DeiT]** Training data-efficient image transformers & distillation through attention [[paper](https://arxiv.org/abs/2012.12877)] [[code](https://github.com/facebookresearch/deit)]
- **[Pointformer]** 3D Object Detection with Pointformer [[paper](https://arxiv.org/abs/2012.11409)]
- **[ViT-FRCNN]** Toward Transformer-Based Object Detection [[paper](https://arxiv.org/abs/2012.09958)]
- **[Taming-transformers]** Taming Transformers for High-Resolution Image Synthesis [[paper](https://arxiv.org/abs/2012.09841)] [[code](https://compvis.github.io/taming-transformers/)]
- **[SceneFormer]** SceneFormer: Indoor Scene Generation with Transformers [[paper](https://arxiv.org/abs/2012.09793)]
- **[PCT]** PCT: Point Cloud Transformer [[paper](https://arxiv.org/abs/2012.09688)]
- **[METRO]** End-to-End Human Pose and Mesh Reconstruction with Transformers [[paper](https://arxiv.org/abs/2012.09760)]
- **[PointTransformer]** Point Transformer [[paper](https://arxiv.org/abs/2012.09164)]
- **[PED]** DETR for Pedestrian Detection [[paper](https://arxiv.org/abs/2012.06785)]
- **[C-Tran]** General Multi-label Image Classification with Transformers [[paper](https://arxiv.org/abs/2011.14027)]
- **[TSP-FCOS]** Rethinking Transformer-based Set Prediction for Object Detection [[paper](https://arxiv.org/abs/2011.10881)]
- **[ACT]** End-to-End Object Detection with Adaptive Clustering Transformer [[paper](https://arxiv.org/abs/2011.09315)]
- **[STTR]** Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers [[paper](https://arxiv.org/abs/2011.02910v2)] [[code](https://github.com/mli0603/stereo-transformer)]
- **[VTs]** Visual Transformers: Token-based Image Representation and Processing for Computer Vision [[paper](https://arxiv.org/abs/2006.03677)]
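Most of the image backbones in the list above (for example DeiT, T2T-ViT, PVT, and Swin Transformer) share the same front end popularized by ViT (listed under 2021 below): the image is split into fixed-size patches, each patch is linearly projected to a token, a positional embedding is added, and the token sequence is fed to a standard transformer encoder. The following is a minimal, hedged PyTorch sketch of that recipe; the class name, hyperparameters, and shapes are our own assumptions and not taken from any repository linked here.

```python
# Minimal sketch (illustrative only) of the ViT-style patch-embedding recipe
# shared by many of the backbones above. Hyperparameters and names are assumptions.
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    def __init__(self, img_size=32, patch_size=8, dim=64, depth=2, heads=4, num_classes=10):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patchify + linear projection in a single strided convolution.
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                        # x: (B, 3, H, W)
        tokens = self.to_tokens(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)           # one [CLS] token per sample
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                           # classify from the [CLS] token


if __name__ == "__main__":
    logits = TinyViT()(torch.randn(2, 3, 32, 32))                # toy batch of 2 images
    print(logits.shape)                                          # torch.Size([2, 10])
```

The papers below mostly vary this template: how patches are formed (T2T-ViT, Swin), how attention is localized or made hierarchical (PVT, Twins), or how convolutions are mixed back in (CvT, LeViT).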
### 2021

- **[NDT-Transformer]** NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation (**ICRA**) [[paper](https://arxiv.org/abs/2103.12292)]
- VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization (**ISIE**) [[paper](https://arxiv.org/abs/2104.10036)]
- Medical Image Segmentation using Squeeze-and-Expansion Transformers (**IJCAI**) [[paper](https://arxiv.org/abs/2105.09511)]
- Vision Transformer for Fast and Efficient Scene Text Recognition (**ICDAR**) [[paper](https://arxiv.org/abs/2105.08582)]
- **[HOTR]** HOTR: End-to-End Human-Object Interaction Detection with Transformers (**CVPR oral**) [[paper](https://arxiv.org/abs/2104.13682)]
- High-Resolution Complex Scene Synthesis with Transformers (**CVPRW**) [[paper](https://arxiv.org/abs/2105.06458)]
- **[TransFuser]** Multi-Modal Fusion Transformer for End-to-End Autonomous Driving (**CVPR**) [[paper](https://arxiv.org/abs/2104.09224)] [[code](https://github.com/autonomousvision/transfuser)]
- Pose Recognition with Cascade Transformers (**CVPR**) [[paper](https://arxiv.org/abs/2104.06976)]
- Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning (**CVPR**) [[paper](https://arxiv.org/abs/2104.03135)]
- **[LoFTR]** LoFTR: Detector-Free Local Feature Matching with Transformers (**CVPR**) [[paper](https://arxiv.org/abs/2104.00680)] [[code](https://zju3dv.github.io/loftr/)]
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers (**CVPR**) [[paper](https://arxiv.org/abs/2103.16553)]
- **[SETR]** Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers (**CVPR**) [[paper](https://arxiv.org/abs/2012.15840)] [[code](https://fudan-zvg.github.io/SETR/)]
- **[TransT]** Transformer Tracking (**CVPR**) [[paper](https://arxiv.org/abs/2103.15436)] [[code](https://github.com/chenxin-dlut/TransT)]
- Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking (**CVPR oral**) [[paper](https://arxiv.org/abs/2103.11681)]
- **[VisTR]** End-to-End Video Instance Segmentation with Transformers (**CVPR**) [[paper](https://arxiv.org/abs/2011.14503)]
- Transformer Interpretability Beyond Attention Visualization (**CVPR**) [[paper](https://arxiv.org/abs/2012.09838)] [[code](https://github.com/hila-chefer/Transformer-Explainability)]
- **[IPT]** Pre-Trained Image Processing Transformer (**CVPR**) [[paper](https://arxiv.org/abs/2012.00364)]
- **[UP-DETR]** UP-DETR: Unsupervised Pre-training for Object Detection with Transformers (**CVPR**) [[paper](https://arxiv.org/abs/2011.09094)]
- **[VTNet]** VTNet: Visual Transformer Network for Object Goal Navigation (**ICLR**) [[paper](https://arxiv.org/abs/2105.09447)]
- **[Vision Transformer]** An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (**ICLR**) [[paper](https://arxiv.org/abs/2010.11929)] [[code](https://github.com/google-research/vision_transformer)]
- **[Deformable DETR]** Deformable DETR: Deformable Transformers for End-to-End Object Detection (**ICLR**) [[paper](https://arxiv.org/abs/2010.04159)] [[code](https://github.com/fundamentalvision/Deformable-DETR)]
- **[LAMBDANETWORKS]** Modeling Long-Range Interactions Without Attention (**ICLR**) [[paper](https://openreview.net/pdf?id=xTJEN-ggl1b)] [[code](https://github.com/lucidrains/lambda-networks)]
- **[LSTR]** End-to-end Lane Shape Prediction with Transformers (**WACV**) [[paper](https://arxiv.org/abs/2011.04233)] [[code](https://github.com/liuruijin17/LSTR)]

### 2020

- **[DETR]** End-to-End Object Detection with Transformers (**ECCV**) [[paper](https://arxiv.org/abs/2005.12872)] [[code](https://github.com/facebookresearch/detr)]
- **[FPT]** Feature Pyramid Transformer (**ECCV**) [[paper](https://arxiv.org/abs/2007.09451)] [[code](https://github.com/ZHANGDONG-NJUST/FPT)]
- **[TTSR]** Learning Texture Transformer Network for Image Super-Resolution (**CVPR**) [[paper](https://arxiv.org/abs/2006.04139)] [[code](https://github.com/researchmm/TTSR)]
- **[STTN]** Learning Joint Spatial-Temporal Transformations for Video Inpainting (**ECCV**) [[paper](https://arxiv.org/abs/2007.10247)] [[code](https://github.com/researchmm/STTN)]

### Acknowledgement

Thanks to [Awesome-Crowd-Counting](https://github.com/gjy3035/Awesome-Crowd-Counting) for the template.