# Explicit-Sparse-Transformer

**Repository Path**: zongkw/Explicit-Sparse-Transformer

## Basic Information

- **Project Name**: Explicit-Sparse-Transformer
- **Description**: code for Explicit Sparse Transformer
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-06-15
- **Last Updated**: 2021-06-15

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Explicit-Sparse-Transformer

In Explicit Sparse Transformer, we propose an algorithm that sparsifies the attention weights in the Transformer according to their activations.

2021/1/4: We upload the code for Explicit Sparse Transformer in tensor2tensor and fairseq; see t2t_envi_est.sh and fairseq_deen_est.sh for details.

2021/1/14: We fix an import error related to SparseActivatedMultiheadAttention.

2021/5/9: In the preprint, we showed that top-k attention can be combined with the block-sparse method Transformer-XL, which uses a static local attention span. Here we find that top-k attention can also be combined with an adaptive local sparse attention method, "Adaptive Attention Span in Transformers" (https://arxiv.org/abs/1905.07799?context=cs.LG), and that the top-k method further reduces the length of the learned attention span, thus improving attention efficiency. See the 'adaptive-span' directory for the detailed implementation. The illustration below is drawn from the training logs.
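
To make the idea above concrete, here is a minimal PyTorch sketch of top-k sparse attention: keep only the k largest attention scores for each query and mask the rest to negative infinity before the softmax. This is an illustrative sketch, not the fairseq/tensor2tensor implementation shipped in this repository; the function name and default `topk` value are chosen for the example only.

```python
# Illustrative sketch of top-k sparse attention (not the repository's exact code).
import torch
import torch.nn.functional as F


def topk_sparse_attention(q, k, v, topk=8):
    """Scaled dot-product attention keeping only the top-k scores per query.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    topk:    number of attention positions retained for each query.
    """
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # (B, H, Lq, Lk)

    # Find the k-th largest score per query; everything below it is set to
    # -inf so it receives (near-)zero weight after the softmax.
    kth = min(topk, scores.size(-1))
    topk_vals, _ = scores.topk(kth, dim=-1)
    threshold = topk_vals[..., -1:].expand_as(scores)
    sparse_scores = scores.masked_fill(scores < threshold, float("-inf"))

    attn = F.softmax(sparse_scores, dim=-1)
    return torch.matmul(attn, v)


# Usage example with random tensors.
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)
out = topk_sparse_attention(q, k, v, topk=8)
print(out.shape)  # torch.Size([2, 4, 16, 32])
```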