Ongoing research on training transformer language models at scale, including BERT and GPT-2
A PyTorch implementation of Patient Knowledge Distillation for BERT Model Compression (a distillation-loss sketch follows this list)
A PyTorch-based knowledge distillation toolkit for natural language processing
Transformers without Tears: Improving the Normalization of Self-Attention
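
The two distillation entries above share the same core idea: train a small student model to match a large teacher's softened output distribution alongside the usual hard-label objective. The sketch below is a rough illustration of that kind of loss only; the function name, temperature, and weighting are hypothetical and not taken from either repository.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend hard-label cross-entropy with KL divergence to the teacher.

    Hypothetical sketch; values and names are illustrative.
    """
    # Soften both distributions with the same temperature.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps soft-loss gradients on the same scale
    # as the hard-label loss when the temperature changes.
    kd = F.kl_div(soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

Patient KD additionally matches normalized intermediate hidden states of selected teacher layers with an MSE term, which this sketch omits.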
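
The "Transformers without Tears" paper replaces LayerNorm with ScaleNorm, which rescales each activation vector to a single learned length g divided by its L2 norm. Below is a minimal sketch of that idea; the module name and epsilon handling are illustrative, not the authors' reference code.

```python
import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    """Sketch of ScaleNorm: g * x / ||x||_2 with one learned scalar g."""

    def __init__(self, scale: float, eps: float = 1e-5):
        super().__init__()
        # A single learned scalar replaces LayerNorm's
        # per-dimension gain and bias parameters.
        self.g = nn.Parameter(torch.tensor(scale))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize each vector to unit L2 norm, then rescale by g.
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm
```

In the paper this is used with pre-norm residual connections, with the scale initialized to the square root of the model dimension.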