Transformers have long relied on normalization layers such as LayerNorm to stabilize training and improve performance. A recent study challenges this assumption by demonstrating that Transformers can achieve equal or better results without any normalization layers. The key is a surprisingly simple element-wise operation called Dynamic Tanh (DyT), which takes the place of each normalization layer.
Normalization layers were previously considered indispensable for:
Despite these benefits, normalization introduces computational overhead and complicates model architecture. The new research suggests that these layers are not strictly necessary: they can be swapped for a carefully designed activation function.
The study proposes replacing normalization with:
$$ \text{DyT}(x)=\tanh(\alpha x) $$
where $\alpha$ is a learnable or fixed scalar parameter. This operation mimics the S-shaped input-output mappings observed in normalized Transformers while being computationally lighter.
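To make the formula concrete, here is a minimal sketch in plain Python that applies DyT element-wise and, for comparison, a bare LayerNorm without affine parameters. The function names `dyt` and `layer_norm`, the sample activations, and the value `alpha=0.5` are illustrative assumptions, not taken from the study.

```python
import math

def dyt(xs, alpha=0.5):
    """Element-wise Dynamic Tanh: tanh(alpha * x) for each activation.

    alpha=0.5 is an assumed illustrative value, not a recommended setting.
    """
    return [math.tanh(alpha * x) for x in xs]

def layer_norm(xs, eps=1e-5):
    """Plain LayerNorm over one vector (no learned scale or shift)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

# Hypothetical activations with two large outliers.
activations = [-8.0, -1.0, 0.0, 1.0, 8.0]

squashed = dyt(activations, alpha=0.5)   # needs no statistics of the vector
normalized = layer_norm(activations)     # needs the mean and variance

print("DyT:      ", [round(y, 3) for y in squashed])
print("LayerNorm:", [round(y, 3) for y in normalized])
```

Because $\tanh(\alpha x) \approx \alpha x$ for small $|\alpha x|$, inputs near zero pass through almost linearly, while large inputs saturate toward $\pm 1$. Unlike LayerNorm, the sketch never computes a mean or variance, which is where the lighter computation comes from.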
Key advantages of DyT:
Experiments show DyT-based Transformers match or exceed normalized counterparts in:
Notably, models using DyT demonstrate faster training convergence in 68% of tested configurations, challenging the notion that normalization is essential for stable learning [1].
| Task Type | DyT Outcome vs. Normalized Baseline |
|---|---|
| Image Classification | +0.3% top-1 accuracy on ImageNet |
| Language Modeling | Comparable perplexity on WikiText-103 |
| Self-Supervised Learning | Improved linear probe accuracy (+1.2%) |
| Text Generation | Higher BLEU scores in machine translation |
This discovery has far-reaching consequences: