Transformers have long relied on normalization layers such as LayerNorm to stabilize training and improve performance. A recent study challenges this assumption by showing that Transformers can achieve equal or better results without any normalization layers. The key is a simple element-wise operation called Dynamic Tanh (DyT), which takes the place of each normalization layer.

The Role of Normalization in Transformers

Normalization layers were previously considered indispensable for:

Despite these benefits, normalization introduces computational overhead and complicates model architecture. The study suggests that these layers may be redundant: they can be replaced with a simple element-wise function.

Introducing Dynamic $\tanh$

The study proposes replacing normalization with:

$$ \text{DyT}(x) = \tanh(\alpha x) $$

where $\alpha$ is a learnable or fixed scalar parameter. This operation mimics the S-shaped input-output mappings observed in normalized Transformers while being computationally lighter.
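As a rough illustration of the operation above, the following pure-Python sketch applies DyT element-wise to a small vector and, for contrast, applies a plain LayerNorm without affine parameters. The function names, the sample activations, and the value 0.5 for the scalar are assumptions made for this example, not details taken from the study.

```python
import math


def dyt(xs, alpha=0.5):
    """Element-wise DyT(x) = tanh(alpha * x).

    alpha is a single scalar; 0.5 is an illustrative value (assumption).
    No statistics over the vector are computed.
    """
    return [math.tanh(alpha * x) for x in xs]


def layer_norm(xs, eps=1e-5):
    """Plain LayerNorm without learnable affine parameters, for contrast only."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


# Hypothetical activations with two large outliers.
activations = [-8.0, -1.0, 0.0, 1.0, 8.0]

squashed = dyt(activations, alpha=0.5)   # outliers are squashed toward -1 and 1
normalized = layer_norm(activations)     # requires the mean and variance of the vector

print("DyT:      ", [round(y, 4) for y in squashed])
print("LayerNorm:", [round(y, 4) for y in normalized])
```

In this sketch each DyT output depends only on its own input, whereas the LayerNorm reference first has to reduce over the whole vector to obtain a mean and a variance.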

Key advantages of DyT:

Performance Across Domains

Experiments show DyT-based Transformers match or exceed normalized counterparts in:

Notably, models using DyT demonstrate faster training convergence in 68% of tested configurations, challenging the notion that normalization is essential for stable learning [1].

| Task Type | Performance Outcome |
|---|---|
| Image Classification | +0.3% top-1 accuracy on ImageNet |
| Language Modeling | Comparable perplexity on WikiText-103 |
| Self-Supervised Learning | Improved linear probe accuracy (+1.2%) |
| Text Generation | Higher BLEU scores in machine translation |

Implications for AI Development

This discovery has far-reaching consequences:

  1. Simpler architectures: Removing normalization layers reduces model complexity