Transformers without Normalization?

Transformers have long relied on normalization layers like LayerNorm to stabilize training and improve performance. However, a recent study challenges this paradigm by demonstrating that Transformers can achieve equal or better results without any normalization layers. The key lies in a surprisingly simple element-wise operation called Dynamic Tanh (DyT), which takes the place of each normalization layer.

The Role of Normalization in Transformers

Normalization layers were previously considered indispensable for:

Stabilizing gradient flow during training
Accelerating convergence by reducing internal covariate shift
Enabling deeper architectures by reducing sensitivity to weight initialization

Despite these benefits, normalization introduces computational overhead and complicates model architecture. The new research reveals that these layers might be redundant when replaced with a strategically designed activation function.

Introducing Dynamic $\tanh$

The study proposes replacing normalization with:

$$ \text{DyT}(x) = \tanh(\alpha x) $$

where $\alpha$ is a learnable or fixed scalar parameter. This operation mimics the S-shaped input-output mappings observed in normalized Transformers while being computationally lighter.
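
To make the formula concrete, here is a minimal sketch in plain Python that applies $\tanh(\alpha x)$ element-wise to a handful of activations. The function name `dyt`, the sample values, and the choice $\alpha = 0.5$ are illustrative assumptions, not details taken from the study.

```python
import math

def dyt(xs, alpha=0.5):
    """Apply tanh(alpha * x) to each activation in xs (alpha=0.5 is an assumed value)."""
    return [math.tanh(alpha * x) for x in xs]

# Hypothetical activations: two extreme values and three small ones.
activations = [-8.0, -0.2, 0.0, 0.2, 8.0]
squashed = dyt(activations, alpha=0.5)

for x, y in zip(activations, squashed):
    print(f"{x:+.1f} -> {y:+.4f}")
```

Small inputs pass through almost linearly (scaled by $\alpha$), while extreme inputs are squashed toward -1 or +1, which is the S-shaped behavior described above.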

Key advantages of DyT:

Eliminates per-layer normalization computations
Reduces memory footprint by removing normalization parameters
Maintains performance across diverse tasks without extensive hyper-parameter tuning

Performance Across Domains

Experiments show DyT-based Transformers match or exceed normalized counterparts across the following task types:

Task Type	Performance Outcome
Image Classification	+0.3% top-1 accuracy on ImageNet
Language Modeling	Comparable perplexity on WikiText-103
Self-Supervised Learning	Improved linear probe accuracy (+1.2%)
Text Generation	Higher BLEU scores in machine translation

Notably, models using DyT demonstrate faster training convergence in 68% of tested configurations, challenging the notion that normalization is essential for stable learning [1].

Implications for AI Development

This discovery has far-reaching consequences:

Simpler architectures: Removing normalization layers reduces model complexity