
|TRANSFORMER|LLM|NORMALIZATION|MODIFICATION|

This Is the End of Normalization, and the Transformer Feels Fine

Exploring the Power of Dynamic Tanh in Transformer Models Without Normalization Layers

Salvatore Raieli · Published in Level Up Coding · 9 min read · Mar 17, 2025

This work introduces Dynamic Tanh (DyT), a simple drop-in replacement for normalization layers in Transformers, showing that Transformers can achieve equal or better performance without them. DyT is inspired by the observation that layer normalization often produces tanh-like, S-shaped input-output mappings, and it largely removes the need for hyperparameter tuning. The authors demonstrate its effectiveness across a range of tasks, including recognition, generation, and self-supervised learning in computer vision and language models.
image generated by the author using AI
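
To make the idea concrete, here is a minimal PyTorch sketch of a DyT-style layer as summarized above: an element-wise tanh of a learnably scaled input, followed by the usual affine scale and shift. The initial value of alpha is an illustrative assumption, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Sketch of a DyT layer: no batch or per-token statistics are computed;
    the squashing comes purely from tanh applied to a scaled input."""

    def __init__(self, dim: int, alpha_init: float = 0.5):  # alpha_init is an illustrative default
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init)  # learnable scalar scale
        self.gamma = nn.Parameter(torch.ones(dim))              # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))              # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```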

The young man knows the rules, but the old man knows the exceptions. — Oliver Wendell Holmes, Sr.

Normalization is a building block of modern neural networks: almost every architecture contains at least one normalization layer. Since the arrival of batch normalization in 2015, several alternatives have appeared, many of them domain- or model-specific. Layer normalization (Layer Norm, or LN) is probably the most popular, having been part of the transformer from the start. The purpose of normalization is to speed up and stabilize convergence, especially in deeper networks. This is so widely accepted that researchers routinely look for alternatives to self-attention and the feed-forward layer, but almost never to LN.
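
For reference, here is a minimal sketch of what an LN layer computes inside a transformer: each token is standardized with its own mean and variance over the feature dimension, then rescaled by learnable parameters (equivalent in spirit to PyTorch's nn.LayerNorm).

```python
import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    """Layer normalization over the last (feature) dimension."""
    mean = x.mean(dim=-1, keepdim=True)                 # per-token mean
    var = x.var(dim=-1, keepdim=True, unbiased=False)   # per-token variance
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta
```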

Is it really so? Can’t you do without it?

The authors of a paper just published by Meta do not think so; let’s find out together.

Artificial intelligence is transforming our world, shaping how we live and work. Understanding how it works and its implications has never been more crucial. If you’re looking for simple, clear explanations of complex AI topics, you’re in the right place. Hit Follow or subscribe for free to stay updated with my latest stories and insights.

Batch Normalization aims to reduce internal covariate shift, and in doing so aims to accelerate the training of deep neural nets. It accomplishes this via a normalization step that fixes the means and variances of layer inputs. — source

Batch normalization is a layer that sits between two hidden layers: it takes the output of the first layer and normalizes it before passing it as input to the next. In short, it computes the mean and variance of that output and uses two learnable parameters to rescale the result. In other words, the input is shifted (to a different mean)…
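
As a rough sketch of that step (training-time statistics only, ignoring the running averages that batch norm uses at inference), assuming a (batch, features) input:

```python
import torch

def batch_norm_train(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                     eps: float = 1e-5) -> torch.Tensor:
    """Training-time batch normalization for a (batch, features) input."""
    mean = x.mean(dim=0, keepdim=True)                  # per-feature mean over the batch
    var = x.var(dim=0, keepdim=True, unbiased=False)    # per-feature variance over the batch
    x_hat = (x - mean) / torch.sqrt(var + eps)          # standardize
    return gamma * x_hat + beta                         # learnable scale and shift
```

Note the key difference from LN: the statistics are taken across the batch rather than per token, which is why batch norm behaves differently at training and inference time.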


Written by Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence
