Background
Modern vision transformers (ViTs) share a common problem: some of their tokens exhibit artifacts, which can degrade performance.

Among the models examined, only DINO shows no artifacts, so it serves as a point of comparison for the other methods.
Why and When Artifacts Occur
Relation with High Norms
The authors find that the artifacts correspond to tokens whose norms are much higher than those of the other tokens.
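
As a rough illustration of this relation, the sketch below flags tokens by their L2 norm. The function names, the toy values, and the threshold of 10.0 are assumptions made for illustration; they are not taken from the paper, and a real cutoff would have to be chosen by inspecting the norm distribution of the model at hand.

```python
import math

def token_norms(tokens):
    """Return the L2 norm of each token embedding (each token is a list of floats)."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]

def high_norm_indices(tokens, threshold):
    """Return indices of tokens whose L2 norm exceeds `threshold` (an assumed cutoff)."""
    return [i for i, n in enumerate(token_norms(tokens)) if n > threshold]

# Toy example: four 2-D patch tokens, one of which is a clear outlier.
patch_tokens = [[1.0, 0.0], [0.0, 2.0], [30.0, 40.0], [1.0, 1.0]]
outliers = high_norm_indices(patch_tokens, threshold=10.0)
```

In this toy input only the third token stands out, so it is the one that would be treated as a candidate artifact.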

Appearance in Middle of Training
The high-norm tokens are not present from the start; they appear partway through training.

Information Redundancy
The authors find that the artifacts are tied to information redundancy: the more similar a patch is to its neighbors, the more likely it is to become an artifact.
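
One way to quantify this redundancy, sketched below under assumptions, is the mean cosine similarity between a patch embedding and its four grid neighbors. The helper names and the 2x2 toy grid are hypothetical, and the sketch assumes non-zero embeddings and a grid with at least two patches.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def neighbor_similarity(grid, row, col):
    """Mean cosine similarity between grid[row][col] and its up/down/left/right neighbors."""
    height, width = len(grid), len(grid[0])
    sims = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < height and 0 <= c < width:  # skip neighbors outside the grid
            sims.append(cosine(grid[row][col], grid[r][c]))
    return sum(sims) / len(sims)

# Toy 2x2 grid of 2-D patch embeddings: three identical patches and one different one.
grid = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
redundant_score = neighbor_similarity(grid, 0, 0)  # neighbors are identical to the patch
distinct_score = neighbor_similarity(grid, 1, 1)   # neighbors are orthogonal to the patch
```

A patch with a score close to 1 carries little information beyond its neighbors, which is the situation described above as making an artifact more likely.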

Conclusion
Sufficiently large, well-trained models learn to recognize redundant tokens and repurpose them to store, process, and retrieve global information.
The authors therefore propose register tokens: additional tokens added to the input sequence that give the model a dedicated place to store and process global information.
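
The idea can be sketched as follows: extra tokens are added to the input sequence, the encoder runs as usual, and the outputs at the register positions are discarded. This is a minimal sketch, not the authors' implementation. The function `forward_with_registers`, the token ordering ([CLS], patches, registers), the count of four registers, and the identity stand-in for the encoder are all assumptions for illustration; in a real model the registers would be learnable parameters and the encoder a transformer.

```python
def forward_with_registers(encoder, cls_token, patch_tokens, register_tokens):
    """Run `encoder` on [CLS] + patches + registers, then drop the register outputs."""
    sequence = [cls_token] + patch_tokens + register_tokens
    outputs = encoder(sequence)
    n_patches = len(patch_tokens)
    cls_out = outputs[0]
    patch_out = outputs[1:1 + n_patches]
    # The remaining outputs belong to the registers and are intentionally not returned.
    return cls_out, patch_out

def identity_encoder(sequence):
    """Hypothetical stand-in for a transformer encoder: returns a copy of its input."""
    return [list(tok) for tok in sequence]

cls_token = [0.0, 0.0]
patch_tokens = [[1.0, 2.0], [3.0, 4.0]]
register_tokens = [[9.0, 9.0] for _ in range(4)]  # assumed: four registers

cls_out, patch_out = forward_with_registers(
    identity_encoder, cls_token, patch_tokens, register_tokens
)
```

Because the register outputs are dropped, downstream code sees the same number of patch tokens as it would without registers.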