Background

Modern vision transformers (ViTs) share a common problem: some of their tokens contain artifacts, which can lead to suboptimal performance.

DINO is the only model that does not exhibit artifacts, which makes it a useful point of comparison for other methods.

Why and When Artifacts Occur

Relation with High Norms

The author finds that the artifacts correspond to tokens with unusually high norms.
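
The link between artifacts and high norms suggests a simple check: compute the L2 norm of every token and flag the ones that stand far above the rest. The following is a minimal sketch, not the author's procedure. The function names, the median-based threshold, and the default ratio of 3.0 are all assumptions chosen for illustration.

```python
import math
import statistics


def token_norms(tokens):
    """Return the L2 norm of each token vector."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def flag_high_norm(tokens, ratio=3.0):
    """Return indices of tokens whose norm exceeds `ratio` times the median norm.

    The median-based threshold and the default ratio are illustrative
    assumptions, not values taken from the source.
    """
    norms = token_norms(tokens)
    median = statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > ratio * median]


# Toy patch tokens: the third one has a much larger norm than the others.
patch_tokens = [
    [1.0, 0.0, 1.0],
    [0.5, 1.0, 0.5],
    [30.0, 40.0, 0.0],
    [1.0, 1.0, 1.0],
]
outliers = flag_high_norm(patch_tokens)
print(outliers)
```

In practice the tokens would come from the output of a trained ViT, and the threshold would need to be tuned to the model's norm distribution.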

Appearance in the Middle of Training

High-norm tokens are not present from the start; they appear in the middle of training.

Information Redundancy

The author finds that the artifacts are caused by information redundancy: if a patch is very similar to its neighbors, it is more likely to become an artifact.
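
One way to make "similar to its neighbors" concrete is to average the cosine similarity between a patch and its adjacent patches. The sketch below does this on a small grid of patch vectors. The choice of cosine similarity, the four-neighbor (up, down, left, right) neighborhood, and the function names are assumptions for illustration, and the grid is assumed to contain more than one patch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def neighbor_similarity(grid, row, col):
    """Mean cosine similarity between patch (row, col) and its 4-neighbors.

    Neighbors that fall outside the grid are skipped.
    """
    height, width = len(grid), len(grid[0])
    sims = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < height and 0 <= c < width:
            sims.append(cosine(grid[row][col], grid[r][c]))
    return sum(sims) / len(sims)


# Toy 2x3 grid: the left region is uniform, the right column differs.
grid = [
    [[1.0, 1.0], [1.0, 1.0], [1.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0], [0.0, 1.0]],
]
sim_uniform = neighbor_similarity(grid, 0, 0)  # patch in the uniform region
sim_distinct = neighbor_similarity(grid, 0, 2)  # patch unlike its neighbors
print(sim_uniform > sim_distinct)
```

Under the redundancy explanation, patches like the one at (0, 0), which score close to 1, would be the more likely candidates for artifacts.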

Conclusion

Sufficiently large and well-trained models learn to recognize redundant tokens and use them to store, process, and extract global information.

The author therefore proposes the Register Token, a token dedicated to storing and processing global information.
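
The idea can be sketched as three steps: append a few extra tokens to the patch sequence, let the model process the full sequence, then discard the extra outputs so that only patch outputs remain. The code below is a toy illustration under stated assumptions, not the author's implementation. The function names are hypothetical, `toy_mixing_layer` is a stand-in for a real transformer layer, and the registers are fixed zeros here, whereas in a real model they would be learnable parameters.

```python
def add_registers(patch_tokens, register_tokens):
    """Append register tokens to the end of the patch token sequence."""
    return list(patch_tokens) + list(register_tokens)


def drop_registers(output_tokens, num_registers):
    """Keep only the patch outputs, discarding the trailing register outputs."""
    if num_registers == 0:
        return list(output_tokens)
    return list(output_tokens[:-num_registers])


def toy_mixing_layer(tokens):
    """Stand-in for a transformer layer: add the sequence mean to every token.

    This only mimics the fact that every token can exchange information
    with every other token; it is not attention.
    """
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return [[tok[d] + mean[d] for d in range(dim)] for tok in tokens]


patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
registers = [[0.0, 0.0], [0.0, 0.0]]  # assumption: learnable in a real model

sequence = add_registers(patches, registers)
outputs = toy_mixing_layer(sequence)
patch_outputs = drop_registers(outputs, len(registers))
print(len(sequence), len(patch_outputs))
```

The design choice illustrated here is that registers take part in the computation but never appear in the final feature map, giving the model a place to keep global information other than redundant patch tokens.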