An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy et al. | ICLR 2021 Oral
In a Nutshell 🥜
Transformers1 have proven highly successful in natural language processing. This paper2 explores whether a very similar architecture can succeed at image classification.
In particular, the authors minimise the changes required to the transformer: they divide each image into fixed-size patches, flatten each patch, and linearly project the flattened patches into a sequence of token embeddings on which multi-head attention is computed.
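The patch-embedding step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the projection matrix here is random, whereas the paper learns it end-to-end, and the position embeddings and [class] token are omitted. The function name is hypothetical, not from the paper's code.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size, embed_dim, rng):
    """Split an image into non-overlapping patches, flatten each patch,
    and project it to the transformer's embedding dimension."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    n_h, n_w = H // patch_size, W // patch_size
    # Rearrange (H, W, C) into (num_patches, patch_size * patch_size * C)
    patches = (image
               .reshape(n_h, patch_size, n_w, patch_size, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(n_h * n_w, patch_size * patch_size * C))
    # Stand-in random matrix for the learned linear projection
    W_embed = rng.standard_normal((patches.shape[1], embed_dim))
    return patches @ W_embed

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))  # a 224x224 RGB image
tokens = image_to_patch_embeddings(img, patch_size=16, embed_dim=768, rng=rng)
print(tokens.shape)  # (196, 768): 14x14 patches, each a 768-d token
```

With 16x16 patches on a 224x224 image, the image becomes a sequence of 196 tokens, exactly the "16x16 words" of the title, which the transformer encoder then processes like a sentence.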
The results of the paper are highly convincing: when the pre-training dataset is large enough, transformers can overcome the lack of CNN inductive biases such as translation equivariance and locality, and outperform state-of-the-art CNN-based methods.
Some Thoughts 💭
Transformers seem to be the new trend in the computer vision domain with its stunning performances.
Although there are works3 exploring distillation and other methods to reduce the data required for training vision transformers, the problem of high computational cost still exists.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021, July). Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (pp. 10347-10357). PMLR.