Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick, “Masked Autoencoders Are Scalable Vision Learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2022, pp. 16000–16009, doi: 10.1109/CVPR52688.2022.01553.
@inproceedings{he_masked_2022,
author = {He, Kaiming and Chen, Xinlei and Xie, Saining and Li, Yanghao and Doll\'ar, Piotr and Girshick, Ross},
title = {Masked Autoencoders Are Scalable Vision Learners},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = jun,
year = {2022},
pages = {16000--16009},
arxivdoi = {10.48550/arXiv.2111.06377},
tldr = {Proposes a masked autoencoder (MAE) for pretraining a Vision Transformer (ViT) for image recognition. The MAE is trained on a pixel-reconstruction task with an asymmetric design: the encoder sees only the visible patches, while the lightweight decoder operates on the encoded patches plus mask tokens. For image recognition, the decoder is discarded and the encoder fine-tuned. Best results: ViT-Huge model, with experiments on ImageNet-1K. Ablations abound in the paper.},
supplemental = {https://openaccess.thecvf.com/content/CVPR2022/supplemental/He_Masked_Autoencoders_Are_CVPR_2022_supplemental.pdf},
freepdf = {https://openaccess.thecvf.com/content/CVPR2022/papers/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.pdf},
doi = {10.1109/CVPR52688.2022.01553},
cvf = {https://openaccess.thecvf.com/content/CVPR2022/html/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.html},
code = {https://paperswithcode.com/paper/masked-autoencoders-are-scalable-vision}
}
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3× or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
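The masking step described above is simple enough to sketch. Below is a minimal PyTorch illustration of per-sample random masking at the 75% ratio the abstract mentions; this is not the authors' released code, and the function and variable names are illustrative.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (batch, num_patches, dim) sequence of embedded image patches.
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))

    noise = torch.rand(b, n)                  # one random score per patch
    ids_shuffle = noise.argsort(dim=1)        # random permutation per sample
    ids_restore = ids_shuffle.argsort(dim=1)  # inverse permutation

    # Keep the first n_keep patches of the shuffled order; the encoder
    # only ever sees these visible patches.
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    # Binary mask over all positions (0 = visible, 1 = masked), used to
    # restrict the reconstruction loss to the masked patches.
    mask = torch.ones(b, n)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```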
tl;dr: Proposes a masked autoencoder (MAE) for pretraining a Vision Transformer (ViT) for image recognition. The MAE is trained on a pixel-reconstruction task with an asymmetric design: the encoder sees only the visible patches, while the lightweight decoder operates on the encoded patches plus mask tokens. For image recognition, the decoder is discarded and the encoder fine-tuned. Best results: ViT-Huge model, with experiments on ImageNet-1K. Ablations abound in the paper.
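To make the asymmetry concrete, here is a sketch, in the same illustrative PyTorch style as above and again not the released implementation, of how the decoder's input could be assembled: the encoded visible patches are padded with a shared learned mask token and unshuffled back to the original patch order before the lightweight decoder reconstructs pixels.

```python
import torch

def decoder_inputs(encoded, ids_restore, mask_token):
    # encoded:     (batch, n_keep, dim) encoder output for the visible patches.
    # ids_restore: (batch, num_patches) inverse of the masking shuffle.
    # mask_token:  (1, 1, dim) learned embedding shared by all masked positions.
    b, n_keep, d = encoded.shape
    n = ids_restore.shape[1]

    # Append one mask token per hidden patch, then unshuffle so every token
    # returns to its original spatial position (positional embeddings would
    # be added afterwards, before the decoder blocks).
    mask_tokens = mask_token.expand(b, n - n_keep, d)
    full = torch.cat([encoded, mask_tokens], dim=1)
    full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, d))
    return full
```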
Masked Autoencoders Are Scalable Vision Learners.
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick.
arXiv: https://doi.org/10.48550/arXiv.2111.06377
h/t Amy Tabb
Publisher: https://doi.org/10.1109/CVPR52688.2022.01553
pdf: https://openaccess.thecvf.com/content/CVPR2022/papers/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.pdf
code: https://paperswithcode.com/paper/masked-autoencoders-are-scalable-vision