Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick, “Masked Autoencoders Are Scalable Vision Learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2022, pp. 16000–16009, doi: 10.1109/CVPR52688.2022.01553.
@inproceedings{he_masked_2022,
author = {He, Kaiming and Chen, Xinlei and Xie, Saining and Li, Yanghao and Doll\'ar, Piotr and Girshick, Ross},
title = {Masked Autoencoders Are Scalable Vision Learners},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = jun,
year = {2022},
pages = {16000--16009},
arxivdoi = {10.48550/arXiv.2111.06377},
tldr = {Proposes a masked autoencoder (MAE) for pretraining a Vision Transformer (ViT) for image recognition. The MAE is trained on a pixel-reconstruction task with an asymmetric design: the encoder sees only the visible patches, while the lightweight decoder operates on the encoded patches plus mask tokens. For image recognition, the decoder is discarded and the encoder fine-tuned. Best results: ViT-Huge model, with experiments on ImageNet-1K. Ablations abound in the paper.},
supplemental = {https://openaccess.thecvf.com/content/CVPR2022/supplemental/He_Masked_Autoencoders_Are_CVPR_2022_supplemental.pdf},
freepdf = {https://openaccess.thecvf.com/content/CVPR2022/papers/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.pdf},
doi = {10.1109/CVPR52688.2022.01553},
cvf = {https://openaccess.thecvf.com/content/CVPR2022/html/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.html},
code = {https://paperswithcode.com/paper/masked-autoencoders-are-scalable-vision}
}
This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3× or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pretraining and shows promising scaling behavior.
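The masking step described above is simple enough to sketch. Below is a minimal PyTorch illustration of per-sample random masking at the 75% ratio the abstract mentions; this is not the authors' released code, and the function and variable names are illustrative.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (batch, num_patches, dim) sequence of embedded image patches.
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))

    noise = torch.rand(b, n)                  # one random score per patch
    ids_shuffle = noise.argsort(dim=1)        # random permutation per sample
    ids_restore = ids_shuffle.argsort(dim=1)  # inverse permutation

    # Keep the first n_keep patches of the shuffled order; the encoder
    # only ever sees these visible patches.
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    # Binary mask over all positions (0 = visible, 1 = masked), used to
    # restrict the reconstruction loss to the masked patches.
    mask = torch.ones(b, n)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```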
tl;dr: Proposes a masked autoencoder (MAE) for pretraining a Vision Transformer (ViT) for image recognition. The MAE is trained on a pixel-reconstruction task with an asymmetric design: the encoder sees only the visible patches, while the lightweight decoder operates on the encoded patches plus mask tokens. For image recognition, the decoder is discarded and the encoder fine-tuned. Best results: ViT-Huge model, with experiments on ImageNet-1K. Ablations abound in the paper.
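To make the asymmetry concrete, here is a sketch, in the same illustrative PyTorch style as above and again not the released implementation, of how the decoder's input could be assembled: the encoded visible patches are padded with a shared learned mask token and unshuffled back to the original patch order before the lightweight decoder reconstructs pixels.

```python
import torch

def decoder_inputs(encoded, ids_restore, mask_token):
    # encoded:     (batch, n_keep, dim) encoder output for the visible patches.
    # ids_restore: (batch, num_patches) inverse of the masking shuffle.
    # mask_token:  (1, 1, dim) learned embedding shared by all masked positions.
    b, n_keep, d = encoded.shape
    n = ids_restore.shape[1]

    # Append one mask token per hidden patch, then unshuffle so every token
    # returns to its original spatial position (positional embeddings would
    # be added afterwards, before the decoder blocks).
    mask_tokens = mask_token.expand(b, n - n_keep, d)
    full = torch.cat([encoded, mask_tokens], dim=1)
    full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, d))
    return full
```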
Masked Autoencoders Are Scalable Vision Learners.
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick.
arXiv: https://doi.org/10.48550/arXiv.2111.06377
h/t Amy Tabb
Publisher: https://doi.org/10.1109/CVPR52688.2022.01553
pdf: https://openaccess.thecvf.com/content/CVPR2022/papers/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.pdf
code: https://paperswithcode.com/paper/masked-autoencoders-are-scalable-vision