[Paper Review] MAE: Masked Autoencoders Are Scalable Vision Learners

May 02, 2023

Introduction

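In deep learning, architectures with ever greater capability and capacity keep appearing, which has made it possible to train on datasets of millions to hundreds of millions of samples. How to learn well from data at that scale, without bias, has become an important question.

This post covers MAE, the architecture introduced in the paper "Masked Autoencoders Are Scalable Vision Learners", which lets an image encoder learn from large-scale vision datasets more efficiently and more effectively.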

What is a masked autoencoder

The clearest examples of deep learning trained successfully on big data are arguably the foundation models of NLP.

Among them, BERT uses masked autoencoding: it randomly masks part of the input and predicts the masked part. Trained this way with self-supervised learning, it reached excellent performance.

There have been steady attempts to carry the masked autoencoder idea over to computer vision as is, but the results consistently fell short of what NLP achieved.

The paper therefore starts by analyzing what makes masked autoencoding behave differently in vision and in language, and identifies the following factors.

Characteristics of convolution layers

Until recently, most vision models were built on convolution layers. Because a convolution operates locally on a regular grid, it does not integrate 'indicators' such as mask tokens or positional embeddings well.

This architectural gap has narrowed, however, now that the Transformer architecture from NLP is applied to vision tasks through models such as ViT.

Information density

Language is created by humans and is highly semantic and information-dense. Predicting a masked word therefore pushes the model toward sophisticated language understanding, because the word has to be inferred from high-level information.

Images, in contrast, are spatially highly redundant. A model can infer a missing patch (mask) from neighboring pixels without any high-level understanding of the context. Since the masks can be filled in without learning a latent representation of the image itself, the encoder improves very little.
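
The redundancy argument can be illustrated with a toy example. The sketch below is not from the paper: `estimate_from_neighbours` and the pixel values are hypothetical, and a single row of intensities stands in for an image. It only shows that on a smooth signal a missing value can be recovered by averaging its neighbors, with no understanding of the content.

```python
def estimate_from_neighbours(row, index):
    """Guess row[index] as the mean of its left and right neighbours."""
    return (row[index - 1] + row[index + 1]) / 2


# Hypothetical pixel intensities along a smooth gradient.
smooth_row = [10, 12, 14, 16, 18]

# Pretend the middle pixel is masked and fill it in from its neighbours.
estimate = estimate_from_neighbours(smooth_row, 2)
error = abs(estimate - smooth_row[2])
print(estimate, error)
```

A masked word in a sentence has no comparable shortcut, which is the asymmetry this section describes.
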

Decoder role

The decoder also plays a different role when reconstructing language and when reconstructing images. A vision decoder reconstructs pixels, which carry a relatively low level of semantic information.

An NLP decoder, in contrast, reconstructs missing words, which carry a very high level of semantic information, so the decoder can be trivial, for example a simple layer such as an MLP.

The authors found that in vision the design of the image decoder plays a key role in determining the semantic level of the latent representation the encoder learns.

MAE

Based on this analysis, the paper designs MAE, a simple, efficient, and scalable masked autoencoder.

The core of MAE is its asymmetric encoder-decoder design.

The encoder learns features only from the "visible patches", with no mask tokens, while the decoder uses both the encoder's latent representation and the mask tokens to reconstruct the pixels.

Because the decoder is lightweight, handling all the mask tokens only in the decoder reduces the computation cost substantially.

In this design, a higher masking ratio means fewer patches enter the encoder, which lowers the computation. It also addresses the information density problem described above, because a heavily masked image can no longer be filled in from neighboring pixels alone.
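
To make the asymmetry concrete, here is a small bookkeeping sketch of how many tokens each half of the model sees. It assumes a 224x224 input with 16x16 patches (a common ViT setting, not stated in this post) and ignores the class token; `mae_token_counts` is a hypothetical helper.

```python
def mae_token_counts(image_size=224, patch_size=16, mask_ratio=0.75):
    """Count the tokens seen by the encoder and the decoder (class token ignored)."""
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    num_visible = int(num_patches * (1 - mask_ratio))
    return {
        "encoder_tokens": num_visible,                # visible patches only
        "mask_tokens": num_patches - num_visible,     # introduced at the decoder
        "decoder_tokens": num_patches,                # visible + mask tokens
    }


counts = mae_token_counts()
print(counts)
```

Under these assumptions the large encoder processes a quarter of the tokens, and only the lightweight decoder processes the full set.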

Masking

Random sampling

The masking procedure is simple. As in ViT, the image is divided into patches, and patches are then sampled at random following a uniform distribution, which splits them into a visible (input) subset and a masked subset. Sampling uniformly also prevents a center bias.
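
A minimal sketch of this sampling step, using only the standard library: shuffle the patch indices and keep the first 25% as visible. The function name `random_masking` and the `rng` parameter are illustrative choices, not the authors' code.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, rng=None):
    """Split patch indices into a visible subset and a masked subset, uniformly at random."""
    rng = rng or random.Random()
    num_keep = int(num_patches * (1 - mask_ratio))

    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)  # every patch is equally likely to be kept

    ids_keep = sorted(shuffled[:num_keep])
    ids_mask = sorted(shuffled[num_keep:])
    return ids_keep, ids_mask


ids_keep, ids_mask = random_masking(196, mask_ratio=0.75, rng=random.Random(0))
print(len(ids_keep), len(ids_mask))
```

Because each index is equally likely to land in the kept prefix, no image region is favored, which is how the center bias is avoided.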

The authors also tried other masking strategies such as block-wise and grid-wise masking, but compared with random sampling these made the task either too easy or too hard, which tended to keep the encoder from learning a good latent representation.
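
For comparison, grid-wise masking is easy to write down. The paper describes it as regularly keeping one of every four patches; the sketch below assumes that means the top-left patch of every 2x2 group on a 14x14 patch grid, which is my reading rather than the authors' code. `grid_keep_indices` is a hypothetical helper.

```python
def grid_keep_indices(patches_per_side=14):
    """Keep the top-left patch of every 2x2 group (one of every four patches)."""
    return [
        row * patches_per_side + col
        for row in range(0, patches_per_side, 2)
        for col in range(0, patches_per_side, 2)
    ]


grid_keep = grid_keep_indices()
print(len(grid_keep), grid_keep[:4])
```

Every masked patch here has visible neighbors, so the masking ratio is the same 75% as with random sampling but the reconstruction task is more regular.
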

Masking ratio

To get the most out of the asymmetric design described above, the key question is whether the encoder can still learn well under a high masking ratio.

Because of the information density of images, the masking ratio at which the encoder learns best turns out to be quite high, around 75%. This is considerably higher than in earlier masking work in vision (20% to 50%) and in NLP models (about 15%).

Mask token

The key point of the MAE design is that the encoder skips mask tokens. When mask tokens were fed to the encoder during training, linear probing accuracy dropped sharply, by about 14%.

On top of that, with a masking ratio as high as 75%, feeding every mask token to the encoder would raise the training cost several times over, because the cost of self-attention grows quadratically with the number of tokens.
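
The quadratic effect is easy to check with rough arithmetic. The sketch below counts only token-to-token attention pairs and assumes 196 patches per image (224x224 input, 16x16 patches), which is not stated in this post; it is not a FLOP count for a full Transformer block.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs self-attention has to score."""
    return num_tokens * num_tokens


all_tokens = 196          # every patch, as if mask tokens entered the encoder
visible_tokens = 49       # 25% of the patches at a 75% masking ratio

pair_ratio = attention_pairs(all_tokens) / attention_pairs(visible_tokens)
token_ratio = all_tokens / visible_tokens
print(pair_ratio, token_ratio)
```

The attention term shrinks by the square of the token reduction, while per-token parts of the block, such as the MLP, shrink only linearly.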

Encoder

The encoder is a standard ViT. The image is split into patches, and each patch is embedded as a token with a positional embedding added.

The difference is that the encoder processes only a subset of about one quarter (25%) of the patches. This cuts computation and memory use substantially and makes it easy to scale the encoder up to larger sizes.
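
As an illustration of the input side, the following sketch splits a tiny single-channel "image" into flattened patches and keeps only a visible subset for the encoder. Nested lists stand in for tensors, the linear projection and positional embedding are left out, and `patchify` and the chosen indices are hypothetical.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ])
    return patches


# A 4x4 toy image whose pixel values equal their row-major position.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)

# Only the visible subset (here 2 of 4 patches, chosen by hand) reaches the encoder.
visible_patches = [patches[i] for i in (0, 3)]
print(visible_patches)
```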

Decoder

The decoder operates on the full input, including the mask tokens the encoder skipped. Positional embeddings are added again to every token in this full set, mask tokens included, so that the decoder knows where each token is located in the image.
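
A minimal sketch of how the decoder input could be assembled, with plain lists in place of tensors: encoded visible tokens go back to their original positions, every other position gets the same mask token, and a positional embedding is added to all of them. `build_decoder_input` and the toy values are hypothetical; in the real model the mask token and embeddings are learned vectors.

```python
def build_decoder_input(encoded_visible, ids_keep, pos_embed, mask_token):
    """Restore the full token sequence and add positional embeddings to every token."""
    num_patches = len(pos_embed)

    # Start from mask tokens everywhere, then put visible tokens back in place.
    tokens = [mask_token] * num_patches
    for token, position in zip(encoded_visible, ids_keep):
        tokens[position] = token

    # Add the positional embedding element-wise, mask tokens included.
    return [
        [value + offset for value, offset in zip(token, pos)]
        for token, pos in zip(tokens, pos_embed)
    ]


pos_embed = [[float(i), 0.0] for i in range(4)]   # toy embedding: position index in slot 0
mask_token = [0.0, 0.0]                           # one vector shared by all masked positions
encoded_visible = [[1.0, 1.0], [2.0, 2.0]]        # encoder outputs for patches 0 and 3

decoder_input = build_decoder_input(encoded_visible, [0, 3], pos_embed, mask_token)
print(decoder_input)
```

Since all mask tokens start as the same vector, the positional embedding is the only thing that distinguishes one masked position from another.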

The decoder serves only as the means of reconstructing the image from the latent representation the encoder has learned. It is therefore used only during pre-training and can be designed freely, independently of the encoder.
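
This post does not spell out the reconstruction loss. In the paper it is a mean squared error between reconstructed and original pixels, computed only on the masked patches. The sketch below follows that description with plain lists; `masked_mse` and the toy values are hypothetical.

```python
def masked_mse(pred_patches, target_patches, ids_mask):
    """Mean squared error over the pixels of masked patches only."""
    total = 0.0
    count = 0
    for index in ids_mask:
        for pred, target in zip(pred_patches[index], target_patches[index]):
            total += (pred - target) ** 2
            count += 1
    return total / count


target_patches = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
pred_patches = [[9.0, 9.0], [1.0, 2.0], [2.0, 4.0]]

# Patch 0 is visible, so its large error does not contribute to the loss.
loss = masked_mse(pred_patches, target_patches, ids_mask=[1, 2])
print(loss)
```
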

Decoder design

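Unlike in NLP, the output of an image task is rather non-trivial, so the decoder is also built from several Transformer blocks.

In the experiments, a decoder with a single block was already enough for strong performance under fine-tuning, where the last layers of the encoder can still adapt to the recognition task.

For linear probing, where the encoder is frozen and its features are used for recognition as they are, the number of decoder blocks directly affected performance.

As for width, a decoder dimension of about 512 gave good results for both fine-tuning and linear probing, which is narrower than the encoder's 1024.

So when the goal is fine-tuning, a decoder with a depth of one block and 512 dimensions is sufficient, giving good accuracy at a low computation cost.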
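
To get a feel for how light such a decoder is, here is a rough parameter estimate. It uses the common approximation of about 12 x dim^2 weights per Transformer block (4 x dim^2 for attention, 8 x dim^2 for an MLP with expansion ratio 4) and ignores biases, normalization, and embeddings. The 24-block, 1024-wide encoder is assumed to be the ViT-Large setting; the helper names are hypothetical.

```python
def approx_block_params(dim, mlp_ratio=4):
    """Rough weight count of one Transformer block (biases and norms ignored)."""
    attention = 4 * dim * dim             # query, key, value and output projections
    mlp = 2 * mlp_ratio * dim * dim       # two linear layers with a hidden size of mlp_ratio * dim
    return attention + mlp


def approx_stack_params(depth, dim):
    """Rough weight count of a stack of identical blocks."""
    return depth * approx_block_params(dim)


encoder_params = approx_stack_params(depth=24, dim=1024)   # assumed ViT-Large encoder
decoder_params = approx_stack_params(depth=1, dim=512)     # one-block, 512-wide decoder

print(encoder_params, decoder_params, encoder_params // decoder_params)
```

Under these assumptions the one-block decoder holds roughly one percent of the encoder's block weights, so running it on the full token set adds little cost.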

Experiments

Partial fine-tuning

When fine-tuning a model pre-trained as a masked autoencoder, tuning the weights of only the last few encoder layers, rather than all of them, already gives good results. (Tuning all 24 blocks corresponds to full fine-tuning.)
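
One way to picture partial fine-tuning is as a per-block "trainable" flag. The sketch below is framework-free and illustrative only: `trainable_flags` is a hypothetical helper, and in a real training script each flag would decide whether that block's weights receive gradient updates.

```python
def trainable_flags(num_blocks, num_tuned):
    """Mark only the last `num_tuned` encoder blocks as trainable."""
    if not 0 <= num_tuned <= num_blocks:
        raise ValueError("num_tuned must be between 0 and num_blocks")
    first_tuned = num_blocks - num_tuned
    return [index >= first_tuned for index in range(num_blocks)]


partial = trainable_flags(num_blocks=24, num_tuned=4)    # tune the last 4 blocks
frozen = trainable_flags(num_blocks=24, num_tuned=0)     # all blocks frozen, as in linear probing
full = trainable_flags(num_blocks=24, num_tuned=24)      # full fine-tuning

print(sum(partial), sum(frozen), sum(full))
```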

This is supported both by related prior work and by the MAE experiments. For more detail, see the references below.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NeurIPS, 2014.
Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.

Transfer learning

MAE also brought meaningful improvements on a variety of downstream tasks, including detection, segmentation, and classification.

Conclusions

"Masked Autoencoders Are Scalable Vision Learners" borrows the self-supervised learning approach that NLP uses for large-scale data and proposes a new way for image models to learn from large-scale data efficiently and effectively as well.

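With this method, vision models can also be trained in a self-supervised way on very large datasets. Its usefulness is reflected in foundation models from Meta: "Segment Anything" uses an MAE pre-trained ViT as its image encoder, and "DINO", although it relies on self-distillation rather than masked autoencoding, likewise trains an image encoder on large collections of real-world images without labels.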

Moreover, the method is not tied to a particular downstream task. Because it helps the image encoder itself learn better features, it is a meaningful piece of research for deep learning in vision.
