CLIP: Learning Transferable Visual Models From Natural Language Supervision (2021)

Introduction

CLIP is a model developed by OpenAI that learns a joint embedding space for images and text. It is known to have been used in the DALL-E service.

DALL-E

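CLIP is designed to learn meaningful, robust representations of images and text in a shared embedding space.

⇒ It learns a joint embedding of images and text.

In other words, an image and a text that express similar concepts lie close to each other in the embedding space.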
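As a minimal sketch of what "close in the embedding space" means, the snippet below compares toy image and text vectors by cosine similarity and picks the closest caption for each image. The vectors and the names `image_embeddings`, `text_embeddings`, and `best_caption` are made up for illustration; they are not outputs of CLIP's encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Hand-made toy embeddings (hypothetical values, not produced by CLIP).
image_embeddings = {
    "dog_photo": [0.9, 0.1, 0.0],
    "car_photo": [0.0, 0.2, 0.9],
}
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.1, 0.95],
}

def best_caption(image_name):
    """Return the caption whose embedding is closest to the given image."""
    image_vec = image_embeddings[image_name]
    return max(
        text_embeddings,
        key=lambda text: cosine_similarity(image_vec, text_embeddings[text]),
    )

print(best_caption("dog_photo"))
print(best_caption("car_photo"))
```

In a real system the vectors would come from trained image and text encoders; only the nearest-neighbour comparison is shown here.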

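It helps bridge the gap between how AI models understand natural language and how they understand images.

CLIP can be viewed as a form of supervised learning in which natural language supervises the image representation.

This is not the first attempt of its kind.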

ConVIRT
- The approach that CLIP adopts (the CLIP paper describes CLIP as a simplified version of ConVIRT)

g_v is a non-linear projection.
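The idea of projecting encoder outputs with a non-linear function and training on matched image-text pairs can be sketched as follows. This is an illustrative sketch under assumptions, not ConVIRT's or CLIP's actual implementation: the projection is assumed to be linear, ReLU, linear; the two loss directions are weighted equally; and `project`, `contrastive_loss`, and `temperature=0.1` are names and values chosen here.

```python
import math

def relu(vec):
    return [max(0.0, x) for x in vec]

def linear(vec, weights):
    """Multiply a vector by a weight matrix given as a list of output rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def project(vec, w1, w2):
    """Assumed non-linear projection: linear -> ReLU -> linear, then L2-normalize."""
    out = linear(relu(linear(vec, w1)), w2)
    norm = math.sqrt(sum(x * x for x in out)) or 1.0
    return [x / norm for x in out]

def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    """Symmetric InfoNCE-style loss; pair i of images and texts is the positive."""
    n = len(image_vecs)
    # sims[i][j]: similarity of image i and text j, scaled by the temperature.
    sims = [
        [sum(a * b for a, b in zip(v, u)) / temperature for u in text_vecs]
        for v in image_vecs
    ]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], computed with the log-sum-exp trick.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_sum - row[i]
        return total / n

    image_to_text = cross_entropy(sims)
    text_to_image = cross_entropy([list(col) for col in zip(*sims)])
    return 0.5 * (image_to_text + text_to_image)
```

The loss is small when each image is most similar to its own text and large when the pairs are mismatched, which is what pulls matching pairs together in the shared space.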


VirTex
- Consists of two stages: language-supervised pretraining and downstream transfer
- Language-supervised pretraining: image captioning
- Downstream task: object detection


ICMLM
- NAVER LABS Europe
- ECCV 2020
- Builds on a standard masked language model and trains it jointly with the image's conv net features

Figure 2: Illustration of the masked language modeling (MLM) and image-conditioned masked language modeling (ICMLM) tasks. Our work builds on MLM - which has become standard in natural language processing - to extend it to the visual domain, enabling the creation of strong and generic visual representations.
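To make the image-conditioned masking task concrete, here is a rough sketch of how one training example could be assembled: a caption token is replaced by a mask symbol, and the model is asked to recover it from the remaining tokens plus the image features. The names `MASK_TOKEN`, `make_masked_example`, and `build_model_input` are hypothetical, and the feature values are placeholders; this only illustrates the data layout, not ICMLM's actual pipeline.

```python
MASK_TOKEN = "[MASK]"

def make_masked_example(caption_tokens, mask_index):
    """Return (masked_tokens, target_token) for one masked position."""
    if not 0 <= mask_index < len(caption_tokens):
        raise IndexError("mask_index is out of range")
    masked = list(caption_tokens)  # copy so the original caption is unchanged
    target = masked[mask_index]
    masked[mask_index] = MASK_TOKEN
    return masked, target

def build_model_input(masked_tokens, image_features):
    """Pair the masked caption with image features for a model to consume."""
    return {"tokens": masked_tokens, "image_features": image_features}

caption = ["a", "dog", "catches", "a", "frisbee"]
masked, target = make_masked_example(caption, 1)
example = build_model_input(masked, image_features=[0.12, 0.87, 0.05])
print(example["tokens"], "->", target)
```

In plain MLM the model would see only `tokens`; the image-conditioned variant additionally receives `image_features`, so the visual encoder is trained to carry the information needed to fill in the masked word.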

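However, these methods did not perform well; weakly supervised approaches, which train on large amounts of noisy labels such as hashtags rather than curated annotations, reportedly performed better.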

ex. Predicting the hashtags associated with Instagram images is used as the pretraining task (using only hashtags related to the ImageNet data), followed by fine-tuning on ImageNet.