StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis (CVPR 2022)

Abstract

Text-to-image synthesis

Limitation of prior work: poor generalization to input text that describes underrepresented attribute compositions or compositions never encountered during training (unseen).

e.g., faces of demographically underrepresented groups

In this paper, the authors:

Propose StyleT2I, a framework that improves the compositionality of text-to-image synthesis.
Propose a CLIP-guided Contrastive Loss that better distinguishes different compositions among different sentences.
To further improve compositionality, propose a new Semantic Matching Loss and a Spatial Constraint for identifying attributes' latent directions that manipulate only the intended spatial region → this yields better-disentangled latent representations of attributes.
Based on the identified latent directions of attributes, propose Compositional Attribute Adjustment, which adjusts the latent code and yields better compositionality in the synthesized image.
Additionally, apply L2-norm regularization to the identified latent directions (norm penalty) to balance image-text alignment against image fidelity.
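
As a rough illustration of the contrastive idea and the norm penalty listed above, the sketch below computes a generic InfoNCE-style loss in which image i should match sentence i rather than any other sentence in the batch, plus the L2 norm of a latent direction. This is not the paper's exact formulation: in StyleT2I the embeddings would come from CLIP's image and text encoders, whereas here they are hand-written toy vectors, and the function names and the temperature value are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """InfoNCE-style loss: image i is the positive for text i, all other texts are negatives."""
    losses = []
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        # Numerically stable log-sum-exp over all sentences in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the matching sentence.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def norm_penalty(direction):
    """L2 norm of a latent direction; penalizing it keeps edits to the latent code small."""
    return math.sqrt(sum(d * d for d in direction))


# Toy embeddings (hypothetical values standing in for CLIP outputs).
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]   # each image close to its own sentence
swapped_images = [[0.1, 0.9], [0.9, 0.1]]   # each image close to the wrong sentence

loss_aligned = contrastive_loss(aligned_images, texts)
loss_swapped = contrastive_loss(swapped_images, texts)
print(loss_aligned, loss_swapped)
```

The loss is small when every image sits closest to its own sentence and large when the pairs are swapped, which is the behavior a contrastive objective rewards. A larger norm penalty corresponds to a stronger edit of the latent code, which is where the trade-off between alignment and fidelity comes from.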

Text-to-image synthesis is used in many fields, but its compositionality has been overlooked.

Example: "He is wearing lipstick." The attribute composition (he, wearing lipstick) in this sentence is underrepresented in face datasets, i.e., it appears too rarely.

Previous methods fail to synthesize such images correctly, probably because they overfit to overrepresented compositions. Put simply, only compositions that are common in the data are synthesized well.

Examples of overrepresented compositions: ("she", "wearing lipstick"), ("he", not "wearing lipstick")

⇒ biases / stereotypes inherited from the dataset

⇒ these cause severe robustness and fairness issues
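
To make "overrepresented" and "underrepresented" concrete, here is a minimal sketch that counts attribute compositions in a tiny made-up annotation list and flags the rare ones. The sample counts, the `underrepresented` helper, and the 10% threshold are all assumptions chosen for illustration; they do not come from the paper or from any real face dataset.

```python
from collections import Counter

# Hypothetical annotations: (pronoun, wearing_lipstick). Counts are invented.
samples = (
    [("she", True)] * 6
    + [("he", False)] * 5
    + [("she", False)] * 2
    + [("he", True)] * 1
)

counts = Counter(samples)


def underrepresented(composition_counts, threshold=0.1):
    """Return compositions whose share of the dataset falls below `threshold`."""
    total = sum(composition_counts.values())
    return [
        composition
        for composition, n in composition_counts.items()
        if n / total < threshold
    ]


rare = underrepresented(counts)
print(rare)
```

A model that only fits the frequent pairs in such a dataset has little incentive to learn that "he" and "wearing lipstick" can co-occur.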

The model therefore must not simply "memorize" these compositions.

To this end,