Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Aggregation (CVPR2019)

Abstract

Recent semantic segmentation models mostly adopt an encoder-decoder structure.

The last layer of the decoder has usually been bilinear upsampling. The paper questions this choice.

Bilinear upsampling is overly simple and data-independent. The authors argue that this can lead to suboptimal results, and they propose a new upsampling method, DUpsampling. DUpsampling is data-dependent, i.e., its parameters can be learned from the data.
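To make the data-independent vs. data-dependent distinction concrete, here is a minimal 1-D sketch in plain Python. It is not the paper's DUpsampling formulation. It only assumes that upsampling can be written as a matrix applied to the low-resolution signal: for (bi)linear interpolation the matrix is fixed by geometry, whereas in a learnable variant its entries are free parameters. The helper names and the "learned" values are made up for illustration, and bilinear interpolation in 2-D is the same idea applied along both axes.

```python
def linear_interp_matrix(in_size, out_size):
    """Fixed (out_size x in_size) interpolation matrix; depends only on the sizes."""
    rows = []
    for o in range(out_size):
        # Map the output index to a source coordinate (corner-aligned convention).
        src = o * (in_size - 1) / (out_size - 1) if out_size > 1 else 0.0
        lo = min(int(src), in_size - 1)
        hi = min(lo + 1, in_size - 1)
        frac = src - lo
        row = [0.0] * in_size
        row[lo] += 1.0 - frac
        row[hi] += frac
        rows.append(row)
    return rows


def apply_upsampling(weights, signal):
    """Upsample a 1-D signal with an (out x in) weight matrix."""
    return [sum(w * x for w, x in zip(row, signal)) for row in weights]


# Data-independent: the matrix is identical for every dataset.
fixed = linear_interp_matrix(in_size=2, out_size=3)
print(apply_upsampling(fixed, [0.0, 1.0]))

# Data-dependent: same operation, but the entries would be adjusted by training.
# These values are hypothetical, chosen only to differ from plain averaging.
learned = [[1.0, 0.0], [0.2, 0.8], [0.0, 1.0]]
print(apply_upsampling(learned, [0.0, 1.0]))
```

With the fixed matrix the middle sample is always the plain average of its neighbours. With the learned matrix it can be whatever best reconstructs the training labels.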

DUpsampling exploits the redundancy in the label space of semantic segmentation, so it can easily replace bilinear upsampling.

It also generates the predicted label map from low-resolution feature maps without increasing complexity, and achieves better accuracy.

The reasons this is possible:

DUpsampling greatly improves reconstruction capability.
A decoder designed around DUpsampling becomes more flexible, because it can leverage almost arbitrary combinations of the CNN encoder's features.

Introduction

[Progress in segmentation since FCN]

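FCN: successfully predicts a class for every pixel. However, strided convolutions and spatial pooling across several stages shrink the final prediction to 1/32 of the input size, so fine image structure information is lost and accuracy drops, especially at object boundaries.
DeepLab: uses atrous convolutions to obtain a large receptive field while keeping the feature maps at a high resolution. It also adopts an encoder-decoder structure.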
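The two entries above can be checked with a little arithmetic. This is a sketch in plain Python, not code from either paper. It assumes five stride-2 stages (the exact layout depends on the backbone) and ignores padding effects for the size calculation. The 1-D atrous convolution is a toy stand-in for the 2-D operation, and all function names are hypothetical.

```python
def downsampled_size(input_size, num_stride2_stages):
    """Spatial size after repeated stride-2 stages (padding effects ignored)."""
    size = input_size
    for _ in range(num_stride2_stages):
        size //= 2
    return size


def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def atrous_conv1d(signal, kernel, dilation):
    """1-D dilated cross-correlation with zero padding; output keeps the input length."""
    k = len(kernel)
    span = (k - 1) * dilation            # distance between the first and last tap
    pad = span // 2
    padded = [0.0] * pad + list(signal) + [0.0] * (span - pad)
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


side = 512  # hypothetical input side length
print(downsampled_size(side, 5))         # five stride-2 stages: 1/32 of the input

signal = [1, 2, 3, 4, 5, 6, 7]
out = atrous_conv1d(signal, [1, 1, 1], dilation=2)
print(len(out) == len(signal))           # resolution is preserved
print(effective_kernel_size(3, 2))       # positions spanned by a 3-tap kernel at dilation 2
```

The point of the contrast: striding buys a larger receptive field by discarding resolution, while dilation widens the kernel's span without changing the output size.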

The encoder-decoder structure treats the backbone CNN as the encoder, which is assumed to be capable of encoding the raw input image into lower-resolution feature maps.

The decoder then recovers the pixel-wise prediction from the low-resolution feature maps. Decoders in previous work consist of a few convolutional layers followed by bilinear upsampling.

A lightweight decoder is characterized by high-resolution feature maps followed by bilinear upsampling.

The decoder usually fuses low-level features to capture the fine-grained information lost through convolution and pooling operations.
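As a rough picture of what this fusion looks like, the sketch below upsamples a low-resolution high-level feature map to the resolution of a low-level one and concatenates them along the channel axis. It is a plain-Python illustration under stated assumptions, not any specific decoder: feature maps are nested lists shaped [channel][row][column], nearest-neighbour upsampling is swapped in for bilinear to keep the code short, and the function names are hypothetical.

```python
def upsample_nearest(channel, factor):
    """Nearest-neighbour upsampling of one channel (a list of rows)."""
    return [
        [row[x // factor] for x in range(len(row) * factor)]
        for row in channel
        for _ in range(factor)          # repeat each row `factor` times
    ]


def fuse(high_level, low_level, factor):
    """Upsample every high-level channel, then concatenate along channels."""
    upsampled = [upsample_nearest(ch, factor) for ch in high_level]
    same_height = len(upsampled[0]) == len(low_level[0])
    same_width = len(upsampled[0][0]) == len(low_level[0][0])
    if not (same_height and same_width):
        raise ValueError("spatial sizes must match before concatenation")
    return low_level + upsampled


high_level = [[[1, 2], [3, 4]]]                       # 1 channel, 2x2
low_level = [[[0] * 4 for _ in range(4)] for _ in range(2)]  # 2 channels, 4x4
fused = fuse(high_level, low_level, factor=2)
print(len(fused), len(fused[0]), len(fused[0][0]))    # channels, height, width
```

Note that the fusion happens at the resolution of the low-level features, so the cost of the convolutions that follow scales with that resolution.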

[Limitations of existing segmentation]

The requirement that the decoder produce an input-sized prediction from small feature maps, at ratios such as 1/4 or 1/8 of the input size, raises two issues: