SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation (2016)

Abstract

SegNet components

Encoder
Decoder: uses the max-pooling indices recorded in the encoder to perform non-linear upsampling.
Pixel-wise classification layer
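The index-based upsampling in the decoder can be illustrated with a small sketch. The code below is not the authors' implementation; it is a pure-Python toy for a single 2-D feature map, and the function names `max_pool_2x2_with_indices` and `max_unpool_2x2` are hypothetical. It assumes 2x2 pooling with stride 2: the encoder side remembers where each maximum came from, and the decoder side writes each pooled value back to that position, leaving zeros elsewhere.

```python
def max_pool_2x2_with_indices(fmap):
    """2x2 max pooling (stride 2) on a 2-D list.

    Returns the pooled map and, for each pooled value, the (row, col)
    position it came from in the input.
    """
    h, w = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for i in range(0, h - 1, 2):
        pooled_row, index_row = [], []
        for j in range(0, w - 1, 2):
            # Collect the four values of the window with their positions.
            window = [
                (fmap[i + di][j + dj], (i + di, j + dj))
                for di in (0, 1)
                for dj in (0, 1)
            ]
            value, pos = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            index_row.append(pos)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, out_shape):
    """Place each pooled value back at its recorded position; fill the rest with 0."""
    h, w = out_shape
    out = [[0] * w for _ in range(h)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [2, 6, 3, 1],
]
pooled, indices = max_pool_2x2_with_indices(fmap)
unpooled = max_unpool_2x2(pooled, indices, (4, 4))
print(pooled)    # [[4, 5], [6, 7]]
print(unpooled)  # [[0, 0, 0, 0], [4, 0, 0, 5], [0, 0, 7, 0], [0, 6, 0, 0]]
```

The unpooled map is sparse. In SegNet the decoder follows this step with trainable convolutions that turn the sparse map into a dense one, which this sketch leaves out.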

SegNet was designed for road scene understanding, with the following considerations in mind:

Memory efficiency
Inference-time efficiency
Efficiency in the number of trainable parameters
End-to-end training

Introduction

Because SegNet targets road scene understanding, it is designed to model appearance (road, building) and shape (cars, pedestrians) well, and to capture the spatial relationships (context) between different classes such as road and sidewalk.

It is important to retain boundary information in the image representation extracted by the encoder.

Large objects that occupy most of the pixels, such as roads and buildings, should yield smooth segmentations, while the shapes of small objects such as pedestrians should also be delineated well.

From a computational standpoint, the network must be efficient in both memory and inference time.

Encoder structure

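Identical in structure to VGG16 from the input through its 13 convolutional layers
The final fully connected (FC) layers are removed to reduce the number of trainable parameters and make training easier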
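To get a rough sense of how much removing the FC layers saves, the sketch below counts weights and biases by hand. It assumes the standard VGG16 configuration (3x3 convolutions, a 224x224 RGB input so that the last feature map is 512x7x7, and a 1000-class output); these values come from VGG16 itself and are not stated in the notes above. Batch normalization parameters are not counted.

```python
# Output channels of the 13 convolutional layers in standard VGG16.
VGG16_CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256,
                       512, 512, 512, 512, 512, 512]


def conv_param_count(channels, in_channels=3, kernel=3):
    """Weights plus biases for a stack of kernel x kernel convolutions."""
    total = 0
    for out_channels in channels:
        total += in_channels * out_channels * kernel * kernel + out_channels
        in_channels = out_channels
    return total


def fc_param_count(sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


conv_params = conv_param_count(VGG16_CONV_CHANNELS)
# Assumed VGG16 classifier head: 512*7*7 -> 4096 -> 4096 -> 1000.
fc_params = fc_param_count([512 * 7 * 7, 4096, 4096, 1000])

print(conv_params)  # 14714688
print(fc_params)    # 123642856
```

Under these assumptions the convolutional layers hold about 14.7M parameters while the FC layers hold about 123.6M, so dropping the FC layers removes most of the network's parameters.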

Decoder structure