Pay Less Attention with Lightweight and Dynamic Convolutions

<aside> 💡 NLP is not a field I know well, so some of the explanations here may be wrong.

</aside>

Self-attention is useful for generative models of images and language.

In this paper, the authors

show that a very lightweight convolution can match the best self-attention results (as of 2019)
introduce dynamic convolution, which is simpler and more efficient than self-attention

Here, self-attention means the attention used in the Transformer architecture from the “Attention Is All You Need” paper.

Attention in “Attention Is All You Need”

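In self-attention,

the attention weights are computed by comparing the current time step against every element in the context
the key property is the ability to make these comparisons over a context of unrestricted size
- This seems to be a contrast with recurrent sequence models: because they consume the input sequentially, the hidden state may not retain enough information when the sentence is long, and they also need a lot of memory, so long inputs are a strain for them.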
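The comparison against every context element can be sketched as single-head scaled dot-product attention. This is a minimal illustration in plain Python, not the paper's implementation: the function names, the toy input, and the choice to reuse the same vectors as queries, keys, and values (no learned projections, no masking) are assumptions made for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention without masking.

    Each argument is a list of n vectors. Returns (outputs, weights), where
    weights[i][j] says how much time step i attends to time step j.
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Compare the current time step against every element of the context.
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(dim) for key in keys]
        row = softmax(scores)
        weights.append(row)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(row, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy sequence of three 2-dimensional vectors, used as queries, keys and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
print(len(weights), len(weights[0]))  # an n x n weight matrix
```

Every row of `weights` has one entry per context element, which is where the unrestricted context size comes from.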

However, even though self-attention can capture long-range dependencies, its cost grows quadratically with the input length, so very large context sizes are computationally very hard to handle.
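A rough count makes the quadratic growth concrete. The sketch below only counts pairwise comparisons and ignores the hidden dimension and edge effects; the kernel width of 7 is an arbitrary value picked for illustration, not a number taken from these notes.

```python
def attention_comparisons(n):
    """Every time step is compared with every time step: n * n."""
    return n * n

def conv_comparisons(n, k):
    """Every time step looks at a fixed window of k elements: n * k."""
    return n * k

# Compare the two counts as the sequence gets longer.
for n in (100, 1000, 10000):
    print(n, attention_comparisons(n), conv_comparisons(n, k=7))
```

Making the sequence ten times longer multiplies the attention count by one hundred, while the convolution count grows only tenfold.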

In this paper, the authors

introduce lightweight convolution, which is depth-wise separable, softmax-normalized, and shares weights across channels

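The result is a convolution with far fewer weights than a regular convolution that is nevertheless shown to generalize better, and it reuses the same weights for the context elements regardless of the current time step.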
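The three ingredients (depth-wise, softmax-normalized, shared across channels) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: channels are split into `num_heads` contiguous groups that each share one kernel, the window is centered with zero padding, and the kernel width is odd. The names `lightweight_conv` and `raw_kernels` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def lightweight_conv(x, raw_kernels):
    """x: n time steps, each a list of d channel values.
    raw_kernels: num_heads lists of k unnormalized weights (k assumed odd).
    """
    n, d = len(x), len(x[0])
    num_heads, k = len(raw_kernels), len(raw_kernels[0])
    assert d % num_heads == 0, "channels must split evenly into heads"
    # Softmax-normalize each kernel over its width.
    kernels = [softmax(w) for w in raw_kernels]
    pad = k // 2  # centered window; positions outside the sequence count as zero
    out = [[0.0] * d for _ in range(n)]
    for i in range(n):
        for c in range(d):
            # Depth-wise: channel c only reads channel c.
            # Weight sharing: all channels in a group use the same kernel.
            w = kernels[c * num_heads // d]
            for j in range(k):
                t = i + j - pad
                if 0 <= t < n:
                    out[i][c] += w[j] * x[t][c]
    return out

# 5 time steps, 4 channels, 2 heads, kernel width 3: only 2 * 3 = 6 weights.
x = [[1.0] * 4 for _ in range(5)]
y = lightweight_conv(x, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
print(y[2])
```

With `d` channels and width `k`, this layout stores `num_heads * k` weights, compared with `d * d * k` for a regular convolution mapping `d` channels to `d` channels. The same kernels are applied at every time step `i`.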

Dynamic convolution builds on lightweight convolution and predicts a different kernel at every time step. Unlike self-attention, whose weights depend on the entire context, this kernel is a function of the current time step only.

Dynamic convolution resembles locally connected layers in that its weights change at every position.

The difference is that the weights are not fixed after training but are dynamically generated by the model.
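A minimal sketch of this idea, assuming the kernel at each position is produced by a linear map of the current input vector followed by a softmax. The names `dynamic_conv` and `proj`, the contiguous head grouping, and the zero padding are illustrative assumptions rather than the paper's exact code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_conv(x, proj, num_heads, k):
    """x: n time steps, each a list of d channel values.
    proj: num_heads * k rows of d weights; maps x[i] to that step's kernels.
    """
    n, d = len(x), len(x[0])
    assert len(proj) == num_heads * k and d % num_heads == 0
    pad = k // 2  # centered window with zero padding (k assumed odd)
    out = [[0.0] * d for _ in range(n)]
    for i in range(n):
        # The kernel depends only on the current time step x[i].
        flat = [sum(p * v for p, v in zip(row, x[i])) for row in proj]
        kernels = [softmax(flat[h * k:(h + 1) * k]) for h in range(num_heads)]
        for c in range(d):
            w = kernels[c * num_heads // d]
            for j in range(k):
                t = i + j - pad
                if 0 <= t < n:
                    out[i][c] += w[j] * x[t][c]
    return out

# One channel, one head, width 3. The projection pushes weight onto the
# center tap whenever the current input is large.
x = [[0.0], [1.0], [0.0]]
proj = [[0.0], [10.0], [0.0]]
y = dynamic_conv(x, proj, num_heads=1, k=3)
print(y)
```

In this toy run the kernel at step 0 is uniform because `x[0]` is zero, while the kernel at step 1 concentrates on the center tap. The learned parameters (`proj`) stay fixed after training, but the kernel they generate varies with the input.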

Sequence to Sequence learning