[Draft] Using Mixed Precision in PyTorch

A post I wrote on my previous blog

[PyTorch] model.half()

Official PyTorch documentation

torch.cuda.amp provides convenience utilities for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half).


Mixed precision tries to match each operation to its appropriate datatype, which can reduce the network's runtime and memory footprint.

In PyTorch, "automatic mixed precision training" usually means using torch.cuda.amp.autocast and torch.cuda.amp.GradScaler together.

torch.cuda.amp.autocast : Enables autocasting for a chosen region of code. Autocasting automatically chooses the precision for GPU operations so as to improve performance while maintaining accuracy.
torch.cuda.amp.GradScaler : Helps perform the steps of gradient scaling conveniently. Gradient scaling improves convergence for networks with float16 gradients by minimizing gradient underflow. If the forward pass of a particular operation runs in float16, its backward pass produces float16 gradients. Gradient values with very small magnitudes may not be representable in float16; they underflow (flush to zero), so the update for the corresponding parameters is lost. To prevent this, gradient scaling multiplies the network's loss by a scale factor and runs the backward pass on the scaled loss. By the chain rule, the gradients flowing backward through the network are then scaled by the same factor, so they have larger magnitudes and do not flush to zero. ⚠️ Each parameter's gradient ( .grad ) must be unscaled before the optimizer updates the parameters, so that the scale factor does not interfere with the learning rate.
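To make the interplay of the two utilities concrete, here is a minimal sketch of one way to combine them in a training loop. The model, data, optimizer, and hyperparameters are hypothetical toy choices for illustration only. The `enabled=` flag turns both utilities into no-ops when no GPU is present, so the same loop still runs on CPU. Recent PyTorch releases also expose these utilities under `torch.amp` and may emit a deprecation warning for the `torch.cuda.amp` spelling used here.

```python
import torch
from torch import nn

# Hypothetical toy setup, for illustration only.
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

torch.manual_seed(0)
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# With enabled=False the scaler and autocast do nothing,
# which lets this sketch run without a GPU.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

inputs = torch.randn(32, 16, device=device)
targets = torch.randn(32, 4, device=device)

losses = []
for step in range(5):
    optimizer.zero_grad()

    # Only the forward pass and the loss computation are autocast.
    with torch.cuda.amp.autocast(enabled=use_cuda):
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    # Multiply the loss by the scale factor, then backpropagate,
    # so the gradients are scaled by the same factor.
    scaler.scale(loss).backward()

    # step() unscales the gradients first and skips the optimizer
    # update if any of them contain inf or NaN.
    scaler.step(optimizer)

    # Adjust the scale factor for the next iteration.
    scaler.update()

    losses.append(loss.item())

print(losses)
```

Note that `backward()` is called outside the autocast region; the backward operations run in the same dtype that autocast chose for the corresponding forward operations.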
Underflow and Overflow
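The effect can be reproduced without a GPU. The sketch below uses only the Python standard library: the `struct` module's `"e"` format rounds a value to IEEE 754 half precision. The gradient value `1e-8` and the scale factor `65536` are assumptions picked for illustration, and `to_float16` is a hypothetical helper, not part of PyTorch.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8       # hypothetical tiny gradient value
scale = 65536.0   # hypothetical scale factor (2**16)

# Underflow: the value is too small for float16 and becomes zero.
unscaled_fp16 = to_float16(grad)

# Scaling first moves the value into the representable range.
scaled_fp16 = to_float16(grad * scale)

# Unscaling happens in higher precision, recovering roughly the original.
recovered = scaled_fp16 / scale

print(unscaled_fp16)   # 0.0
print(scaled_fp16)
print(recovered)

# Overflow: values beyond the float16 range cannot be stored at all.
# struct reports this with an OverflowError.
try:
    to_float16(70000.0)
    overflowed = False
except OverflowError:
    overflowed = True

print(overflowed)      # True
```

The overflow case is the reason the scale factor cannot simply be made arbitrarily large: a scaled gradient that exceeds the float16 range is as useless as one that flushed to zero.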

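Mixed precision primarily benefits Tensor Core-enabled architectures ( Volta, Turing, Ampere ), because Tensor Cores natively accelerate FP16 matrix operations. On these architectures a speedup of roughly 2-3x is typical, though the actual gain depends on the model and workload.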

Earlier architectures ( Kepler, Maxwell, Pascal ) may show only a modest speedup.
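One rough way to tell which group a GPU falls into is its CUDA compute capability: Kepler, Maxwell, and Pascal are 3.x, 5.x, and 6.x, while Volta starts at 7.0. The helper below is a hypothetical heuristic built on that mapping, not an official check; some 7.5 (Turing) cards ship without Tensor Cores, hence the "likely" in the name.

```python
from typing import Tuple


def likely_has_tensor_cores(capability: Tuple[int, int]) -> bool:
    """Heuristic: compute capability 7.0 (Volta) or newer.

    `capability` is a (major, minor) tuple. This is an assumption-based
    shortcut and does not inspect the actual hardware.
    """
    major, _minor = capability
    return major >= 7


# Example capabilities: Pascal (6.1), Volta (7.0), Turing (7.5), Ampere (8.0).
for cap in [(6, 1), (7, 0), (7, 5), (8, 0)]:
    print(cap, likely_has_tensor_cores(cap))
```

With PyTorch on a CUDA machine, `torch.cuda.get_device_capability()` returns such a `(major, minor)` tuple for the current device and can be passed straight to the helper.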

Typical Mixed Precision Training