6. Automatic Differentiation with torch.autograd

신경망을 학습할 때 가장 자주 사용되는 알고리즘은 역전파입니다. 이 알고리즘에서는 주어진 매개변수에 대한 손실 함수의 그래디언트(기울기)에 따라 매개변수(모델 가중치)가 조정됩니다.

When training neural networks, the most frequently used algorithm is back propagation. In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.

이러한 그래디언트를 계산하기 위해 PyTorch에는 torch.autograd라는 내장 미분 엔진이 있습니다. 모든 계산 그래프에 대한 그래디언트 자동 계산을 지원합니다.

To compute those gradients, PyTorch has a built-in differentiation engine called torch.autograd. It supports automatic computation of gradient for any computational graph.

입력 x, 매개변수 w 및 b, 일부 손실 함수가 있는 가장 단순한 1 레이어 신경망을 생각해 보세요. 다음과 같은 방식으로 PyTorch에서 정의할 수 있습니다:

Consider the simplest one-layer neural network, with input x, parameters w and b, and some loss function. It can be defined in PyTorch in the following manner:

import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

Tensors, Functions and Computational graph

이 코드는 다음 계산 그래프를 정의합니다:

This code defines the following computational graph:

이 네트워크에서 w와 b는 최적화해야 하는 매개변수입니다. 따라서 이러한 변수에 대한 손실 함수의 그래디언트를 계산할 수 있어야 합니다. 이를 위해 해당 텐서의 require_grad 속성을 설정합니다.

In this network, w and b are parameters, which we need to optimize. Thus, we need to be able to compute the gradients of loss function with respect to those variables. In order to do that, we set the requires_grad property of those tensors.

Note

텐서를 생성할 때 또는 나중에 x.requires_grad_(True) 메서드를 사용하여 requires_grad 값을 설정할 수 있습니다.

You can set the value of requires_grad when creating a tensor, or later by using x.requires_grad_(True) method.

계산 그래프를 구성하기 위해 텐서에 적용하는 함수는 사실 Function 클래스의 객체입니다. 이 객체는 순방향으로 함수를 계산하는 방법과 역방향 전파 단계에서 도함수를 계산하는 방법을 알고 있습니다. 역방향 전파 함수에 대한 참조는 텐서의 grad_fn 속성에 저장됩니다.

A function that we apply to tensors to construct computational graph is in fact an object of class Function. This object knows how to compute the function in the forward direction, and also how to compute its derivative during the backward propagation step. A reference to the backward propagation function is stored in grad_fn property of a tensor.

CodeOut

print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x7fc0f6683340>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7fc0f66830a0>

Computing Gradients

신경망에서 매개변수의 가중치를 최적화하려면 매개변수에 대한 손실 함수의 도함수를 계산해야 합니다. 즉, x와 y의 고정 값 아래에 \(\frac{\partial \operatorname{loss}}{\partial w}\)와 \(\frac{\partial \operatorname{loss}}{\partial b}\)가 필요합니다. 이러한 도함수를 계산하기 위해 loss.backward()를 호출한 다음 w.grad 및 b.grad에서 값을 검색합니다:

To optimize weights of parameters in the neural network, we need to compute the derivatives of our loss function with respect to parameters, namely, we need \(\frac{\partial \operatorname{loss}}{\partial w}\) and \(\frac{\partial \operatorname{loss}}{\partial b}\) under some fixed values of x and y. To compute those derivatives, we call loss.backward(), and then retrieve the values from w.grad and b.grad:

CodeOut

loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.2750, 0.0591, 0.2190],
        [0.2750, 0.0591, 0.2190],
        [0.2750, 0.0591, 0.2190],
        [0.2750, 0.0591, 0.2190],
        [0.2750, 0.0591, 0.2190]])
tensor([0.2750, 0.0591, 0.2190])

Note

require_grad 속성이 True로 설정된 계산 그래프의 리프 노드에 대한 grad 속성만 얻을 수 있습니다. 그래프의 다른 모든 노드에서는 그래디언트를 사용할 수 없습니다.
성능상의 이유로 주어진 그래프에서 한 번만 backward를 사용하여 그래디언트 계산을 수행할 수 있습니다. 동일한 그래프에서 backward 호출을 여러 번 수행해야 하는 경우 retain_graph=True를 backward에 전달해야 합니다.

We can only obtain the grad properties for the leaf nodes of the computational graph, which have requires_grad property set to True. For all other nodes in our graph, gradients will not be available.

We can only perform gradient calculations using backward once on a given graph, for performance reasons. If we need to do several backward calls on the same graph, we need to pass retain_graph=True to the backward call.

Disabling Gradient Tracking

기본적으로, require_grad=True인 모든 텐서는 계산 기록을 추적하고 그래디언트 계산을 지원합니다. 그러나 그렇게 할 필요가 없는 경우가 있습니다. 예를 들어 모델을 학습하고 일부 입력 데이터에 적용하려는 경우, 즉 네트워크를 통해 순방향 계산만 수행하려는 경우가 있습니다. 계산 코드를 torch.no_grad() 블록으로 둘러싸서 계산 추적을 중지할 수 있습니다:

By default, all tensors with requires_grad=True are tracking their computational history and support gradient computation. However, there are some cases when we do not need to do that, for example, when we have trained the model and just want to apply it to some input data, i.e. we only want to do forward computations through the network. We can stop tracking computations by surrounding our computation code with torch.no_grad() block:

CodeOut

z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

True
False

동일한 결과를 얻는 또 다른 방법은 텐서에서 detach() 메서드를 사용하는 것입니다:

Another way to achieve the same result is to use the detach() method on the tensor:

CodeOut

z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)

False

그래디언트 추적을 비활성화하려는 이유는 다음과 같습니다:

신경망의 일부 매개변수를 고정 매개변수로 표시하기 위함입니다. 이것은 사전 학습된 네트워크를 미세 조정하는 매우 일반적인 시나리오입니다.
그래디언트를 추적하지 않는 텐서에 대한 계산이 더 효율적이기 때문에 순방향 패스만 수행할 때 계산 속도를 높이기 위함입니다.

There are reasons you might want to disable gradient tracking:

To mark some parameters in your neural network as frozen parameters. This is a very common scenario for finetuning a pretrained network

To speed up computations when you are only doing forward pass, because computations on tensors that do not track gradients would be more efficient.

More on Computational Graphs

개념적으로 autograd는 Function 객체로 구성된 방향성 비순환 그래프(DAG)에서 데이터(텐서) 및 실행된 모든 작업(결과로 생성되는 새 텐서와 함께)의 레코드를 유지합니다. 이 DAG에서 리프는 입력 텐서이고 루트는 출력 텐서입니다. 이 그래프를 루트에서 리프까지 추적하면 체인 규칙을 사용하여 그래디언트를 자동으로 계산할 수 있습니다.

Conceptually, autograd keeps a record of data (tensors) and all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

순방향 패스에서 autograd는 두 가지 작업을 동시에 수행합니다:

요청된 작업을 실행하여 결과 텐서를 계산합니다.
DAG에서 작업의 그래디언트 함수를 유지합니다.

In a forward pass, autograd does two things simultaneously:

run the requested operation to compute a resulting tensor

maintain the operation’s gradient function in the DAG

DAG 루트에서 .backward()가 호출되면 역방향 패스가 시작됩니다. 그런 다음 autograd는:

각 .grad_fn에서 그래디언트를 계산합니다.
계산된 그래디언트를 각 텐서의 .grad 속성에 누적합니다.
체인 규칙을 사용하여 리프 텐서까지 전파합니다.

The backward pass kicks off when .backward() is called on the DAG root. autograd then:

computes the gradients from each .grad_fn,

accumulates them in the respective tensor’s .grad attribute

using the chain rule, propagates all the way to the leaf tensors.

Note

DAG는 PyTorch에서 동적입니다. 주목해야 할 중요한 점은 그래프가 처음부터 다시 생성된다는 것입니다. 각 .backward() 호출 후 autograd는 새 그래프를 채우기 시작합니다. 이것이 바로 모델에서 제어 흐름 문을 사용할 수 있게 해주는 것입니다. 필요한 경우 모든 반복에서 모양, 크기 및 작업을 변경할 수 있습니다.

DAGs are dynamic in PyTorch. An important thing to note is that the graph is recreated from scratch; after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.

Optional Reading: Tensor Gradients and Jacobian Products

많은 경우에 스칼라 손실 함수가 있고 일부 매개변수에 대한 그래디언트를 계산해야 합니다. 그러나 출력 함수가 임의의 텐서인 경우가 있습니다. 이 경우 PyTorch를 사용하면 실제 그래디언트가 아닌 소위 Jacobian 곱을 계산할 수 있습니다.

In many cases, we have a scalar loss function, and we need to compute the gradient with respect to some parameters. However, there are cases when the output function is an arbitrary tensor. In this case, PyTorch allows you to compute so-called Jacobian product, and not the actual gradient.

\(\vec{x}=\left\langle x_1, \ldots, x_n\right\rangle\)와 \(\vec{y}=\left\langle y_1, \ldots, y_m\right\rangle\)인 벡터 함수 \(\vec{y}=f(\vec{x})\)의 경우 \(\vec{x}\)에 대한 \(\vec{y}\)의 기울기는 다음과 같이 Jacobian 행렬로 지정됩니다:

For a vector function \(\vec{y}=f(\vec{x})\), where \(\vec{x}=\left\langle x_1, \ldots, x_n\right\rangle\) and \(\vec{y}=\left\langle y_1, \ldots, y_m\right\rangle\), a gradient of \(\vec{y}\) with respect to \(\vec{x}\) is given by Jacobian matrix:

\[ J=\left(\begin{array}{ccc} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{array}\right) \]

Jacobian 행렬 자체를 계산하는 대신 PyTorch를 사용하면 주어진 입력 벡터 \(v=\left(v_1 \ldots v_m\right)\)에 대해 Jacobian 곱 \(v^T \cdot J\)를 계산할 수 있습니다. 이는 \(v\)를 인수로 사용하여 backward를 호출하여 달성됩니다. \(v\)의 크기는 곱을 계산하려는 원래 텐서의 크기와 같아야 합니다:

Instead of computing the Jacobian matrix itself, PyTorch allows you to compute Jacobian Product \(v^T \cdot J\) for a given input vector \(v=\left(v_1 \ldots v_m\right)\). This is achieved by calling backward with \(v\) as an argument. The size of \(v\) should be the same as the size of the original tensor, with respect to which we want to compute the product:

CodeOut

inp = torch.eye(4, 5, requires_grad=True)
out = (inp+1).pow(2).t()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"First call\n{inp.grad}")
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nSecond call\n{inp.grad}")
inp.grad.zero_()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nCall after zeroing gradients\n{inp.grad}")

First call
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])

Second call
tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.]])

Call after zeroing gradients
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.]])

동일한 인수를 사용하여 두 번째로 backward를 호출하면 그래디언트 값이 달라집니다. 이는 backward 전파를 수행할 때 PyTorch가 그래디언트를 누적하기 때문에 발생합니다. 즉, 계산된 그래디언트 값이 계산 그래프의 모든 리프 노드의 grad 속성에 추가됩니다. 적절한 그래디언트를 계산하려면 먼저 grad 속성을 0으로 만들어야 합니다. 실생활 학습에서 옵티마이저는 우리가 이것을 하도록 도와줍니다.

Notice that when we call backward for the second time with the same argument, the value of the gradient is different. This happens because when doing backward propagation, PyTorch accumulates the gradients, i.e. the value of computed gradients is added to the grad property of all leaf nodes of computational graph. If you want to compute the proper gradients, you need to zero out the grad property before. In real-life training an optimizer helps us to do this.

Note

이전에는 매개변수 없이 backward() 함수를 호출했습니다. 이는 본질적으로 backward(torch.tensor(1.0))를 호출하는 것과 동일하며, 신경망 학습 중 손실과 같은 스칼라 값 함수의 경우 그래디언트를 계산하는 유용한 방법입니다.

Previously we were calling backward() function without parameters. This is essentially equivalent to calling backward(torch.tensor(1.0)), which is a useful way to compute the gradients in case of a scalar-valued function, such as loss during neural network training.

References

https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html