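Step-by-step example of reverse-mode automatic differentiation

I'm not sure whether this question belongs here, but it is closely related to gradient methods in optimization. Anyway, feel free to migrate it if you think some other community has better expertise on the topic.

In short, I'm looking for a step-by-step example of reverse-mode automatic differentiation. There is not much literature on the topic, and existing implementations (like the one in TensorFlow) are hard to understand without knowing the theory behind them. So I'd be very thankful if somebody could show in detail what we pass in, how we process it, and what we take out of the computational graph.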

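A couple of questions that I have the most difficulty with:

Seeds: why do we need them at all?
Reverse differentiation rules: I know how to do forward differentiation, but how do we go backward? E.g. in the example from this section, how do we know that $\bar{w_2}=\bar{w_3}w_1$?
Do we work with symbols only, or do we pass actual values through? E.g. in the same example, are $w_i$ and $\bar{w_i}$ symbols or values?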

— ffriend

Appendix D of "Hands-On Machine Learning with Scikit-Learn & TensorFlow" gives, in my opinion, a very good explanation. I recommend it.

— Agustin Barrachina

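Let's say we have the expression $z = x_1x_2 + \sin(x_1)$ and want to find the derivatives $\frac{dz}{dx_1}$ and $\frac{dz}{dx_2}$. Reverse-mode AD splits this task into two parts: a forward pass and a reverse pass.

Forward pass

First, we decompose the complex expression into a set of primitive ones, i.e. expressions consisting of at most a single function call. Note that I also rename the input and output variables for consistency, though it's not necessary: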

$w_1 = x_1$

$w_2 = x_2$

$w_3 = w_1w_2$

$w_4 = \sin(w_1)$

$w_5 = w_3 + w_4$

$z = w_5$

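The advantage of this representation is that the differentiation rule for each individual expression is already known. For example, we know that the derivative of $\sin$ is $\cos$, so $\frac{dw_4}{dw_1} = \cos(w_1)$. We will use this fact in the reverse pass below.

Essentially, the forward pass consists of evaluating each of these expressions and saving the results. Say our inputs are $x_1 = 2$ and $x_2 = 3$. Then we have: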

$w_1 = x_1 = 2$

$w_2 = x_2 = 3$

$w_3 = w_1w_2 = 6$

$w_4 = \sin(w_1) \approx 0.9$

$w_5 = w_3 + w_4 \approx 6.9$

$z = w_5 \approx 6.9$
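The forward pass is easy to reproduce in a few lines of plain Python. This is only a sketch of the bookkeeping (one variable per primitive expression), not how any particular library stores its intermediate values:

```python
import math

# Inputs from the example above.
x1, x2 = 2.0, 3.0

# Forward pass: evaluate each primitive expression in order and keep the result.
w1 = x1
w2 = x2
w3 = w1 * w2
w4 = math.sin(w1)
w5 = w3 + w4
z = w5

print(round(w4, 3), round(z, 3))  # 0.909 6.909
```

The stored values of $w_1$ and $w_2$ are exactly what the reverse pass needs later on.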

Reverse pass

This is where the magic starts, and it starts with the chain rule. In its basic form, the chain rule states that if you have a variable $t(u(v))$ which depends on $u$, which in its turn depends on $v$, then:

$\frac{dt}{dv} = \frac{dt}{du}\frac{du}{dv}$

or, if $t$ depends on $v$ via several paths / variables $u_i$, e.g.:

$u_1 = f(v)$

$u_2 = g(v)$

$t = h(u_1, u_2)$

then (see proof here):

$\frac{dt}{dv} = \sum_i \frac{dt}{du_i}\frac{du_i}{dv}$
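If the multi-path version of the chain rule looks suspicious, it can be checked numerically. The functions below ($f(v) = v^2$, $g(v) = \sin(v)$, $h(u_1, u_2) = u_1u_2$) are arbitrary choices made up for this check and are not part of the example above:

```python
import math

def t_of_v(v):
    u1 = v ** 2          # u1 = f(v)
    u2 = math.sin(v)     # u2 = g(v)
    return u1 * u2       # t = h(u1, u2)

v = 1.5
u1, u2 = v ** 2, math.sin(v)

# Sum over both paths: dt/du1 * du1/dv + dt/du2 * du2/dv
chain_rule = u2 * (2 * v) + u1 * math.cos(v)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (t_of_v(v + eps) - t_of_v(v - eps)) / (2 * eps)

print(abs(chain_rule - numeric) < 1e-6)  # True
```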

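In terms of the expression graph, if we have a final node $z$ and input nodes $w_i$, and the path from $z$ to $w_i$ goes through intermediate nodes $w_p$ (i.e. $z = g(w_p)$ where $w_p = f(w_i)$), we can find the derivative $\frac{dz}{dw_i}$ as

$\frac{dz}{dw_i} = \sum_{p \in \mathrm{parents}(i)} \frac{dz}{dw_p} \frac{dw_p}{dw_i}$

In other words, to calculate the derivative of the output variable $z$ w.r.t. any intermediate or input variable $w_i$, we only need to know the derivatives of its parents and the formula for the derivative of the primitive expression $w_p = f(w_i)$.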

The reverse pass starts at the end (i.e. $\frac{dz}{dz}$) and propagates backward to all dependencies. The starting value is the "seed":

$\frac{dz}{dz} = 1$

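This can be read as "a change in $z$ results in exactly the same change in $z$", which is quite obvious. Then we know that $z = w_5$, and so:

$\frac{dz}{dw_5} = 1$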

$w_5$ depends linearly on $w_3$ and $w_4$, so $\frac{dw_5}{dw_3} = 1$ and $\frac{dw_5}{dw_4} = 1$. Using the chain rule we find:

$\frac{dz}{dw_3} = \frac{dz}{dw_5} \frac{dw_5}{dw_3} = 1 \times 1 = 1$

$\frac{dz}{dw_4} = \frac{dz}{dw_5} \frac{dw_5}{dw_4} = 1 \times 1 = 1$

From the definition $w_3 = w_1w_2$ and the rules of partial derivatives, we find that $\frac{dw_3}{dw_2} = w_1$. Thus:

$\frac{dz}{dw_2} = \frac{dz}{dw_3} \frac{dw_3}{dw_2} = 1 \times w_1 = w_1$

Which, as we already know from the forward pass, is:

$\frac{dz}{dw_2} = w_1 = 2$

Finally, $w_1$ contributes to $z$ via $w_3$ and $w_4$. Once again, from the rules of partial derivatives we know that $\frac{dw_3}{dw_1} = w_2$ and $\frac{dw_4}{dw_1} = \cos(w_1)$. Thus:

$\frac{dz}{dw_1} = \frac{dz}{dw_3} \frac{dw_3}{dw_1} + \frac{dz}{dw_4} \frac{dw_4}{dw_1} = w_2 + \cos(w_1)$

And again, given the known inputs, we can calculate it:

$\frac{dz}{dw_1} = w_2 + \cos(w_1) = 3 + \cos(2) \approx 2.58$

Since $w_1$ and $w_2$ are just aliases for $x_1$ and $x_2$, we get our answer:

$\frac{dz}{dx_1} \approx 2.58$

$\frac{dz}{dx_2} = 2$

And that's it!
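The whole reverse pass can be sketched in plain Python. The `parents` table below is a hypothetical hand-written encoding of this one graph (real AD tools record it automatically during the forward pass); each entry pairs a parent $w_p$ with the local derivative $\frac{dw_p}{dw_i}$, evaluated with the values saved in the forward pass:

```python
import math

# Values from the forward pass of z = x1*x2 + sin(x1) at x1 = 2, x2 = 3.
w = {1: 2.0, 2: 3.0}
w[3] = w[1] * w[2]
w[4] = math.sin(w[1])
w[5] = w[3] + w[4]

# parents[i] lists (p, dw_p/dw_i) for every node w_p computed directly from w_i.
parents = {
    5: [],
    4: [(5, 1.0)],
    3: [(5, 1.0)],
    2: [(3, w[1])],
    1: [(3, w[2]), (4, math.cos(w[1]))],
}

dz = {5: 1.0}  # dz/dw5 = 1 because z = w5 (this is the seed)
for i in (4, 3, 2, 1):  # every parent is handled before the nodes it depends on
    dz[i] = sum(dz[p] * local for p, local in parents[i])

print(dz[2], round(dz[1], 2))  # 2.0 2.58
```

Note that only numbers flow through this loop, no symbols: every local derivative is already a value by the time the reverse pass reaches it.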

This description concerns only scalar inputs, i.e. numbers, but in fact it can also be applied to multidimensional arrays such as vectors and matrices. Two things that one should keep in mind when differentiating expressions with such objects:

Derivatives may have much higher dimensionality than inputs or output, e.g. derivative of vector w.r.t. vector is a matrix and derivative of matrix w.r.t. matrix is a 4-dimensional array (sometimes referred to as a tensor). In many cases such derivatives are very sparse.
Each component in output array is an independent function of 1 or more components of input array(s). E.g. if $y = f(x)$ and both $x$ and $y$ are vectors, $y_i$ never depends on $y_j$ , but only on subset of $x_k$ . In particular, this means that finding derivative $\frac{dy_i}{dx_j}$ boils down to tracking how $y_i$ depends on $x_j$ .
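To make the sparsity point concrete, here is a small sketch for an elementwise function $y_i = x_i^2$ (a function picked just for illustration). Because $y_i$ depends only on $x_i$, the Jacobian is diagonal:

```python
def jacobian_of_square(x):
    # For y_i = x_i**2: dy_i/dx_j = 2*x_i when i == j, and 0 otherwise.
    n = len(x)
    return [[2 * x[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

J = jacobian_of_square([1.0, 2.0, 3.0])
for row in J:
    print(row)
# [2.0, 0.0, 0.0]
# [0.0, 4.0, 0.0]
# [0.0, 0.0, 6.0]
```

For a vector of length $n$ the full matrix has $n^2$ entries but only $n$ of them are nonzero, so a practical implementation would typically avoid materializing it.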

The power of automatic differentiation is that it can deal with programming-language constructs such as conditionals and loops. However, if all you need is algebraic expressions and you have a good enough framework for working with symbolic representations, it's possible to construct fully symbolic expressions. In fact, in this example we could produce the expression $\frac{dz}{dw_1} = w_2 + \cos(w_1) = x_2 + \cos(x_1)$ and calculate this derivative for whatever inputs we want.
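To illustrate the point about loops and conditions, here is a minimal operator-overloading sketch. The `Var` class and the `sin`, `backward` and `f` functions are hypothetical helpers written for this answer and do not mirror any library's API. The graph is recorded while ordinary Python code runs, so the loop is simply unrolled and only the branch actually taken gets differentiated:

```python
import math

class Var:
    """A value plus the bookkeeping needed for a reverse pass."""

    def __init__(self, value, inputs=()):
        self.value = value
        self.inputs = inputs  # pairs of (input Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(a):
    return Var(math.sin(a.value), ((a, math.cos(a.value)),))

def backward(out):
    # Order the graph so each node comes before its inputs, then push
    # derivatives from the output back toward the inputs.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for inp, _ in node.inputs:
                visit(inp)
            order.append(node)

    visit(out)
    out.grad = 1.0  # the seed dz/dz = 1
    for node in reversed(order):
        for inp, local in node.inputs:
            inp.grad += node.grad * local

def f(x, n):
    # Plain Python control flow; the branch depends on the current value.
    y = x
    for _ in range(n):
        y = y * x if y.value < 10 else sin(y)
    return y

x = Var(2.0)
out = f(x, 4)  # three multiplications, then sin: out = sin(x**4)
backward(out)
print(out.value, x.grad)
```

For $x = 2$ the recorded computation is $\sin(x^4)$, so `x.grad` should agree with $4x^3\cos(x^4) = 32\cos(16)$.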

— ffriend

Very useful question/answer. Thanks. Just a little criticism: you seem to move on to a tree structure without explaining it (that's when you start talking about parents, etc.).

— MadHatter

Also it won't hurt clarifying why we need seeds.

— MadHatter

@MadHatter thanks for the comment. I tried to rephrase a couple of paragraphs (those that refer to parents) to emphasize the graph structure. I also added "seed" to the text, although the name itself may be misleading in my opinion: in AD the seed is always a fixed expression, $\frac{dz}{dz} = 1$, not something you can choose or generate.

— ffriend

Thanks! I noticed when you have to set more than one "seed", generally one chooses 1 and 0. I'd like to know why. I mean, one takes the "quotient" of a differential w.r.t. itself, so "1" is at least intuitively justified.. But what about 0? And what if one has to pick more than 2 seeds?

— MadHatter

As far as I understand, more than one seed is used only in forward-mode AD. In this case you set the seed to 1 for an input variable you want to differentiate with respect to and set the seed to 0 for all the other input variables so that they don't contribute to the output value. In reverse-mode you set the seed to an output variable, and you normally have only one output variable. I guess, you can construct reverse-mode AD pipeline with several output variables and set all of them but one to 0 to get the same effect as in forward mode, but I have never investigated this option.

— ffriend
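For what it's worth, the forward-mode seeding described in the last comment can be sketched with dual numbers. The `Dual` class and the helper functions are made up for illustration and are not taken from any library; each value carries its derivative with respect to whichever input was seeded with 1:

```python
import math

class Dual:
    """A value together with its derivative w.r.t. the seeded input."""

    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def sin(a):
    return Dual(math.sin(a.value), math.cos(a.value) * a.deriv)

def z(x1, x2):
    return x1 * x2 + sin(x1)

# Seed 1 on x1 and 0 on x2 gives dz/dx1; swapping the seeds gives dz/dx2.
dz_dx1 = z(Dual(2.0, 1.0), Dual(3.0, 0.0)).deriv
dz_dx2 = z(Dual(2.0, 0.0), Dual(3.0, 1.0)).deriv
print(round(dz_dx1, 2), dz_dx2)  # 2.58 2.0
```

The function has to be evaluated once per input here, whereas the reverse pass in the answer produces both derivatives from a single seed on the output.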