18

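I am trying to understand the role of the derivative of the sigmoid function in neural networks.

First I plot the sigmoid function and its derivative at every point, computed from the definition, using Python. What exactly is the role of this derivative?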

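import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # Logistic sigmoid: maps any real input into the interval (0, 1)
    return 1 / (1 + np.exp(-x))

def derivative(x, step):
    # Central finite-difference approximation of the slope of sigmoid at x
    return (sigmoid(x + step) - sigmoid(x - step)) / (2 * step)

x = np.linspace(-10, 10, 1000)

y1 = sigmoid(x)
# A step of 1e-13 is close to floating-point precision and gives a noisy
# curve, so a moderately small step is used instead
y2 = derivative(x, 1e-5)

plt.plot(x, y1, label='sigmoid')
plt.plot(x, y2, label='derivative')
plt.legend(loc='upper left')
plt.show()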

machine-learning neural-network

— lukassz

2

If you have any further questions, don't hesitate to ask.

— JahKnows

23

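The derivative is used in neural networks during the training process called backpropagation. This technique uses gradient descent to find an optimal set of model parameters that minimizes a loss function. In your example you must use the derivative of the sigmoid because that is the activation your individual neurons are using.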

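The loss function

The essence of machine learning is to optimize some target function, either minimizing or maximizing it. This function is typically called the loss or cost function, and we typically want to minimize it. The cost function $C$ is a function of the model parameters and assigns a penalty based on the errors that result when data is passed through your model.

Let's look at an example where we try to label whether an image contains a cat or a dog. If we had a perfect model, we could give it a picture and it would tell us whether it is a cat or a dog. However, no model is perfect and it will make mistakes.

When we train our model to infer meaning from input data, we want to minimize the number of mistakes it makes. So we use a training set: this data contains many pictures of dogs and cats, and each image has an associated ground truth label. Each time we run a training iteration of the model we calculate the cost (the amount of mistakes) of the model. We want to minimize this cost.

Many cost functions exist, each serving its own purpose. A commonly used cost function is the quadratic cost, which is defined as

$C = \frac{1}{N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$ .

This is the mean of the squared differences between the predicted labels and the ground truth labels over the $N$ images that we trained on. We want to minimize this in some way.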
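As a rough illustration of the quadratic cost, here is a minimal sketch in Python. The function name `quadratic_cost` and the four predictions and labels are made up for this example; they are not data from the question.

```python
import numpy as np

def quadratic_cost(y_hat, y):
    """Mean of the squared differences between predictions and labels."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)

# Hypothetical model outputs for four images (1 = dog, 0 = cat)
predictions = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 1, 1]

cost = quadratic_cost(predictions, labels)
print(cost)
```

The closer the predictions are to the labels, the smaller the value; a perfect model would give a cost of 0.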

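Minimizing a loss function

Indeed, most of machine learning is simply a family of frameworks that determine a distribution by minimizing some cost function. The question we can ask is "how can we minimize a function?"

Let's minimize the following function

$y = x^2-4x+6$ .

If we plot this we can see that there is a minimum at $x = 2$ . To find it analytically we take the derivative of this function and set it to zero,

$\frac{dy}{dx} = 2x - 4 = 0$

which gives $x = 2$ .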
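The analytic step can be checked numerically. The sketch below uses NumPy's polynomial helpers (`np.polyder`, `np.roots`, `np.polyval`); the variable names are chosen only for this illustration.

```python
import numpy as np

coeffs = [1, -4, 6]              # y = x^2 - 4x + 6, highest power first
dcoeffs = np.polyder(coeffs)     # coefficients of dy/dx = 2x - 4

x_min = np.roots(dcoeffs)[0]     # solve dy/dx = 0
y_min = np.polyval(coeffs, x_min)

print(x_min, y_min)
```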

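However, finding a global minimum analytically is often not feasible. Instead we use optimization techniques. Many different methods exist here as well, such as Newton-Raphson, grid search, and so on. Among these is gradient descent, which is the technique used by neural networks.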

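Gradient descent

To understand this, let's use a popular analogy. Imagine a 2D minimization problem. This is equivalent to being on a mountainous hike in the wilderness. You want to get back down to the village, which you know is at the lowest point. Even if you do not know in which direction the village lies, all you need to do is keep taking the steepest way down and you will eventually reach the village. So we descend the surface based on the steepness of the slope.

Let's look at our function

$y = x^2-4x+6$

We want to determine the $x$ for which $y$ is minimized. The gradient descent algorithm first picks an arbitrary value for $x$ . Let us initialize at $x=8$ . The algorithm then does the following iteratively until it converges.

$x^{new} = x^{old} - \nu \frac{dy}{dx}$

where $\nu$ is the learning rate, which we can set to whatever value we like. However, there is a smart way to choose it. Too big and we will never reach our minimum value; too small and we will waste a lot of time before we get there. It is analogous to the size of the steps you take down the steep slope. Take small steps and you will die on the mountain, because you will never get down. Take too large a step and you risk overshooting the village and ending up on the other side of the mountain. The derivative is the means by which we travel down this slope towards our minimum.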

$\frac{dy}{dx} = 2x - 4$

$\nu = 0.1$

Iterations, starting from $x = 8$ :

$x^{new} = 8 - 0.1(2 * 8 - 4) = 6.8$
$x^{new} = 6.8 - 0.1(2 * 6.8 - 4) = 5.84$
$x^{new} = 5.84 - 0.1(2 * 5.84 - 4) = 5.07$
$x^{new} = 5.07 - 0.1(2 * 5.07 - 4) = 4.45$
$x^{new} = 4.45 - 0.1(2 * 4.45 - 4) = 3.96$
$x^{new} = 3.96 - 0.1(2 * 3.96 - 4) = 3.57$
$x^{new} = 3.57 - 0.1(2 * 3.57 - 4) = 3.25$
$x^{new} = 3.25 - 0.1(2 * 3.25 - 4) = 3.00$
$x^{new} = 3.00 - 0.1(2 * 3.00 - 4) = 2.80$
$x^{new} = 2.80 - 0.1(2 * 2.80 - 4) = 2.64$
$x^{new} = 2.64 - 0.1(2 * 2.64 - 4) = 2.51$
$x^{new} = 2.51 - 0.1(2 * 2.51 - 4) = 2.41$
$x^{new} = 2.41 - 0.1(2 * 2.41 - 4) = 2.32$
$x^{new} = 2.32 - 0.1(2 * 2.32 - 4) = 2.26$
$x^{new} = 2.26 - 0.1(2 * 2.26 - 4) = 2.21$
$x^{new} = 2.21 - 0.1(2 * 2.21 - 4) = 2.16$
$x^{new} = 2.16 - 0.1(2 * 2.16 - 4) = 2.13$
$x^{new} = 2.13 - 0.1(2 * 2.13 - 4) = 2.10$
$x^{new} = 2.10 - 0.1(2 * 2.10 - 4) = 2.08$
$x^{new} = 2.08 - 0.1(2 * 2.08 - 4) = 2.06$
$x^{new} = 2.06 - 0.1(2 * 2.06 - 4) = 2.05$
$x^{new} = 2.05 - 0.1(2 * 2.05 - 4) = 2.04$
$x^{new} = 2.04 - 0.1(2 * 2.04 - 4) = 2.03$
$x^{new} = 2.03 - 0.1(2 * 2.03 - 4) = 2.02$
$x^{new} = 2.02 - 0.1(2 * 2.02 - 4) = 2.02$
$x^{new} = 2.02 - 0.1(2 * 2.02 - 4) = 2.01$
$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.01$
$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.01$
$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.00$
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00$
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00$
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00$
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00$

And we see that the algorithm converges at $x = 2$ ! We have found the minimum.
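The iterations above can be reproduced with a short loop. This is a minimal sketch; the function names `dy_dx` and `gradient_descent` are chosen for this illustration, and the second run with a learning rate of 1.1 is an added assumption to show what "too big" looks like.

```python
def dy_dx(x):
    # Derivative of y = x^2 - 4x + 6
    return 2 * x - 4

def gradient_descent(x0, learning_rate, n_steps):
    """Apply x_new = x_old - learning_rate * dy/dx and record every x."""
    x = x0
    history = [x]
    for _ in range(n_steps):
        x = x - learning_rate * dy_dx(x)
        history.append(x)
    return history

# Same setup as the worked iterations: start at x = 8, learning rate 0.1
path = gradient_descent(x0=8.0, learning_rate=0.1, n_steps=33)
print([round(x, 2) for x in path[:4]], "...", round(path[-1], 2))

# A learning rate that is too large overshoots further on every step
diverging = gradient_descent(x0=8.0, learning_rate=1.1, n_steps=10)
print(round(diverging[-1], 2))
```

With the small learning rate the distance to the minimum shrinks by a constant factor on each step; with the large one it grows, so the iterate moves away from $x = 2$.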

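Applied to neural networks

Consider a single neuron that takes in some inputs $x$ and produces an output $\hat{y}$ . A commonly used activation function is the sigmoid function

$\sigma(z) = \frac{1}{1+exp(-z)}$

$\hat{y}(w^Tx) = \frac{1}{1+exp(-(w^Tx + b))}$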

where $w$ is the associated weight for each input $x$ and we have a bias $b$ . We then want to minimize our cost function

$C = \frac{1}{2N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$ .

How to train the neural network?

We will use gradient descent to train the weights based on the output of the sigmoid function and we will use some cost function $C$ and train on batches of data of size $N$ .

$C = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$

$\hat{y}$ is the predicted class obtained from the sigmoid function and $y$ is the ground truth label. We will use gradient descent to minimize the cost function with respect to the weights $w$ . To make life easier we will split the derivative as follows

$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w}$ .

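For a single training example this gives

$\frac{\partial C}{\partial \hat{y}} = \hat{y} - y$

and we have that $\hat{y} = \sigma(z)$ with $z = w^Tx + b$ , and the derivative of the sigmoid function is $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1-\sigma(z))$ . Since $\frac{\partial z}{\partial w} = x$ , the chain rule gives

$\frac{\partial \hat{y}}{\partial w} = \sigma(z)(1-\sigma(z))\,x = \frac{1}{1+exp(-(w^Tx + b))} \left(1 - \frac{1}{1+exp(-(w^Tx + b))}\right) x$ .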

So we can then update the weights through gradient descent as

$w^{new} = w^{old} - \eta \frac{\partial C}{\partial w}$

where $\eta$ is the learning rate.
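To make the update rule concrete, here is a minimal sketch of training one sigmoid neuron with batch gradient descent on the quadratic cost. Everything specific in it is an assumption made for illustration: the toy data (the logical OR of two inputs), the learning rate, the number of epochs, and the names `train_neuron` and `cost`. Note that the gradient with respect to $w$ also carries the input $x$ as a factor, which comes from differentiating $w^Tx + b$ in the chain rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b):
    # Quadratic cost C = 1/(2N) * sum((y_hat - y)^2)
    return np.mean((sigmoid(X @ w + b) - y) ** 2) / 2.0

def train_neuron(X, y, eta=0.5, n_epochs=5000):
    """Batch gradient descent for a single sigmoid neuron."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_epochs):
        y_hat = sigmoid(X @ w + b)
        # Chain rule: dC/dz = (y_hat - y) * sigmoid'(z),
        # where sigmoid'(z) = y_hat * (1 - y_hat)
        delta = (y_hat - y) * y_hat * (1.0 - y_hat)
        grad_w = X.T @ delta / n_samples   # dz/dw = x
        grad_b = delta.mean()              # dz/db = 1
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

# Toy data (made up for this sketch): logical OR of two binary inputs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

initial_cost = cost(X, y, np.zeros(2), 0.0)
w, b = train_neuron(X, y)
final_cost = cost(X, y, w, b)
print(initial_cost, final_cost)
print(np.round(sigmoid(X @ w + b), 2))
```

The sigmoid derivative appears in `delta`: where the neuron's output is close to 0 or 1 that factor is small, so the corresponding sample contributes little to the weight update.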

— JahKnows

2

Please tell me, why is this process not described so nicely in books? Do you have a blog? What materials for learning neural networks do you recommend? I have test data and I want to train on it. Can I draw the function that I will minimize? I would like to visualize this process to understand it better.

— lukassz

Can you explain backpropagation in this simple way?

— lukassz

1

Amazing Answer...(+1)

— Aditya

1

Backprop is also similar to what JahKnows has explained above. It's just that the gradient is carried all the way from the outputs back to the inputs. A quick Google search will make this clear. The same goes for every other activation function.

— Aditya

1

@lukassz, notice that his equation is the same as the one I have for the weight update in the second-to-last equation. $\frac{\partial C}{\partial w} = (\hat{y} - y) * \text{derivative of sigmoid}$ . He uses the same cost function as me. Don't forget that you need to take the derivative of the loss function too, which becomes $\hat{y} - y$ , where $\hat{y}$ are the predicted labels and $y$ are the ground truth labels.

— JahKnows

2

During the phase where the neural network generates its prediction, it feeds the input forward through the network. For each layer, the layer's input $X$ first goes through an affine transformation $W \cdot X + b$ and is then passed through the sigmoid function $\sigma(W \cdot X + b)$ .

In order to train the network, the output $\hat y$ is then compared to the expected output (or label) $y$ through a cost function $L(y, \hat y)=L\left(y, \sigma(W \cdot X + b)\right)$ . The goal of the whole training procedure is to minimize that cost function. To do that, a technique called gradient descent is used, which calculates how we should change $W$ and $b$ so that the cost decreases.

Gradient descent requires calculating the derivative of the cost function w.r.t. $W$ and $b$ . To do that we must apply the chain rule, because the cost is a composition of functions: the cost function applied to the sigmoid, which is in turn applied to the affine transformation. As dictated by the chain rule, we must therefore calculate the derivative of the sigmoid function.

One of the reasons that the sigmoid function is popular with neural networks, is because its derivative is easy to compute.
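The closed form $\sigma'(z) = \sigma(z)(1-\sigma(z))$ reuses the value already computed in the forward pass. The sketch below compares it with a central finite-difference estimate; the helper names and the step size `h` are choices made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Closed-form derivative: only needs the sigmoid value itself
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z, h=1e-5):
    # Numerical estimate of f'(z) for comparison
    return (f(z + h) - f(z - h)) / (2.0 * h)

z = np.linspace(-10, 10, 201)
max_abs_error = np.max(np.abs(sigmoid_prime(z) - central_difference(sigmoid, z)))
print(max_abs_error)
```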

— M Sef

1

In simple words:

The derivative shows the neuron's ability to learn from a particular input.

For example, if the input is 0, 1 or -2, the derivative (the "learning ability") is high and back-propagation will improve the neuron's weights for this sample dramatically.

On the other hand, if the input is 20, the derivative will be very close to 0. This means that back-propagation on this sample will not "teach" this neuron to produce a better result.

The things above are valid for a single sample.

Let's look at the bigger picture, for all samples in the training set. Here we have several situations:

If the derivative is 0 for all samples in your training set AND the neuron always produces wrong results, the neuron is saturated (dumb) and will not improve.
If the derivative is 0 for all samples in your training set AND the neuron always produces correct results, the neuron has been studying really well and is already as smart as it can be (side note: this case is good, but it may indicate potential overfitting, which is not good).
If the derivative is 0 on some samples and non-0 on others AND the neuron produces mixed results, the neuron is doing some good work and may improve with further training (though not necessarily, as this depends on the other neurons and the training data you have).

So, when you look at the derivative plot, you can see how prepared the neuron is to learn and absorb new knowledge, given a particular input.
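The inputs mentioned above can be plugged into the sigmoid derivative directly. This is a small sketch; the array of inputs simply collects the example values 0, 1, -2 and 20 from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

inputs = np.array([0.0, 1.0, -2.0, 20.0])
slopes = sigmoid_prime(inputs)

for z, slope in zip(inputs, slopes):
    print(f"input {z:6.1f} -> derivative {slope:.3e}")
```

The derivative peaks at 0.25 for an input of 0 and is many orders of magnitude smaller at 20, which is why a sample landing there barely changes the weights.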

— VeganHunter

0

The derivative you see here is important in neural networks. It's the reason why people generally prefer something else, such as the rectified linear unit (ReLU).

Do you see the derivative drop at the two ends? What if your network is on the very left side, but it needs to move to the right side? Imagine you're at -10.0 but you want 10.0. The gradient will be too small for your network to converge quickly. We don't want to wait, we want quicker convergence. ReLU doesn't have this problem for positive inputs, where its gradient is a constant 1.

We call this problem "Neural Network Saturation".

Please see https://www.quora.com/What-is-special-about-rectifier-neural-units-used-in-NN-learning
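To put numbers on the saturation described here, the following sketch compares the gradients of the sigmoid and of ReLU at a few inputs. The helper names are made up for this example, and the ReLU gradient at exactly 0 is taken as 0 by convention.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    # Gradient of max(0, z): 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 1.0, 10.0])
sigmoid_grads = sigmoid_grad(z)
relu_grads = relu_grad(z)

for zi, sg, rg in zip(z, sigmoid_grads, relu_grads):
    print(f"z = {zi:6.1f}  sigmoid grad = {sg:.2e}  relu grad = {rg:.0f}")
```

The sigmoid gradient is tiny at both -10 and 10, while the ReLU gradient stays at 1 for any positive input. For negative inputs the ReLU gradient is 0, so the advantage applies only on the positive side.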

— SmallChess
