Why is tanh almost always better than sigmoid as an activation function?

33

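In Andrew Ng's Neural Networks and Deep Learning course on Coursera, he says that using $\tanh$ is almost always preferable to using $\text{sigmoid}$.

The reason he gives is that the outputs of $\tanh$ center around 0 rather than around 0.5 as with $\text{sigmoid}$, and this "makes learning for the next layer a little bit easier".

Why does centering the activation's output speed up learning? I assume he is referring to the previous layer, since learning happens during backprop?
Are there any other features that make $\tanh$ preferable? Would the steeper gradient delay vanishing gradients?
Are there any situations where $\text{sigmoid}$ would be preferable?

Math-light, intuitive answers are preferred.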

— Tom Hale

13

A sigmoid function is S-shaped (hence the name). Presumably you are talking about the logistic function $\frac{e^x}{1+e^x}$. Apart from scale and location, the two are essentially the same: $\text{logistic}(x)=\frac12 +\frac12\tanh(\frac{x}2)$. So the real choice is whether you want outputs in the interval $(-1,1)$ or in the interval $(0,1)$.
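The identity above can be checked numerically. The following is a minimal Python sketch; the function names `logistic` and `logistic_via_tanh` are illustrative choices and do not come from the thread.

```python
import math

def logistic(x: float) -> float:
    """Logistic function e^x / (1 + e^x), evaluated in an overflow-safe way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def logistic_via_tanh(x: float) -> float:
    """The same function written as a shifted and scaled tanh."""
    return 0.5 + 0.5 * math.tanh(x / 2.0)

xs = [-6.0, -1.5, 0.0, 0.7, 4.0]
# Largest disagreement between the two forms on the sample points.
max_gap = max(abs(logistic(x) - logistic_via_tanh(x)) for x in xs)
print(max_gap)  # only floating-point rounding error is expected
```

The two forms differ only by rounding, which is the sense in which the choice reduces to the output interval.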

— Henry

21

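Yann LeCun and others argue in Efficient BackProp that

"Convergence is usually faster if the average of each input variable over the training set is close to zero. To see this, consider the extreme case where all the inputs are positive. Weights to a particular node in the first weight layer are updated by an amount proportional to $\delta x$, where $\delta$ is the (scalar) error at that node and $x$ is the input vector (see equations (5) and (10)). When all of the components of an input vector are positive, all of the updates of weights that feed into a node will have the same sign (i.e. sign($\delta$)). As a result, these weights can only all decrease or all increase together for a given input pattern. Thus, if a weight vector must change direction, it can only do so by zigzagging, which is inefficient and thus very slow."

This is why you should normalize your inputs so that the average is zero.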
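The sign argument in the quote can be illustrated with a small NumPy sketch. The input values, the two example errors, and the helper name `weight_update_signs` are assumptions made for illustration; only the rule "update proportional to $\delta x$" comes from the quote.

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_update_signs(x, delta):
    """Signs of the updates to one node's incoming weights (update proportional to delta * x)."""
    return np.sign(delta * x)

x_positive = rng.uniform(0.1, 1.0, size=5)   # all-positive inputs
x_centered = x_positive - x_positive.mean()  # same inputs shifted to zero mean

for delta in (0.8, -0.3):                    # two hypothetical node errors
    print("delta =", delta)
    print("  positive inputs:", weight_update_signs(x_positive, delta))
    print("  centered inputs:", weight_update_signs(x_centered, delta))
```

With all-positive inputs every weight moves in the direction given by sign($\delta$), whereas the centered inputs produce updates of mixed sign for a single pattern.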

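The same logic applies to the middle layers:

"This heuristic should be applied at all layers, which means that we want the average of the outputs of a node to be close to zero, because these outputs are the inputs to the next layer."

Postscript: @craq makes the point that this quote doesn't make sense for ReLU(x) = max(0, x), which has become a widely popular activation function. While ReLU does avoid the first zigzag problem mentioned by LeCun, it doesn't address his second point, that it is important to push the average to zero. I would love to know what LeCun has to say about this. In any case, there is a paper called Batch Normalization, which builds on top of LeCun's work and offers a way to address this issue:

"It has long been known (LeCun et al., 1998b; Wiesler & Ney, 2011) that network training converges faster if its inputs are whitened. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer."

By the way, this video by Siraj explains a lot about activation functions in 10 minutes.

@elkout says "The real reason that tanh is preferred compared to sigmoid (...) is that the derivatives of the tanh are larger than the derivatives of the sigmoid."

I think this is a non-issue. I have never seen this being a problem in the literature. If it bothers you that one derivative is smaller than another, you can just scale it.

The logistic function has the shape $\sigma(x)=\frac{1}{1+e^{-kx}}$. Usually we use $k=1$, but nothing forbids you from using another value for $k$ to make your derivatives larger, if that was your problem.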
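To see how $k$ changes the derivative, here is a minimal Python sketch. The names `logistic_k` and `logistic_k_slope` and the sample values of $k$ are illustrative assumptions; the slope formula is simply the chain rule applied to the expression above.

```python
import math

def logistic_k(x: float, k: float = 1.0) -> float:
    """Logistic function with steepness k: 1 / (1 + exp(-k * x))."""
    return 1.0 / (1.0 + math.exp(-k * x))

def logistic_k_slope(x: float, k: float = 1.0) -> float:
    """Derivative with respect to x: k * s * (1 - s), where s = logistic_k(x, k)."""
    s = logistic_k(x, k)
    return k * s * (1.0 - s)

for k in (1.0, 2.0, 4.0):
    # The slope peaks at x = 0, where it equals k / 4.
    print(k, logistic_k_slope(0.0, k))
```

Raising $k$ makes the peak slope larger (it equals $k/4$ at $x=0$) while also making the region of non-negligible slope narrower.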

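Nitpick: tanh is also a sigmoid function. Any function with an S shape is a sigmoid. What you are calling sigmoid is the logistic function. The logistic function is more popular for historical reasons: statisticians have used it for a long time. Besides, some feel that it is more biologically plausible.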

— Ricardo Cruz

1

You don't need a citation to show that $\max_x \sigma^\prime(x) < \max_x \tanh^\prime(x)$, just high school calculus.

$\sigma^\prime(x) = \sigma(x) (1 - \sigma(x)) \le 0.25$

We know this is true because $0 < \sigma(x) < 1$, so you just have to maximize a concave quadratic.

$\tanh^\prime(x) = \text{sech}^2(x) = \left(\frac{2}{\exp(x) + \exp(-x)}\right)^2 \le 1.0$

which can be verified by inspection.
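Both bounds can also be checked numerically. The sketch below evaluates the two derivatives on a grid; the grid range and resolution are arbitrary assumptions, chosen only so that the grid contains $x = 0$.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 20001)   # symmetric grid around x = 0

sigma = 1.0 / (1.0 + np.exp(-x))
d_sigma = sigma * (1.0 - sigma)       # derivative of the logistic function
d_tanh = 1.0 / np.cosh(x) ** 2        # sech^2(x), derivative of tanh

# Peak value of each derivative and the grid point where it occurs.
print(d_sigma.max(), x[d_sigma.argmax()])
print(d_tanh.max(), x[d_tanh.argmax()])
```

Both maxima occur at the grid point closest to zero, and they agree with the bounds of 0.25 and 1.0 up to floating-point precision.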

— Sycorax says Reinstate Monica

Apart from that, I said that in most cases the derivatives of tanh are larger than the derivatives of the sigmoid. This happens mostly when we are around 0. You are welcome to have a look at this link and at the clear answers provided in this question, which also state that the derivatives of $\tanh$ are usually larger than the derivatives of the $\text{sigmoid}$.

— ekoulier

hang on... that sounds plausible, but if middle layers should have an average output of zero, how come ReLU works so well? Isn't that a contradiction?

— craq

@ekoulier, the derivative of $\text{tanh}$ being larger than that of $\text{sigmoid}$ is a non-issue. You can just scale it if it bothers you.

— Ricardo Cruz

@craq, good point, I think that's a flaw in LeCun's argument indeed. I have added a link to the batch normalization paper where it discusses more about that issue and how it can be ameliorated. Unfortunately, that paper doesn't compare relu with tanh, it only compares relu with logistic (sigmoid).

— Ricardo Cruz

14

It's not that it is necessarily better than $\text{sigmoid}$. In other words, it's not the center of an activation function that makes it better. The idea behind both functions is the same, and they also share a similar "trend". Needless to say, the $\tanh$ function is a shifted and rescaled version of the $\text{sigmoid}$ function.

The real reason that $\text{tanh}$ is preferred over $\text{sigmoid}$, especially when it comes to big data, where you are usually struggling to find the local (or global) minimum quickly, is that the derivatives of the $\text{tanh}$ are larger than the derivatives of the $\text{sigmoid}$. In other words, you minimize your cost function faster if you use $\text{tanh}$ as an activation function.

But why does the hyperbolic tangent have larger derivatives? Just to give you a very simple intuition you may observe the following graph:

The fact that the range is between -1 and 1, compared to 0 and 1, makes the function more convenient for neural networks. Apart from that, if I use some math, I can prove that:

$\tanh{x} = 2\sigma(2x)-1$

And in general, we may prove that in most cases $\Big|\frac{\partial\tanh (x)}{\partial x}\Big| > \Big|\frac{\partial\sigma (x)}{\partial x}\Big|$.
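For reference, the identity and the derivative comparison can be worked out in a few lines. This is a sketch using only the standard definitions of $\tanh$ and of the logistic function $\sigma(x) = 1/(1+e^{-x})$; the crossover point in the last line is a computed value, not something stated in the thread.

```latex
\begin{align*}
\tanh x &= \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}
         = \frac{1-e^{-2x}}{1+e^{-2x}}
         = \frac{2}{1+e^{-2x}} - 1
         = 2\sigma(2x) - 1, \\
\tanh'(x) &= 4\,\sigma'(2x) = \operatorname{sech}^2(x), \qquad
\sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr) = \frac{1}{4\cosh^{2}(x/2)}, \\
\tanh'(0) &= 1, \qquad \sigma'(0) = \tfrac14, \\
\tanh'(x) = \sigma'(x)
  &\iff \cosh x = 2\cosh\tfrac{x}{2}
  \iff |x| = 2\operatorname{arccosh}\!\Bigl(\tfrac{1+\sqrt3}{2}\Bigr) \approx 1.66 .
\end{align*}
```

So the inequality holds for $|x| \lesssim 1.66$, which covers the region around zero, and reverses in the tails, where both derivatives are already small.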

— ekoulier

So why would Prof. Ng say that it's an advantage to have the output of the function averaging around $0$?

— Tom Hale

2

It's not the fact that the average is around 0 that makes $\tanh$ faster. It's the fact that being around zero means that the range is also greater (compared to being around 0.5 in the case of $\text{sigmoid}$), which leads to larger derivatives, which almost always leads to faster convergence to the minimum. I hope that it is clear now. Ng is right that we prefer the $\tanh$ function because it is centered around 0, but he just didn't provide the complete justification.

— ekoulier

Zero-centering is more important than the $2x$ ratio, because it skews the distribution of activations and that hurts the performance. If you take sigmoid(x) - 0.5 and a $2x$ smaller learning rate, it will learn on par with tanh.

— Maxim

@Maxim Which "it" skews the distribution of activations, zero-centering or $2x$? If zero-centering is a Good Thing, I still don't feel that the "why" of that has been answered.

— Tom Hale

3

Answering the part of the question so far unaddressed:

Andrew Ng says that using the logistic function (commonly known as sigmoid) really only makes sense in the final layer of a binary classification network.

As the output of the network is expected to be between $0$ and $1$, the logistic is a perfect choice, as its range is exactly $(0, 1)$. No scaling and shifting of $\tanh$ is required.
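A minimal Python sketch of this point follows. The pre-activation value `z`, the label, and the helper names are hypothetical; the binary cross-entropy is used here only as a typical loss that needs a probability in $(0, 1)$.

```python
import math

def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(p: float, y: int) -> float:
    """Loss for a predicted probability p in (0, 1) and a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

z = 1.2                                   # hypothetical pre-activation of the output unit
p_logistic = logistic(z)                  # already lies in (0, 1)
p_from_tanh = (math.tanh(z) + 1.0) / 2.0  # tanh must be shifted and scaled first

print(p_logistic, binary_cross_entropy(p_logistic, 1))
print(p_from_tanh, logistic(2.0 * z))     # the rescaled tanh equals logistic(2z)
```

After shifting and scaling, the $\tanh$ output is itself a logistic function of $2z$, so using the logistic directly in the output layer saves a step without changing what the unit can represent.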

— Tom Hale

For the output, the logistic function makes sense if you want to produce probabilities, we can all agree on that. What is being discussed is why tanh is preferred over the logistic function as an activation for the middle layers.

— Ricardo Cruz

How do you know that's what the OP intended? It seems he was asking a general question.

— Tom Hale

2

It all essentially depends on the derivatives of the activation function. The main problem with the sigmoid function is that the maximum value of its derivative is 0.25, which means that the updates to W and b will be small.

The tanh function, on the other hand, has a derivative of up to 1.0, making the updates to W and b much larger.

This makes the tanh function almost always better as an activation function (for hidden layers) rather than the sigmoid function.
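The effect of those peak values over several layers can be sketched as follows. This is a simplified best-case bound that looks only at the activation derivatives and ignores the weights entirely; the dictionary `MAX_SLOPE`, the helper name, and the layer counts are illustrative assumptions.

```python
# Peak derivative of each activation, as stated in the text above.
MAX_SLOPE = {"sigmoid": 0.25, "tanh": 1.0}

def best_case_slope_product(activation: str, n_layers: int) -> float:
    """Upper bound on the product of activation derivatives along n_layers layers."""
    return MAX_SLOPE[activation] ** n_layers

for n in (1, 3, 5, 10):
    print(n,
          best_case_slope_product("sigmoid", n),
          best_case_slope_product("tanh", n))
```

Even in the best case, the sigmoid factor shrinks geometrically with depth, while the tanh factor can stay at 1; actual gradients also depend on the weights and on where the pre-activations fall.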

To prove this myself (at least in a simple case), I coded a simple neural network and used sigmoid, tanh and relu as activation functions, then I plotted how the error value evolved and this is what I got.

The full notebook I wrote is here https://www.kaggle.com/moriano/a-showcase-of-how-relus-can-speed-up-the-learning

If it helps, here are the charts of the derivatives of the tanh function and the sigmoid one (pay attention to the vertical axis!)

— Juan Antonio Gomez Moriano

(-1) Although this is an interesting idea, it doesn't stand on its own. In particular, most optimization methods used for DL/NN are first-order gradient methods, which have a learning rate $\alpha$. If the max derivative with regard to one activation function is too small, one could easily just increase the learning rate.
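For plain gradient descent this compensation is exact when the gradients differ by a single constant factor, as the sketch below shows. The numbers and the helper `sgd_step` are hypothetical, and the sketch does not cover the case where the factor differs from layer to layer.

```python
def sgd_step(w: float, grad: float, lr: float) -> float:
    """One plain gradient-descent update: w - lr * grad."""
    return w - lr * grad

w, grad = 0.5, 0.08   # hypothetical weight and gradient
scale = 4.0           # assumed constant ratio between the two gradients

w_big_grad = sgd_step(w, scale * grad, lr=0.1)   # larger gradient, base learning rate
w_big_lr = sgd_step(w, grad, lr=0.1 * scale)     # smaller gradient, larger learning rate

print(w_big_grad, w_big_lr)
```

The two updates coincide up to rounding, which is the sense in which a uniformly smaller derivative can be offset by a larger $\alpha$.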

— Cliff AB

Don't you run the risk of not having a stable learning curve with a higher learning rate?

— Juan Antonio Gomez Moriano

Well, if the derivatives are more stable, then increasing the learning rate is less likely to destabilize the estimation.

— Cliff AB

That's a fair point, do you have a link where I could learn more of this?

— Juan Antonio Gomez Moriano