Why are second-order derivatives useful in convex optimization?

18

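I guess this is a basic question, and it has to do with the direction of the gradient itself, but I'm looking for examples where second-order methods (e.g. BFGS) are more effective than simple gradient descent.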

optimization

— Bar

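3

Is it too simplistic to observe that "find the vertex of a parabola" is a much better approximation to the "find the minimum" problem than "find the minimum of this linear function" (which, being linear, has no minimum)?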

20

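Here's a common framework for interpreting both gradient descent and Newton's method, which may be a useful way to think about the difference, as a supplement to @Sycorax's answer. (BFGS approximates Newton's method; I won't talk about it specifically here.)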

We're minimizing the function $f$, but we don't know how to do that directly. So instead we take a local approximation at our current point $x$ and minimize that.

Newton's method approximates the function using a second-order Taylor expansion:

$f(y) \approx N_x(y) := f(x) + \nabla f(x)^T (y - x) + \tfrac12 (y - x)^T \, \nabla^2 f(x) \, (y - x) ,$

where $\nabla f(x)$ denotes the gradient of $f$ at the point $x$ and $\nabla^2 f(x)$ the Hessian at $x$. It then steps to $\arg\min_y N_x(y)$ and repeats.

Gradient descent, having only the gradient and not the Hessian, can't just make a first-order approximation and minimize that, since, as @Hurkyl noted, it has no minimum. Instead, we define a step size $t$ and step to $x - t \nabla f(x)$. But note that

$\begin{aligned} x - t \,\nabla f(x) &= \arg\min_y \left[f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2 t} \lVert y - x \rVert^2\right] \\ &= \arg\min_y \left[f(x) + \nabla f(x)^T (y - x) + \tfrac12 (y-x)^T \tfrac{1}{t} I (y - x)\right] . \end{aligned}$

Thus gradient descent minimizes the function

$G_x(y) := f(x) + \nabla f(x)^T (y - x) + \tfrac12 (y-x)^T \tfrac{1}{t} I (y - x).$

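So gradient descent is kind of like using Newton's method, but instead of taking the second-order Taylor expansion, we pretend that the Hessian is $\tfrac1t I$. This $G$ is often a substantially worse approximation to $f$ than $N$, and hence gradient descent often takes much worse steps than Newton's method. This is counterbalanced, of course, by each step of gradient descent being much cheaper to compute than each step of Newton's method. Which is better depends entirely on the nature of the problem, your computational resources, and your accuracy requirements.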
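To make the parallel concrete, both steps follow from setting the gradient of the local model to zero. This is a short worked derivation, assuming $\nabla^2 f(x)$ is positive definite and $t > 0$ so that each model has a unique minimizer:

```latex
\begin{aligned}
\nabla_y N_x(y) &= \nabla f(x) + \nabla^2 f(x)\,(y - x) = 0
  &&\Longrightarrow\quad y = x - \left[\nabla^2 f(x)\right]^{-1} \nabla f(x), \\
\nabla_y G_x(y) &= \nabla f(x) + \tfrac{1}{t}\,(y - x) = 0
  &&\Longrightarrow\quad y = x - t\,\nabla f(x).
\end{aligned}
```

The two updates differ only in the matrix applied to the gradient: the inverse Hessian for Newton's method, and $t I$ for gradient descent.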

Looking for a moment at @Sycorax's example of minimizing a quadratic

$f(x) = \tfrac12 x^T A x + d^T x + c$

it's worth noting that this perspective helps with understanding both methods.

With Newton's method, we'll have $N = f$, so it terminates with the exact answer (up to floating-point accuracy issues) in a single step.

Gradient descent, on the other hand, uses

$G_x(y) = f(x) + (A x + d)^T (y - x) + \tfrac12 (y - x)^T \tfrac1t I (y - x)$

which has the correct gradient at $x$ but replaces the true curvature $A$ with the isotropic $\tfrac1t I$.
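A minimal sketch of this contrast on a small quadratic, in pure Python. The matrix `A`, vector `d`, starting point, and step size `t` are illustrative assumptions (not taken from the answers), and the helper names are hypothetical. `A` is chosen symmetric positive definite with eigenvalues 100 and 1, so the minimizer is $-A^{-1} d = (1, 1)$.

```python
def grad(A, d, x):
    """Gradient of f(x) = 0.5 x^T A x + d^T x + c for symmetric A, i.e. A x + d."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) + d[i] for i in range(n)]

def solve_2x2(A, b):
    """Solve A z = b for a 2x2 matrix using Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def newton_step(A, d, x):
    """Minimize N_x: for a quadratic the Hessian is A, so step by -A^{-1} grad."""
    p = solve_2x2(A, grad(A, d, x))
    return [x[i] - p[i] for i in range(2)]

def gradient_step(A, d, x, t):
    """Minimize G_x: the Hessian is replaced by (1/t) I, so step by -t * grad."""
    g = grad(A, d, x)
    return [x[i] - t * g[i] for i in range(2)]

# Assumed example problem: eigenvalues 100 and 1, minimizer at (1, 1).
A = [[100.0, 0.0], [0.0, 1.0]]
d = [-100.0, -1.0]
x0 = [0.0, 0.0]

x_newton = newton_step(A, d, x0)        # a single Newton step

x_gd = x0
for _ in range(100):                    # 100 gradient steps with t = 1 / lambda_max
    x_gd = gradient_step(A, d, x_gd, t=0.01)

print("Newton after 1 step:", x_newton)
print("Gradient descent after 100 steps:", x_gd)
```

With this step size, gradient descent resolves the high-curvature direction immediately but shrinks the error along the low-curvature direction by only a factor of 0.99 per step, which is the "throwing away curvature" effect described above.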

— Dougal

1

This is similar to @Aksakal's answer, but in more depth.

— Dougal

1

(+1) This is a great addition!

— Sycorax says Reinstate Monica

17

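Essentially, the advantage of a second-derivative method like Newton's method is that it has the quality of quadratic termination. This means that it can minimize a quadratic function in a finite number of steps. A method like gradient descent depends heavily on the learning rate, which can cause the optimization either to converge slowly because it's bouncing around the optimum, or to diverge entirely. Stable learning rates can be found, but finding them involves computing the Hessian. Even with a stable learning rate, you can have problems like oscillation around the optimum, i.e. you won't always take a "direct" or "efficient" path towards the minimum. So it can take many iterations to terminate, even if you're relatively close to it. BFGS and Newton's method can converge more quickly even though the computational effort of each step is greater.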

To your request for examples: suppose you have the objective function

$F(x)=\frac{1}{2}x^TAx+d^Tx+c$

The gradient is

$\nabla F(x)=Ax+d$

and putting it into the steepest descent form with constant learning rate $\alpha$ gives

$x_{k+1}= x_k-\alpha(Ax_k+d) = (I-\alpha A)x_k-\alpha d.$

This will be stable if the magnitudes of the eigenvalues of $I-\alpha A$ are less than 1. We can use this property to show that a stable learning rate satisfies

$\alpha<\frac{2}{\lambda_{max}},$

where $\lambda_{max}$ is the largest eigenvalue of $A$. The steepest descent algorithm's convergence rate is limited by the largest eigenvalue, and the routine will converge most quickly in the direction of its corresponding eigenvector. Likewise, it will converge most slowly in the direction of the eigenvector of the smallest eigenvalue. When there is a large disparity between the largest and smallest eigenvalues of $A$, gradient descent will be slow. Any problem whose $A$ has this property will converge slowly using gradient descent.
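The stability condition can be checked numerically with a small sketch. It assumes $A$ is symmetric positive definite, so the error along each eigenvector is multiplied by $1 - \alpha\lambda$ at every step. The function names, eigenvalues, and tolerance below are hypothetical choices for illustration.

```python
import math

def contraction_factors(eigenvalues, alpha):
    """Per-step error multiplier |1 - alpha * lam| along each eigenvector of A."""
    return [abs(1.0 - alpha * lam) for lam in eigenvalues]

def is_stable(eigenvalues, alpha):
    """The iteration converges only if every eigenvalue of I - alpha*A has magnitude < 1."""
    return max(contraction_factors(eigenvalues, alpha)) < 1.0

def iterations_to_tolerance(eigenvalues, alpha, tol=1e-6):
    """Steps needed to shrink the slowest error component by the factor tol."""
    rho = max(contraction_factors(eigenvalues, alpha))
    if rho >= 1.0:
        return math.inf          # unstable or stalled: never reaches the tolerance
    if rho == 0.0:
        return 1                 # every component is eliminated in one step
    return math.ceil(math.log(tol) / math.log(rho))

ill_conditioned = [100.0, 1.0]   # assumed eigenvalues, lambda_max / lambda_min = 100
well_conditioned = [2.0, 1.0]    # assumed eigenvalues, lambda_max / lambda_min = 2

# The threshold 2 / lambda_max is 0.02 for the ill-conditioned case.
stable_below = is_stable(ill_conditioned, 0.019)
stable_above = is_stable(ill_conditioned, 0.021)

# Use alpha = 1 / lambda_max in both cases and compare iteration counts.
n_ill = iterations_to_tolerance(ill_conditioned, alpha=1.0 / 100.0)
n_well = iterations_to_tolerance(well_conditioned, alpha=1.0 / 2.0)

print(stable_below, stable_above, n_ill, n_well)
```

The learning rate is capped by the largest eigenvalue, while the number of iterations is driven by the smallest one, which is why a large eigenvalue spread hurts gradient descent.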

In the specific context of neural networks, the book Neural Network Design has quite a bit of information on numerical optimization methods. The above discussion is a condensation of section 9-7.

— Sycorax says Reinstate Monica

Great answer! I'm accepting @Dougal 's answer as I think it provides a simpler explanation.

— Bar

6

In convex optimization you approximate the function by a second-degree polynomial. In the one-dimensional case:

$f(x)=c+\beta x + \alpha x^2$

In this case the second derivative is

$\partial^2 f(x)/\partial x^2=2\alpha$

If you know the derivatives, then it's easy to get the next guess for the optimum:

$\text{guess}=-\frac{\beta}{2\alpha}$

The multivariate case is very similar: just use the gradient and the Hessian in place of the first and second derivatives.
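As a rough one-dimensional sketch of this idea: if the first and second derivatives are available at the current point, the vertex of the fitted parabola is $x - f'(x)/f''(x)$, which reduces to $-\beta/(2\alpha)$ for the quadratic above. The test function $e^x - 2x$ (minimum at $\ln 2$) and the helper name `newton_1d` are assumptions chosen for illustration.

```python
import math

def newton_1d(df, d2f, x, steps):
    """Repeatedly jump to the vertex of the local parabola: x - f'(x) / f''(x)."""
    for _ in range(steps):
        x = x - df(x) / d2f(x)
    return x

# Quadratic case: f(x) = c + beta*x + alpha*x^2 is minimized in a single step.
alpha, beta = 3.0, -12.0
x_quad = newton_1d(lambda x: beta + 2 * alpha * x,   # f'(x)
                   lambda x: 2 * alpha,              # f''(x)
                   x=10.0, steps=1)

# Non-quadratic convex case (assumed example): f(x) = exp(x) - 2x, minimum at ln 2.
x_exp = newton_1d(lambda x: math.exp(x) - 2.0,       # f'(x)
                  lambda x: math.exp(x),             # f''(x)
                  x=0.0, steps=6)

print(x_quad, -beta / (2 * alpha))
print(x_exp, math.log(2.0))
```

For the quadratic the first guess is already exact. For the non-quadratic function the parabola is only a local model, so the step is repeated a few times.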

— Aksakal

2

@Dougal already gave a great technical answer.

The no-maths explanation is that while the linear (order 1) approximation provides a “plane” that is tangential to a point on an error surface, the quadratic approximation (order 2) provides a surface that hugs the curvature of the error surface.

The videos on this link do a great job of visualizing this concept. They display order 0, order 1 and order 2 approximations to the function surface, which just intuitively verifies what the other answers present mathematically.

Also, a good blogpost on the topic (applied to neural networks) is here.

— Zhubarb