Approximating the XGBoost loss function via Taylor expansion



As an example, take the objective function of the XGBoost model at iteration $t$:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

where $\ell$ is the loss function, $f_t$ is the $t$-th tree output, and $\Omega$ is the regularization. One of the (many) key steps for fast computation is the approximation:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[\ell\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t),$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss function.
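As a concrete example (standard, but not spelled out in the question): for squared-error loss these derivatives are

$$\ell\left(y_i, \hat{y}_i\right) = \frac{1}{2}\left(y_i - \hat{y}_i\right)^2, \qquad g_i = \hat{y}_i^{(t-1)} - y_i, \qquad h_i = 1,$$

in which case the second-order expansion is not merely an approximation but exact.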

What I am asking for is a convincing argument as to why the above approximation works:

1) How does XGBoost with the above approximation compare to XGBoost with the full objective function? What potentially interesting higher-order behavior is lost in the approximation?

2) It is a bit hard to visualize (it depends on the loss function), but the approximation can fail when the loss function has a large cubic component. Why does this not cause problems for XGBoost?

Answer:



This is a very interesting question. To fully understand what is going on, I had to go over what XGBoost is trying to do, and what other methods we had in our toolbox to deal with it. My answer covers the traditional methods, and how/why XGBoost is an improvement. If you want only the bullet points, there is a summary at the end.

Traditional Gradient Boosting

Consider the traditional gradient boosting algorithm (Wikipedia); a minimal code sketch follows the list below:

  • Compute the base model $H_0$
  • For $m = 1 : M$
    • Compute the pseudo-residuals $r_{im} = -\frac{\partial \ell(y_i, H_{m-1}(x_i))}{\partial H_{m-1}(x_i)}$
    • Fit a base learner $h_m(x)$ to the pseudo-residuals
    • Compute the multiplier $\gamma$ that minimizes the cost, $\gamma = \arg\min_{\gamma} \sum_{i=1}^{N} \ell\left(y_i, H_{m-1}(x_i) + \gamma h_m(x_i)\right)$ (using line search)
    • Update the model $H_m(x) = H_{m-1}(x) + \gamma h_m(x)$
  • Output the boosted model $H_M(x)$.
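As a concrete illustration, here is a minimal sketch of this loop, assuming squared-error loss (so the pseudo-residuals reduce to $y_i - H_{m-1}(x_i)$) and scikit-learn's DecisionTreeRegressor as the weak learner; names and defaults are illustrative only:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, M=100, max_depth=3):
        H = np.full(len(y), y.mean())   # base model H_0: a constant prediction
        trees, gammas = [], []
        for m in range(M):
            residuals = y - H           # pseudo-residuals: for MSE, -dL/dH = y - H
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
            h = tree.predict(X)
            # line search for the multiplier gamma minimizing the total cost
            gamma = minimize_scalar(lambda g: np.sum((y - (H + g * h)) ** 2)).x
            H = H + gamma * h           # H_m = H_{m-1} + gamma * h_m
            trees.append(tree)
            gammas.append(gamma)
        return trees, gammas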

The function approximation is important for the following part:

Fit a base learner $h_m(x)$ to the pseudo-residuals.

Imagine a case where you would naively construct a gradient boosting algorithm. You build the algorithm above using existing regression trees as weak learners. Let's say you cannot tweak the existing implementation of the weak learners. In Matlab, the default split criterion is the mean squared error. The same goes for scikit-learn.

You are trying to find the best model $h_m(x)$ that minimizes the cost $\ell(y_i, H_{m-1}(x_i) + h_m(x_i))$. But to do so, you are fitting a simple regression model to the residuals using the MSE as the objective function. Notice that you are not directly minimizing what you want, but using the residuals and the MSE as a proxy to do so. The bad part is that it does not necessarily yield the optimal solution. The good part is that it works.

Traditional Gradient Descent

This is analogous to traditional Gradient Descent (Wikipedia), where you are trying to minimize a cost function $f(x)$ by following the (negative of the) gradient of the function, $\nabla f(x)$, at every step.

$$x^{(i+1)} = x^{(i)} - \nabla f(x^{(i)})$$

It does not allow you to find the exact minimum after one step, but each step gets you closer to the minimum (if the function is convex). This is an approximation, but it works very well and it is the algorithm we traditionally use to do a logistic regression, for example.
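To make the idea concrete, here is a toy illustration (my own example, not from the original answer): plain gradient descent on a simple convex function, where each step follows the negative gradient towards the minimum:

    # f(x) = (x - 3)^2 has its minimum at x = 3; grad f(x) = 2(x - 3)
    def grad_f(x):
        return 2.0 * (x - 3.0)

    x = 0.0
    for _ in range(50):
        x = x - 0.1 * grad_f(x)   # x_{i+1} = x_i - step * grad f(x_i)
    print(x)                      # close to 3.0 after 50 steps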

Interlude

At this point, the thing to understand is that the general gradient boosting algorithm does not compute the cost function for each possible split; it uses the cost function of the regression weak learner to fit the residuals.

What your question seems to imply is that the "true XGBoost" should compute the cost function for each split, and that the "approximate XGBoost" is using a heuristic to approximate it. You can see it that way, but historically, we have had the general gradient boosting algorithm, which does not use information about the cost function, except the derivative at the current point. XGBoost is an extension to Gradient Boosting that tries to be smarter about growing the weak regression trees by using a more accurate approximation than just the gradient.

Other ways to choose the best model $h_m(x)$

If we take a look at AdaBoost as a special case of gradient boosting, it does not select regressors but classifiers as weak learners. If we set $h_m(x) \in \{-1, 1\}$, the way AdaBoost selects the best model is by finding

$$h_m = \arg\max_{h_m} \sum_{i=1}^{N} w_i h_m(x_i)$$

where the $w_i$ are the residuals (source, starts at slide 20). The reasoning behind this objective function is that if $w_i$ and $h_m(x_i)$ go in the same direction/have the same sign, the point is moving in the right direction, and you are trying to maximize the amount of movement in the right direction.

But once again, this is not directly measuring which $h_m$ minimizes $\ell(y_i, H_{m-1}(x_i) + h_m(x_i))$. It is measuring how good the move $h_m$ is, with respect to the overall direction you should go, as measured by the residuals $w_i$, which are themselves an approximation. The residuals tell you in which direction you should be moving by their sign, and roughly by how much by their magnitude, but they do not tell you exactly where you should stop.
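A minimal sketch of this selection rule, with illustrative names (each candidate is assumed to be a callable returning predictions in $\{-1, 1\}$):

    import numpy as np

    # Pick the candidate whose predictions agree most with the residuals,
    # i.e. the one maximizing sum_i w_i * h(x_i).
    def select_weak_learner(candidates, X, w):
        scores = [np.dot(w, h(X)) for h in candidates]
        return candidates[int(np.argmax(scores))]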

Better Gradient Descent

The next three examples are not essential to the explanation and are just here to present some ways to do better than a vanilla gradient descent, to support the idea that what XGBoost does is just another way of improving on gradient descent. In a traditional gradient descent setting, when trying to minimize $f(x)$, it is possible to do better than just following the gradient. Many extensions have been proposed (Wikipedia). Here are some of them, to show that it is possible to do better, given more computation time or more properties of the function $f$.

  • Line search/Backtracking: In Gradient Descent, once the gradient $\nabla f(x^{(i)})$ is computed, the next point should be

    $$x^{(i+1)} = x^{(i)} - \nabla f(x^{(i)})$$

    But the gradient gives only the direction in which one should move, not really by "how much", so another procedure can be used to find the best $c > 0$ such that

    $$x_c^{(i+1)} = x^{(i)} - c \nabla f(x^{(i)})$$

    minimizes the cost function. This is done by evaluating $f(x_c^{(i+1)})$ for some $c$, and since the function $f$ should be convex, it is relatively easy to do through Line Search (Wikipedia) or Backtracking Line Search (Wikipedia). Here, the main cost is evaluating $f(x)$, so this extension works best if $f$ is easy to compute. Note that the general algorithm for gradient boosting uses line search, as shown in the beginning of my answer; a minimal backtracking sketch follows after this list.

  • Fast proximal gradient method: If the function to minimize is strongly convex, and its gradient is smooth (Lipschitz (Wikipedia)), then there are some tricks using those properties that speed up the convergence.

  • Stochastic Gradient Descent and the Momentum method: In Stochastic Gradient Descent, you do not evaluate the gradient on all points, but only on a subset of those points. You take a step, then compute the gradient on another batch, and continue on. Stochastic Gradient Descent may be used because the computation on all points is very expensive, or maybe all those points do not even fit into memory. This allows you to take more steps, more quickly, but less accurately.

    When doing so, the direction of the gradient might change depending on which points are sampled. To counteract this effect, the momentum method keeps a moving average of the direction for each dimension, reducing the variance of each move.
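Here is a minimal backtracking line search sketch for a scalar function, assuming the standard Armijo (sufficient decrease) condition; the constants are conventional choices, not prescribed values:

    # Shrink the step c until f actually decreases enough along -grad f.
    def backtracking_step(f, grad, x, c=1.0, beta=0.5, alpha=1e-4):
        g = grad(x)
        while f(x - c * g) > f(x) - alpha * c * g * g:
            c *= beta                 # shrink the step until the condition holds
        return x - c * g              # the accepted next point

    # On f(x) = (x - 3)^2 starting from x = 0, this accepts c = 0.5 and lands on 3.0.
    x_next = backtracking_step(lambda x: (x - 3.0) ** 2, lambda x: 2.0 * (x - 3.0), 0.0)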

The most relevant extension to gradient descent in our discussion of XGBoost is Newton's method (Wikipedia). Instead of just computing the gradient and following it, it uses the second-order derivative to gather more information about the direction it should go in. With gradient descent, at each iteration we update our point $x^{(i)}$ as follows,

$$x^{(i+1)} = x^{(i)} - \nabla f(x^{(i)})$$

And since the gradient $\nabla f(x^{(i)})$ points in the direction of highest increase of $f$, its negative points in the direction of highest decrease, and we hope that $f(x^{(i+1)}) < f(x^{(i)})$. This might not hold, as we might go too far in the direction of the gradient (hence the line search extension), but it is a good approximation. In Newton's method, we update $x^{(i)}$ as follows,

$$x^{(i+1)} = x^{(i)} - \frac{\nabla f(x^{(i)})}{\mathbf{Hess}\, f(x^{(i)})}$$

where $\mathbf{Hess}\, f(x)$ is the Hessian of $f$ at $x$. This update takes second-order information into account, so the direction is no longer the direction of highest decrease, but should point more precisely towards the $x^{(i+1)}$ such that $\nabla f(x^{(i+1)}) = 0$ (or the point where $f$ is minimal, if there is no zero). If $f$ is a second-order polynomial, then Newton's method coupled with a line search should be able to find the minimum in one step.

Newton's method contrasts with Stochastic Gradient Descent. In Stochastic Gradient Descent, we use fewer points to take less time computing the direction we should go towards, in order to make more steps, in the hope we get there quicker. In Newton's method, we take more time to compute the direction we want to go in, in the hope we have to take fewer steps to get there.
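A one-dimensional sketch of the contrast (my own toy example): on a quadratic, the Newton update lands on the minimum in a single step, where the plain gradient descent above needed many:

    # Newton step: scale the gradient by the inverse second derivative.
    def newton_step(x, grad, hess):
        return x - grad(x) / hess(x)

    # f(x) = (x - 3)^2: grad f = 2(x - 3), hess f = 2 everywhere
    x = newton_step(0.0, lambda x: 2.0 * (x - 3.0), lambda x: 2.0)
    print(x)   # 3.0, the exact minimum, in one step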

Now, the reason why Newton's method works is the same as why the XGBoost approximation works, and it relies on Taylor's expansion (Wikipedia) and Taylor's theorem (Wikipedia). The Taylor expansion (or Taylor series) of a function $f$ at a point, $f(x+a)$, is

$$f(x) + \frac{\partial f(x)}{\partial x} a + \frac{1}{2} \frac{\partial^2 f(x)}{\partial x^2} a^2 + \dots = \sum_{n=0}^{\infty} \frac{1}{n!} \frac{\partial^n f(x)}{\partial x^n} a^n.$$

Note the similarity between this expression and the approximation XGBoost is using. Taylor's theorem states that if you stop the expansion at order $k$, then the error, i.e. the difference between $f(x+a)$ and $\sum_{n=0}^{k} \frac{1}{n!} \frac{\partial^n f(x)}{\partial x^n} a^n$, is at most $h_k(x) a^k$, where $h_k$ is a function with the nice property that it goes to zero as $a$ goes to zero.

If you want some visualisation of how well it approximates some functions, take a look at the Wikipedia pages; they have some graphs for the approximation of non-polynomial functions such as $e^x$ and $\log(x)$.
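A quick numerical check of this behaviour (my own example), comparing the second-order Taylor approximation of $e^x$ around $0$ with the true value: the approximation is excellent for small offsets and degrades as the offset grows:

    import math

    # exp(a) ~ 1 + a + a^2/2 (second-order Taylor expansion around 0)
    def taylor2_exp(a):
        return 1.0 + a + a * a / 2.0

    for a in (0.1, 0.5, 1.0, 2.0):
        print(a, math.exp(a), taylor2_exp(a))
    # a = 0.1: 1.10517 vs 1.105 (tiny error); a = 2.0: 7.389 vs 5.0 (large error)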

The thing to note is that the approximation works very well if you want to compute the value of $f$ in the neighbourhood of $x$, that is, for very small changes $a$. This is what we want to do in boosting. Of course we would like to find the tree that makes the biggest change, but if the weak learners we build are very good and want to make a very big change, we can arbitrarily hinder them by applying only 0.1 or 0.01 of their effect. This is the step size, or the learning rate, of the gradient descent. This is acceptable, because if our weak learners are getting very good solutions, it means that either the problem is easy, in which case we will end up with a good solution anyway, or we are overfitting, so going a little or very much in this bad direction does not change the underlying problem.

So what is XGBoost doing, and why does it work?

XGBoost is a gradient boosting algorithm that builds regression trees as weak learners. The traditional gradient boosting algorithm is very similar to a gradient descent with a line search, where the direction in which to go is drawn from the available weak learners. A naïve implementation of gradient boosting would use the cost function of the weak learner to fit it to the residuals. This is a proxy for minimizing the cost of the new model, which is expensive to compute. What XGBoost does is build a custom cost function to fit the trees, using the Taylor series of order two as an approximation for the true cost function, such that it can be more sure that the tree it picks is a good one. In this respect, and as a simplification, XGBoost is to gradient boosting what Newton's method is to gradient descent.

Why did they build it that way

Your question as to why this approximation is used comes down to a cost/performance tradeoff. This cost function is used to compare potential splits for the regression trees, so if our points have, say, 50 features with an average of 10 different values, each node has 500 potential splits, hence 500 evaluations of the function. If you drop in a continuous feature, the number of splits explodes, and the evaluation of splits is called more and more (XGBoost has another trick to deal with continuous features, but that is out of scope here). As the algorithm will spend most of its time evaluating splits, the way to speed up the algorithm is to speed up tree evaluation.

If you evaluated the tree with the full cost function $\ell$, it would be a new computation for every new split. To optimize the computation of the cost function you would need information about it, but the whole point of gradient boosting is that it should work for every cost function.

The second-order approximation is computationally nice, because most terms are the same within a given iteration: most of the expression can be computed once, and reused as a constant for all splits:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \underbrace{\ell\left(y_i, \hat{y}_i^{(t-1)}\right)}_{\text{constant}} + \underbrace{g_i}_{\text{constant}} f_t(x_i) + \frac{1}{2} \underbrace{h_i}_{\text{constant}} f_t^2(x_i) + \Omega(f_t),$$

So the only things you have to compute are $f_t(x_i)$ and $\Omega(f_t)$; what is left is mostly additions and some multiplications. Moreover, if you take a look at the XGBoost paper (arXiv), you will see that they use the fact that they are building a tree to simplify the expression further, down to a bunch of summations over indices, which is very, very quick.
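A simplified sketch of that precomputation, following the leaf-score formula $G^2/(H+\lambda)$ from the XGBoost paper (the penalty on the number of leaves is omitted for brevity, and the loss is assumed to be squared error):

    import numpy as np

    # g_i and h_i are computed once per iteration...
    def grad_hess_squared_error(y, y_pred):
        return y_pred - y, np.ones_like(y)   # g_i, h_i for 0.5 * (y - y_pred)^2

    # ...then every candidate split is scored with simple sums over g and h.
    def split_gain(g, h, left_mask, lam=1.0):
        def leaf_score(gs, hs):
            return gs.sum() ** 2 / (hs.sum() + lam)
        right = ~left_mask
        return 0.5 * (leaf_score(g[left_mask], h[left_mask])
                      + leaf_score(g[right], h[right])
                      - leaf_score(g, h))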

Summary

You can see XGBoost (with the approximation) as a regression from the exact solution, an approximation of the "true XGBoost" with exact evaluation. But since the exact evaluation is so costly, another way to see it is that on huge datasets the approximation is all we can realistically do, and this approximation is more accurate than the first-order approximation a "naïve" gradient boosting algorithm would make.

The approximation in use is similar to Newton's method, and is justified by the Taylor series (Wikipedia) and Taylor's theorem (Wikipedia).

Higher order information is indeed not completely used, but it is not necessary, because we want a good approximation in the neighbourhood of our starting point.

For visualisation, check the Wikipedia pages on the Taylor series and Taylor's theorem, Khan Academy on Taylor series approximation, or the MathDemo page on polynomial approximation of non-polynomials.


+1. I must confess that I haven't read this answer (yet?) and cannot judge on it anyway because it's outside of my expertise, but it looks so impressive that I am happy to upvote. Well done [it seems]!
amoeba says Reinstate Monica

That was an excellent answer. I have one question though. The gradient boosting algorithm fits a regression tree to the negative gradient with the MSE as the split criterion. How is the tree structure determined in XGBoost?
gnikol

You've nailed the answer, good job!
Marcin Zablocki