Is Gradient Descent possible for kernelized SVMs (if so, why do people use Quadratic Programming)?

21

Why do people use Quadratic Programming techniques (such as SMO) when dealing with kernelized SVMs? What is wrong with Gradient Descent? Is it impossible to use with kernels, or is it just too slow (and why)?

Here is a little more context: trying to understand SVMs a bit better, I used Gradient Descent to train a linear SVM classifier with the following cost function:

$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \mathbf{x}^{(i)} + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$

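I am using the following notation:

$\mathbf{w}$ is the vector of feature weights of the model, and $b$ is its bias parameter.
$\mathbf{x}^{(i)}$ is the feature vector of the $i^\text{th}$ training instance.
$y^{(i)}$ is the target class (-1 or 1) of the $i^\text{th}$ instance.
$m$ is the number of training instances.
$C$ is the regularization hyperparameter.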

I derived a (sub)gradient vector (with regard to $\mathbf{w}$ and $b$) from this equation, and Gradient Descent worked just fine.
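For reference, here is a minimal sketch of what such a subgradient descent could look like for the linear cost function above. It is not the asker's actual code: the function names (`dot`, `hinge_cost`, `train_linear_svm`), the fixed learning rate, the iteration count, and the toy data are all assumptions made for illustration. The subgradient used is $\mathbf{w} - C \sum_{i \in A} y^{(i)} \mathbf{x}^{(i)}$ for $\mathbf{w}$ and $-C \sum_{i \in A} y^{(i)}$ for $b$, where $A$ is the set of instances whose margin is below 1.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * c for a, c in zip(u, v))


def hinge_cost(w, b, X, y, C):
    """J(w, b) = C * sum_i max(0, 1 - y_i * (w.x_i + b)) + 0.5 * w.w"""
    hinge = sum(max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y))
    return C * hinge + 0.5 * dot(w, w)


def train_linear_svm(X, y, C=1.0, lr=0.01, n_iters=2000):
    """Batch subgradient descent on the soft-margin hinge cost (illustrative only)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(n_iters):
        grad_w = list(w)  # gradient of the 0.5 * w.w term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # The hinge term contributes only when the margin is violated.
            if yi * (dot(w, xi) + b) < 1.0:
                for j in range(n_features):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy, linearly separable data (made up for this example).
X = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.5], [-2.0, -2.0], [-3.0, -1.0], [-1.5, -3.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [1 if dot(w, xi) + b >= 0 else -1 for xi in X]
```

With a fixed step size the iterates hover around the minimum rather than settling on it exactly; a decaying learning rate is the usual remedy.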

Now I would like to tackle non-linear problems. Can I just replace all dot products $\mathbf{u}^t \cdot \mathbf{v}$ with $K(\mathbf{u}, \mathbf{v})$ in the cost function, where $K$ is the kernel function (for example the Gaussian RBF, $K(\mathbf{u}, \mathbf{v}) = e^{-\gamma \|\mathbf{u} - \mathbf{v}\|^2}$), then use calculus to derive the next (sub)gradient vector and go ahead with Gradient Descent?

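If it is too slow, why is that? Is the cost function not convex? Or is it because the gradient changes too fast (it is not Lipschitz continuous), so the algorithm keeps jumping across valleys during the descent and converges very slowly? But even then, how can it be worse than Quadratic Programming's time complexity, which is $O({n_\text{samples}}^2 \times n_\text{features})$? If it is a matter of local minima, can't Stochastic GD with simulated annealing overcome them?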

svm kernel-trick gradient-descent

— MiniQuark

6

Set $\mathbf w = \phi(\mathbf x)\cdot \mathbf u$ so that $\mathbf w^t \phi(\mathbf x)=\mathbf u^t \cdot \mathbf K$ and $\mathbf w^t\mathbf w = \mathbf u^t\mathbf K\mathbf u$, with $\mathbf K = \phi(\mathbf x)^t\phi(\mathbf x)$, where $\phi(x)$ is a mapping of the original input matrix $\mathbf x$. This allows one to solve the SVM through the primal formulation. Using your notation for the loss:

$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{u}^t \cdot \mathbf{K}^{(i)} + b)\right)} + \dfrac{1}{2} \mathbf{u}^t \cdot \mathbf{K} \cdot \mathbf{u}$

$\mathbf{K}$ is an $m \times m$ matrix, and $\mathbf{u}$ is an $m \times 1$ matrix. Neither is infinite.
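The reparameterized loss above can be minimized with the same kind of subgradient descent used for the linear case, now over the $m$ sample weights $\mathbf{u}$. The following is only a sketch under stated assumptions: the names `rbf_kernel`, `gram_matrix`, `kernel_cost` and `train_kernel_svm_primal` are hypothetical, the RBF kernel with $\gamma = 1$ and the XOR-style toy data are chosen for illustration, and the step size is fixed. The subgradient with respect to $\mathbf{u}$ is $\mathbf{K}\mathbf{u} - C \sum_{i \in A} y^{(i)} \mathbf{K}^{(i)}$, where $A$ is the set of instances with margin below 1.

```python
import math


def rbf_kernel(a, c, gamma):
    """Gaussian RBF kernel K(a, c) = exp(-gamma * ||a - c||^2)."""
    sq_dist = sum((ai - ci) ** 2 for ai, ci in zip(a, c))
    return math.exp(-gamma * sq_dist)


def gram_matrix(X, gamma):
    """m x m matrix K with K[i][j] = K(x_i, x_j)."""
    return [[rbf_kernel(xi, xj, gamma) for xj in X] for xi in X]


def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def kernel_cost(u, b, K, y, C):
    """J(u, b) = C * sum_i max(0, 1 - y_i * (u.K_i + b)) + 0.5 * u^T K u"""
    Ku = [dot(row, u) for row in K]
    hinge = sum(max(0.0, 1.0 - yi * (ku_i + b)) for ku_i, yi in zip(Ku, y))
    return C * hinge + 0.5 * dot(u, Ku)


def train_kernel_svm_primal(K, y, C=1.0, lr=0.05, n_iters=1000):
    """Batch subgradient descent on the sample weights u (illustrative only)."""
    m = len(y)
    u = [0.0] * m
    b = 0.0
    for _ in range(n_iters):
        Ku = [dot(row, u) for row in K]
        grad_u = list(Ku)  # gradient of 0.5 * u^T K u is K u, since K is symmetric
        grad_b = 0.0
        for i in range(m):
            if y[i] * (Ku[i] + b) < 1.0:  # hinge term active for instance i
                for j in range(m):
                    grad_u[j] -= C * y[i] * K[i][j]
                grad_b -= C * y[i]
        u = [uj - lr * gj for uj, gj in zip(u, grad_u)]
        b -= lr * grad_b
    return u, b


# XOR-style data: not linearly separable in the input space.
X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
y = [-1, -1, 1, 1]
gamma = 1.0

K = gram_matrix(X, gamma)
u, b = train_kernel_svm_primal(K, y)


def decision_function(x_new):
    """f(x) = sum_i u_i * K(x_i, x) + b"""
    return sum(ui * rbf_kernel(xi, x_new, gamma) for ui, xi in zip(u, X)) + b


predictions = [1 if decision_function(xi) >= 0 else -1 for xi in X]
```

Each iteration costs $O(m^2)$ kernel-matrix work here, and the full $m \times m$ matrix must be held in memory, which is one practical reason this approach gets expensive on large training sets.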

Indeed, the dual is usually faster to solve, but the primal has its advantages as well, such as approximate solutions (which are not guaranteed in the dual formulation).

Now, why the dual is so much more prominent is not obvious at all: [1]

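The historical reasons for which most of the research in the last decade has been about dual optimization are unclear. We believe that it is because SVMs were first introduced in their hard margin formulation [Boser et al., 1992], for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account [Chapelle et al., 2000].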

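Chapelle (2007) argues that the time complexity of both primal and dual optimization is $\mathcal{O}\left(nn_{sv} + n_{sv}^3\right)$, the worst case being $\mathcal{O}\left(n^3\right)$, but that analysis covers quadratic and approximate hinge losses rather than the exact hinge loss, which is not differentiable and therefore cannot be used with Newton's method.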

[1] Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19(5), 1155-1178.

— Firebug

1

+1 Could you expand on the time complexity as well?

— seanv507

@seanv507 Thanks. Indeed, I should have addressed that, and I will be updating this answer soon.

— Firebug

4

If we apply a transformation $\phi$ to all input feature vectors ($\mathbf{x}^{(i)}$), we get the following cost function:

$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$

The kernel trick replaces $\phi(\mathbf{u})^t \cdot \phi(\mathbf{v})$ by $K(\mathbf{u}, \mathbf{v})$ . Since the weight vector $\mathbf{w}$ is not transformed, the kernel trick cannot be applied to the cost function above.

The cost function above corresponds to the primal form of the SVM objective:

$\underset{\mathbf{w}, b, \mathbf{\zeta}}\min{C \sum\limits_{i=1}^m{\zeta^{(i)}} + \dfrac{1}{2}\mathbf{w}^t \cdot \mathbf{w}}$

subject to $y^{(i)}(\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b) \ge 1 - \zeta^{(i)}$ and $\zeta^{(i)} \ge 0$ for $i=1, \cdots, m$

The dual form is:

$\underset{\mathbf{\alpha}}\min{\dfrac{1}{2}\mathbf{\alpha}^t \cdot \mathbf{Q} \cdot \mathbf{\alpha} - \mathbf{1}^t \cdot \mathbf{\alpha}}$

subject to $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$ and $0 \le \alpha_i \le C$ for $i = 1, 2, \cdots, m$

where $\mathbf{1}$ is a vector full of 1s and $\mathbf{Q}$ is an $m \times m$ matrix with elements $Q_{ij} = y^{(i)} y^{(j)} \phi(\mathbf{x}^{(i)})^t \cdot \phi(\mathbf{x}^{(j)})$ .

Now we can use the kernel trick by computing $Q_{ij}$ like so:

$Q_{ij} = y^{(i)} y^{(j)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$
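As a small illustration of the formula above, the matrix $\mathbf{Q}$ can be built from kernel evaluations alone, without ever computing $\phi$. This is a sketch, assuming a Gaussian RBF kernel with $\gamma = 1$ and made-up XOR-style data; `rbf_kernel` and `build_Q` are hypothetical helper names, not part of any library.

```python
import math


def rbf_kernel(a, c, gamma):
    """Gaussian RBF kernel K(a, c) = exp(-gamma * ||a - c||^2)."""
    sq_dist = sum((ai - ci) ** 2 for ai, ci in zip(a, c))
    return math.exp(-gamma * sq_dist)


def build_Q(X, y, kernel):
    """Q[i][j] = y_i * y_j * K(x_i, x_j), using only kernel evaluations."""
    m = len(X)
    return [[y[i] * y[j] * kernel(X[i], X[j]) for j in range(m)] for i in range(m)]


X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
y = [-1, -1, 1, 1]
gamma = 1.0

Q = build_Q(X, y, lambda a, c: rbf_kernel(a, c, gamma))
```

For this data the diagonal entries are 1 (since $K(\mathbf{x}, \mathbf{x}) = 1$ for the RBF kernel and $(y^{(i)})^2 = 1$), same-class pairs give positive entries, and opposite-class pairs give negative ones.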

So the kernel trick can only be used on the dual form of the SVM problem (plus some other algorithms such as logistic regression).

Now you can use off-the-shelf Quadratic Programming libraries to solve this problem, or use Lagrangian multipliers to get an unconstrained function (the dual cost function), then search for a minimum using Gradient Descent or any other optimization technique. One of the most efficient approaches seems to be the SMO algorithm implemented by the libsvm library (for kernelized SVM).
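To give a feel for a gradient-based attack on the dual, here is a rough sketch of projected gradient descent. It makes one loud simplifying assumption that is not in the answer above: the bias term $b$ is dropped, which removes the equality constraint $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$ and leaves only the box constraints $0 \le \alpha_i \le C$, so the projection is a simple clip. This is neither SMO nor what libsvm does. The names `build_Q`, `dual_objective` and `solve_dual_projected_gd`, the step size $1 / \operatorname{trace}(\mathbf{Q})$, and the toy data are all illustrative choices.

```python
import math


def rbf_kernel(a, c, gamma):
    """Gaussian RBF kernel K(a, c) = exp(-gamma * ||a - c||^2)."""
    sq_dist = sum((ai - ci) ** 2 for ai, ci in zip(a, c))
    return math.exp(-gamma * sq_dist)


def build_Q(X, y, gamma):
    m = len(X)
    return [[y[i] * y[j] * rbf_kernel(X[i], X[j], gamma) for j in range(m)]
            for i in range(m)]


def dual_objective(alpha, Q):
    """0.5 * alpha^T Q alpha - 1^T alpha"""
    Qa = [sum(qij * aj for qij, aj in zip(row, alpha)) for row in Q]
    return 0.5 * sum(ai * qai for ai, qai in zip(alpha, Qa)) - sum(alpha)


def solve_dual_projected_gd(Q, C, n_iters=500):
    """Projected gradient descent on the bias-free dual (box constraints only)."""
    m = len(Q)
    # Q is positive semidefinite, so trace(Q) >= largest eigenvalue,
    # which makes 1 / trace(Q) a conservative step size.
    lr = 1.0 / sum(Q[i][i] for i in range(m))
    alpha = [0.0] * m
    for _ in range(n_iters):
        grad = [sum(qij * aj for qij, aj in zip(row, alpha)) - 1.0 for row in Q]
        # Gradient step followed by projection onto the box [0, C]^m.
        alpha = [min(C, max(0.0, ai - lr * gi)) for ai, gi in zip(alpha, grad)]
    return alpha


X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
y = [-1, -1, 1, 1]
gamma, C = 1.0, 10.0

Q = build_Q(X, y, gamma)
alpha = solve_dual_projected_gd(Q, C)


def decision_function(x_new):
    """Bias-free decision function f(x) = sum_i alpha_i * y_i * K(x_i, x)."""
    return sum(ai * yi * rbf_kernel(xi, x_new, gamma)
               for ai, yi, xi in zip(alpha, y, X))


predictions = [1 if decision_function(xi) >= 0 else -1 for xi in X]
```

Keeping the bias would require projecting onto the intersection of the box and the hyperplane $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$ at every step. SMO handles that equality constraint differently, by updating two multipliers at a time so that it stays satisfied.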

— MiniQuark

1

I'm not sure why you marked your answer Community Wiki. This seems like a perfectly valid answer to your question.

— Sycorax says Reinstate Monica

Thanks @GeneralAbrial. I marked my answer as Community Wiki to avoid any suspicion that I knew the answer before asking the question.

— MiniQuark

1

You should always do what you think is right, but it's perfectly kosher to ask and answer your own question.

— Sycorax says Reinstate Monica

Wait, couldn't you transform the weight vector to $\mathbf w = \phi(x)\cdot \mathbf u$ so that $\mathbf w^t \phi(x)=\mathbf u^t \cdot \mathbf K$ and $\mathbf w^t\mathbf w = \mathbf u^t\mathbf K\mathbf u$, with $\mathbf K = \phi^t\phi$, and then optimize the sample weights $\mathbf u$?

— Firebug

2

I might be wrong, but I don't see how we can replace the dot products with kernels without turning it into the dual problem.

The kernels map the input implicitly to some feature space where $x$ becomes $\phi(x)$. The loss function then becomes
$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$
If a Gaussian kernel is applied, $\phi(\mathbf{x}^{(i)})$ will have infinite dimensions, and so will $\mathbf{w}$.

It seems difficult to optimize a vector of infinite dimensions using gradient descent directly.

Update
Firebug's answer gives a way of replacing the dot products with kernels in the primal formulation.

— dontloo