15

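I am trying to understand the intuition behind kernel SVMs. I understand how a linear SVM works: it finds the decision line that separates the data with the largest margin. I also understand the principle of mapping the data into a higher-dimensional space, and how this can make it easier to find a linear decision line in the new space. What I do not understand is how a kernel is used to project the data points into this new space.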

What I know about a kernel is that it effectively represents the "similarity" between two data points. But how does this relate to the projection?

machine-learning svm kernel-trick

— Karnivaurus

3

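In a space of sufficiently high dimension, all the training data points can be perfectly separated by a plane. That does not mean the result has any predictive power. I think going to a very high-dimensional space is the moral equivalent (of a sort) of overfitting.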

— Mark L. Stone

@Mark L. Stone: you are right (+1), but it may still be a good question to ask how a kernel can map into an infinite-dimensional space. How does that work? I gave it a try, see my answer.

I would be careful about calling the feature mapping a "projection". The feature mapping is generally a nonlinear transformation.

— Paul

A very useful post on the kernel trick visualizes the kernel's inner product space and explains how high-dimensional feature vectors are used to achieve it: eric-kim.net/eric-kim-net/posts/1/kernel_trick.html

— JStrahl

6

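Let $h(x)$ be the projection to a high-dimensional space $\mathcal{F}$. Basically, the kernel function $K(x_1,x_2)=\langle h(x_1),h(x_2)\rangle$ is the inner product in that space. So it is not used to project the data points; rather, it is an outcome of the projection. It can be regarded as a measure of similarity, but in an SVM it is more than that.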

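The optimization for finding the best separating hyperplane in $\mathcal{F}$ involves $h(x)$ only through inner products. That is, if you know $K(\cdot,\cdot)$, you do not need to know the exact form of $h(x)$, which makes the optimization easier.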

Each kernel $K(\cdot,\cdot)$ has a corresponding $h(x)$ as well. So if you use an SVM with that kernel, you are implicitly finding a linear decision line in the space that $h(x)$ maps into.
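As a small illustration of a kernel being the inner product of mapped points, here is a sketch for one kernel whose map can be written down: the homogeneous degree-2 polynomial kernel on $\mathbb{R}^2$. The kernel choice, the function names and the sample points are assumptions made for this example, not something taken from the answer.

```python
import math

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel K(x, z) = (x . z)^2 on R^2."""
    dot = x[0] * z[0] + x[1] * z[1]
    return dot ** 2

def h(x):
    """One explicit feature map for this kernel, from R^2 into R^3."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def inner(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

x1, x2 = (1.0, 2.0), (3.0, -1.0)
k_direct = poly2_kernel(x1, x2)   # computed entirely in the 2-D input space
k_mapped = inner(h(x1), h(x2))    # computed after mapping both points to 3-D
print(k_direct, k_mapped)         # the two values agree up to rounding
```

The kernel evaluation never builds the 3-D vectors, which is the whole point: the optimizer only ever needs `k_direct`.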

Chapter 12 of The Elements of Statistical Learning gives a brief introduction to SVMs, with more detail on the connection between kernels and feature mappings: http://statweb.stanford.edu/~tibs/ElemStatLearn/

— Lii

Does this mean that for a kernel $K(x,y)$ there is a unique underlying $h(x)$?

2

@fcoppens No; as a trivial example, consider $h$ and $-h$. However, there is a unique reproducing kernel Hilbert space corresponding to that kernel.

— Dougal

@Dougal: Then I can agree with you, but the answer above said "the corresponding $h$", so I wanted to be sure. I know about the RKHS, but do you think it is possible to explain "in an intuitive way" what this transformation $h$ looks like for a kernel $K(x,y)$?

@fcoppens In general, no; finding explicit representations of these maps is hard. But for certain kernels it is either not too hard or has been done before.

— Dougal

1

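@fcoppens You are right, $h(x)$ is not unique. You can easily change $h(x)$ while keeping the inner product $\langle h(x), h(x')\rangle$ the same. However, you can regard the $h(x)$ as basis functions, and the space they span (i.e. the RKHS) is unique.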

— Lii
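To make the non-uniqueness discussed in these comments concrete, here is a minimal sketch. The feature map `h`, the rotation angle and the sample points are all hypothetical, chosen only for illustration. Negating `h`, or rotating the feature space, changes the map but leaves every inner product, and hence the kernel, unchanged.

```python
import math

def h(x):
    """A hypothetical 2-D feature map, used only for illustration."""
    return (x, x ** 2)

def neg_h(x):
    """The map -h from the comment above."""
    a, b = h(x)
    return (-a, -b)

def rotated_h(x, theta=0.7):
    """h followed by a rotation of the feature space by the angle theta."""
    a, b = h(x)
    c, s = math.cos(theta), math.sin(theta)
    return (c * a - s * b, s * a + c * b)

def inner(u, v):
    return sum(p * q for p, q in zip(u, v))

x, z = 1.5, -0.4
k = inner(h(x), h(z))
k_neg = inner(neg_h(x), neg_h(z))
k_rot = inner(rotated_h(x), rotated_h(z))
print(k, k_neg, k_rot)   # three different maps, one and the same kernel value
```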

4

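The useful properties of kernel SVMs are not universal; they depend on the choice of kernel. To get intuition, it helps to look at one of the most commonly used kernels, the Gaussian kernel. Remarkably, this kernel turns the SVM into something very much like a k-nearest neighbor classifier.

This answer explains the following:

Why perfect separation of positive and negative training data is always possible with a Gaussian kernel of sufficiently small bandwidth.
How this separation can be interpreted as linear in a feature space.
How the kernel is used to construct the mapping from data space to feature space. Spoiler: the feature space is a very mathematically abstract object, with an unusual abstract inner product based on the kernel.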

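1. Achieving perfect separation

Perfect separation is always possible with a Gaussian kernel because of the kernel's locality, which leads to an arbitrarily flexible decision boundary. For a sufficiently small kernel bandwidth, the decision boundary will look as if you just drew little circles around the points wherever they are needed to separate the positive and negative examples.

(Credit: Andrew Ng's online machine learning course.)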

So why does this happen, from a mathematical perspective?

Consider the standard setup: you have a Gaussian kernel $K(\mathbf{x},\mathbf{z}) = \exp(- ||\mathbf{x}-\mathbf{z}||^2 / \sigma^2)$ and training data $(\mathbf{x}^{(1)},y^{(1)}), (\mathbf{x}^{(2)},y^{(2)}), \ldots, (\mathbf{x}^{(n)},y^{(n)})$, where each $y^{(i)}$ is $\pm 1$. We want to learn a classifier function

$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x})$

Now how will we ever assign the weights $w_i$? Do we need infinite-dimensional spaces and a quadratic programming algorithm? No, because I only want to show that the points can be separated perfectly. So I make $\sigma$ a billion times smaller than the smallest separation $||\mathbf{x}^{(i)} - \mathbf{x}^{(j)}||$ between any two training examples, and I simply set $w_i = 1$. This means that all the training points are a billion sigmas apart as far as the kernel is concerned, and each point completely controls the sign of $\hat{y}$ in its neighborhood. Formally, we have

$\hat{y}(\mathbf{x}^{(k)}) = \sum_{i=1}^n y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = y^{(k)} K(\mathbf{x}^{(k)},\mathbf{x}^{(k)}) + \sum_{i \neq k} y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = y^{(k)} + \epsilon$

where $\epsilon$ is some arbitrarily tiny value. We know $\epsilon$ is tiny because $\mathbf{x}^{(k)}$ is a billion sigmas away from every other point, so for all $i \neq k$ we have

$K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = \exp(- ||\mathbf{x}^{(i)} - \mathbf{x}^{(k)}||^2 / \sigma^2) \approx 0.$

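Since $\epsilon$ is so small, $\hat{y}(\mathbf{x}^{(k)})$ definitely has the same sign as $y^{(k)}$, and the classifier achieves perfect accuracy on the training data. In practice this would be terrible overfitting, but it shows the tremendous flexibility of the Gaussian kernel SVM, and how it can behave very much like a nearest neighbor classifier.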
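The argument above can be checked numerically with a small sketch. The training set below is made up for illustration (an XOR-like layout that no line separates), and the bandwidth is only ten times smaller than the smallest separation rather than a billion times, which is already enough for the cross terms to be negligible.

```python
import math

def gaussian_kernel(x, z, sigma):
    """K(x, z) = exp(-||x - z||^2 / sigma^2), as defined in the answer."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

# Hypothetical training set: labels chosen so that no line separates the classes.
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
y = [1, 1, -1, -1, -1]

# Smallest distance between any two distinct training points.
min_sep = min(math.dist(p, q) for i, p in enumerate(X) for q in X[i + 1:])
sigma = min_sep / 10.0   # much smaller than the smallest separation

def y_hat(x):
    """Classifier with every weight w_i set to 1."""
    return sum(yi * gaussian_kernel(xi, x, sigma) for xi, yi in zip(X, y))

train_scores = [y_hat(xk) for xk in X]
train_correct = all(s * yk > 0 for s, yk in zip(train_scores, y))
print(train_scores, train_correct)
```

Each score is the point's own label plus a negligible $\epsilon$, so every training point is classified correctly without solving any optimization problem.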

2. Kernel SVM learning as linear separation

The fact that this can be interpreted as "perfect linear separation in an infinite-dimensional feature space" comes from the kernel trick, which lets you interpret the kernel as an abstract inner product in a new feature space:

$K(\mathbf{x}^{(i)},\mathbf{x}^{(j)}) = \langle\Phi(\mathbf{x}^{(i)}),\Phi(\mathbf{x}^{(j)})\rangle$

where $\Phi(\mathbf{x})$ is the mapping from the data space into the feature space. It follows immediately that $\hat{y}(\mathbf{x})$ is a linear function in the feature space:

$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} \langle\Phi(\mathbf{x}^{(i)}),\Phi(\mathbf{x})\rangle = L(\Phi(\mathbf{x}))$

where the linear function $L(\mathbf{v})$ is defined on feature space vectors $\mathbf{v}$ as

$L(\mathbf{v}) = \sum_i w_i y^{(i)} \langle\Phi(\mathbf{x}^{(i)}),\mathbf{v}\rangle$

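This function is linear in $\mathbf{v}$ because it is just a linear combination of inner products with fixed vectors. In the feature space, the decision boundary $\hat{y}(\mathbf{x}) = 0$ is just $L(\mathbf{v}) = 0$, the level set of a linear function. This is the very definition of a hyperplane in the feature space.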
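The Gaussian kernel's feature space cannot be written out as a finite vector, so the sketch below swaps in a stand-in kernel, $K(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^2$ on $\mathbb{R}^2$, whose feature map is an explicit 3-D vector. The training points, labels and weights are made-up values, not the output of any SVM solver. It shows the kernel sum collapsing into a single inner product with one fixed feature-space vector, which is what makes $L$ linear.

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel(x, z):
    """Stand-in kernel K(x, z) = (x . z)^2 on R^2."""
    return inner(x, z) ** 2

def phi(x):
    """Explicit feature map for the stand-in kernel, from R^2 into R^3."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

# Hypothetical training points, labels and weights (not fitted by any SVM).
X = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
y = [1, -1, 1]
w = [0.5, 0.25, 1.0]

def y_hat(x):
    """Kernel form: sum_i w_i y_i K(x_i, x)."""
    return sum(wi * yi * kernel(xi, x) for wi, yi, xi in zip(w, y, X))

# Collapse the sum into one fixed vector: normal = sum_i w_i y_i phi(x_i).
normal = [sum(wi * yi * phi(xi)[d] for wi, yi, xi in zip(w, y, X))
          for d in range(3)]

def L(v):
    """Linear function on feature-space vectors: L(v) = <normal, v>."""
    return inner(normal, v)

x_new = (0.3, -1.2)
print(y_hat(x_new), L(phi(x_new)))   # same value, computed two ways
```

The set where `L(v) == 0` is a plane through the origin of the 3-D feature space, even though the corresponding boundary in the original 2-D space is curved.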

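3. How the kernel is used to construct the feature space

Kernel methods never actually "find" or "compute" the feature space or the mapping $\Phi$ explicitly. Kernel learning methods such as the SVM do not need them in order to work; they only need the kernel function $K$. It is possible to write down a formula for $\Phi$, but the feature space it maps to is quite abstract and is really only used for proving theoretical results about SVMs. If you are still interested, here is how it works.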

Basically, we define an abstract vector space $V$ in which each vector is a function from $\mathcal{X}$ to $\mathbb{R}$. A vector $f$ in $V$ is a function formed from a finite linear combination of kernel slices:

$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i K(\mathbf{x}^{(i)},\mathbf{x})$

Here the $\mathbf{x}^{(i)}$ are just an arbitrary set of points and need not be the same as the training set. It is convenient to write $f$ more compactly as

$f = \sum_{i=1}^n \alpha_i K_{\mathbf{x}^{(i)}}$

where $K_\mathbf{x}(\mathbf{y}) = K(\mathbf{x},\mathbf{y})$ is a function giving a "slice" of the kernel at $\mathbf{x}$.

The inner product on the space is not the ordinary dot product, but an abstract inner product based on the kernel:

$\langle \sum_{i=1}^n \alpha_i K_{\mathbf{x}^{(i)}}, \sum_{j=1}^n \beta_j K_{\mathbf{x}^{(j)}} \rangle = \sum_{i,j} \alpha_i \beta_j K(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$

This definition gives $\langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle = K(\mathbf{x},\mathbf{y})$, where $\Phi$ is the mapping $\mathcal{X} \rightarrow V$ that takes each point $\mathbf{x}$ to the kernel slice at that point:

$\Phi(\mathbf{x}) = K_\mathbf{x}, \quad \text{where} \quad K_\mathbf{x}(\mathbf{y}) = K(\mathbf{x},\mathbf{y}).$

You can prove that $V$ is an inner product space when $K$ is a positive definite kernel. See this paper for details.
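For readers who prefer code to notation, the construction can be sketched as follows. The class name `KernelCombo`, the one-dimensional data and the coefficients are assumptions made for this illustration. A vector of $V$ is stored as a list of (coefficient, point) pairs, and the abstract inner product is the double sum over kernel values.

```python
import math

def K(x, y, sigma=1.0):
    """Gaussian kernel on real numbers, K(x, y) = exp(-(x - y)^2 / sigma^2)."""
    return math.exp(-((x - y) ** 2) / sigma ** 2)

class KernelCombo:
    """A vector of V: f = sum_i alpha_i K_{x_i}, stored as (alpha_i, x_i) pairs."""

    def __init__(self, terms):
        self.terms = list(terms)

    def __call__(self, x):
        # Evaluate f(x) = sum_i alpha_i K(x_i, x).
        return sum(a * K(xi, x) for a, xi in self.terms)

    def inner(self, other):
        # <f, g> = sum_{i,j} alpha_i beta_j K(x_i, x_j)
        return sum(a * b * K(xi, xj)
                   for a, xi in self.terms
                   for b, xj in other.terms)

def Phi(x):
    """Feature map: send the point x to its kernel slice K_x."""
    return KernelCombo([(1.0, x)])

f = KernelCombo([(0.5, -1.0), (2.0, 0.3), (-1.0, 1.7)])
x, y = 0.4, 1.1
print(Phi(x).inner(Phi(y)), K(x, y))   # <Phi(x), Phi(y)> equals K(x, y)
print(f.inner(Phi(x)), f(x))           # inner product with a slice evaluates f
```

The second printed pair illustrates why this space is called a reproducing kernel Hilbert space: taking the inner product of `f` with the slice at `x` reproduces the value `f(x)`.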

— Paul

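Nice explanation, but I think you have missed a minus sign in the definition of the Gaussian kernel: $K(\mathbf{x},\mathbf{z}) = \exp(-||\mathbf{x}-\mathbf{z}||^2/\sigma^2)$. As it was written, it does not make sense with the $\epsilon$ found in part (1).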

— hqxortn

1

For the background and the notations I refer to How to calculate decision boundary from support vectors?.

So the features in the 'original' space are the vectors $x_i$, the binary outcome is $y_i \in \{-1, +1\}$, and the Lagrange multipliers are $\alpha_i$.

As said by @Lii (+1), the kernel can be written as $K(x,y)=h(x) \cdot h(y)$ (where '$\cdot$' represents the inner product).

I will try to give some 'intuitive' explanation of what this $h$ looks like, so this answer is no formal proof, it just wants to give some feeling of how I think that this works. Do not hesitate to correct me if I am wrong.

I have to 'transform' my feature space (so my $x_i$ ) into some 'new' feature space in which the linear separation will be solved.

For each observation $x_i$, I define a function $\phi_i(x)=K(x_i,x)$, so I have a function $\phi_i$ for each element of my training sample. These functions $\phi_i$ span a vector space; denote it $V=\mathrm{span}(\phi_i,\ i=1,2,\dots,N)$.

I will try to argue that $V$ is the vector space in which linear separation will be possible. By definition of the span, each vector in $V$ can be written as a linear combination of the $\phi_i$, i.e. $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers.

$N$ is the size of the training sample, and therefore the dimension of the vector space $V$ can go up to $N$, depending on whether the $\phi_i$ are linearly independent. As $\phi_i(x)=K(x_i,x)$ (see above, we defined $\phi_i$ in this way), this means that the dimension of $V$ depends on the kernel used and can go up to the size of the training sample.

The transformation, that maps my original feature space to $V$ is defined as

$\Phi: x_i \to \phi_i(x)=K(x_i, x)$.

This map $\Phi$ maps my original feature space onto a vector space whose dimension can go up to the size of my training sample.

Obviously, this transformation (a) depends on the kernel, (b) depends on the values $x_i$ in the training sample, and (c) can, depending on my kernel, have a dimension that goes up to the size of my training sample; and (d) the vectors of $V$ look like $\sum_{i=1}^N \gamma_i \phi_i$, where the $\gamma_i$ are real numbers.

Looking at the function $f(x)$ in How to calculate decision boundary from support vectors? it can be seen that $f(x)=\sum_i y_i \alpha_i \phi_i(x)+b$ .

In other words, $f(x)$ is a linear combination of the $\phi_i$ (plus the offset $b$), and this is a linear separator in the $V$-space: it is a particular choice of the $\gamma_i$, namely $\gamma_i=\alpha_i y_i$!
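A toy version of this decision function can be sketched as follows. The training rows, the multipliers `alpha`, the offset `b` and the RBF width are all made-up values for illustration; a real SVM would obtain `alpha` and `b` by solving its quadratic program.

```python
import math

def rbf(u, v, gamma=1.0):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical training rows x_i, labels y_i, multipliers alpha_i and offset b.
# The alpha_i and b are made-up values, not the output of a real SVM solver.
X_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y_train = [1, -1, -1, 1]
alpha = [0.8, 0.8, 0.8, 0.8]
b = 0.0

def phi_values(x):
    """The values phi_i(x) = K(x_i, x), one per training row."""
    return [rbf(xi, x) for xi in X_train]

# gamma_i = alpha_i * y_i, the particular coefficients picked out by the SVM.
gamma_coef = [a * yi for a, yi in zip(alpha, y_train)]

def f(x):
    """Decision function f(x) = sum_i gamma_i * phi_i(x) + b."""
    return sum(g * p for g, p in zip(gamma_coef, phi_values(x))) + b

predictions = [1 if f(x) > 0 else -1 for x in X_train]
print(predictions)
```

The labels here form an XOR pattern that no line separates in the original 2-D space, yet `f` is an ordinary weighted sum of the four functions $\phi_i$.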

The $y_i$ are known from our observations, and the $\alpha_i$ are the Lagrange multipliers that the SVM has found. In other words, the SVM finds, through the use of a kernel and by solving a quadratic programming problem, a linear separation in the $V$-space.

This is my intuitive understanding of how the 'kernel trick' allows one to 'implicitly' transform the original feature space into a new feature space $V$ , with a different dimension. This dimension depends on the kernel you use and for the RBF kernel this dimension can go up to the size of the training sample.

So kernels are a technique that allows SVM to transform your feature space , see also What makes the Gaussian kernel so magical for PCA, and also in general?

— Community

"for each element of my training sample" -- is element here referring to a row or column (i.e. feature )

— user1761806

what is x and x_i? If my X is an input of 5 columns, and 100 rows, what would x and x_i be?

— user1761806

@user1761806 an element is a row. The notation is explained in the link at the beginning of the answer

1

Transform predictors (input data) to a high-dimensional feature space. It is sufficient to just specify the kernel for this step and the data is never explicitly transformed to the feature space. This process is commonly known as the kernel trick.

Let me explain it. The kernel trick is the key here. Consider the case of a Radial Basis Function (RBF) Kernel here. It transforms the input to infinite dimensional space. The transformation of input $x$ to $\phi(x)$ can be represented as shown below (taken from http://www.csie.ntu.edu.tw/~cjlin/talks/kuleuven_svm.pdf)

The input space is finite dimensional but the transformed space is infinite dimensional. Transforming the input to an infinite dimensional space is something that happens as a result of the kernel trick. Here $x$ is the input and $\phi(x)$ is the transformed input. But $\phi(x)$ is never computed explicitly; instead the product $\phi(x_i)^T\phi(x)$ is computed, which is just the exponential of the negative, scaled squared distance between $x_i$ and $x$.
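The figure from the slides is not reproduced here, so the following is only a sketch under an assumption: for one-dimensional input, the standard Taylor expansion of $\exp(2\gamma x z)$ gives feature coordinates $\phi_k(x) = e^{-\gamma x^2}\sqrt{(2\gamma)^k/k!}\,x^k$ for $k = 0, 1, 2, \ldots$. The parameter value and the sample points are made up. Truncating this infinite vector shows the inner product converging to the kernel value.

```python
import math

def rbf(x, z, gamma):
    """One-dimensional RBF kernel exp(-gamma * (x - z)^2)."""
    return math.exp(-gamma * (x - z) ** 2)

def phi_truncated(x, gamma, n_terms):
    """First n_terms coordinates of the infinite feature vector phi(x)."""
    scale = math.exp(-gamma * x ** 2)
    return [scale * math.sqrt((2.0 * gamma) ** k / math.factorial(k)) * x ** k
            for k in range(n_terms)]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

gamma, x_i, x = 0.5, 0.8, -0.3
exact = rbf(x_i, x, gamma)
approx = {n: inner(phi_truncated(x_i, gamma, n), phi_truncated(x, gamma, n))
          for n in (1, 2, 4, 8, 16)}
errors = {n: abs(v - exact) for n, v in approx.items()}
print(exact, errors)   # the error shrinks as more coordinates are kept
```

The kernel evaluation `rbf(x_i, x, gamma)` gives the limit of this process in one line, which is why the infinite vector never has to be built.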

There is a related question Feature map for the Gaussian kernel to which there is a nice answer /stats//a/69767/86202.

The output or decision function is a function of the kernel matrix $K(x_i,x)=\phi(x_i)^T\phi(x)$ and not of the input $x$ or transformed input $\phi$ directly.

— prashanth

0

Mapping to a higher dimension is merely a trick to solve a problem that is defined in the original dimension; so concerns such as overfitting your data by going into a dimension with too many degrees of freedom are not a byproduct of the mapping process, but are inherent in your problem definition.

Basically, all that mapping does is convert conditional classification in the original dimension into a plane definition in the higher dimension, and because there is a 1 to 1 relationship between the plane in the higher dimension and your conditions in the lower dimension, you can always move between the two.

Taking the problem of overfitting, clearly, you can overfit any set of observations by defining enough conditions to isolate each observation into its own class, which is equivalent to mapping your data to (n-1)D where n is the number of your observations.

Taking the simplest problem, where your observations are [[1,-1], [0,0], [1,1]] [[feature, value]], by moving into 2D and separating your data with a line, you are simply turning the conditional classification of feature < 1 && feature > -1 : 0 into defining a line that passes through (-1 + epsilon, 1 - epsilon). If you had more data points and needed more conditions, you would just need to add one more degree of freedom to your higher dimension for each new condition that you define.

You can replace the process of mapping to a higher dimension with any process that provides you with a 1 to 1 relationship between the conditions and the degrees of freedom of your new problem. Kernel tricks simply do that.

— Hou

1

As a different example, take the problem where the phenomenon results in observations of the form [x, floor(sin(x))]. Mapping your problem into 2D is not helpful here at all; in fact, mapping to any plane will not be helpful here, because defining the problem as a set of x < a && x > b : z is not helpful in this case. The simplest mapping in this case is mapping into polar coordinates, or into the imaginary plane.

— Hou

Kernel SVM: I want an intuitive understanding of mapping to a higher-dimensional feature space, and how this makes linear separation possible