Motivation of the Expectation Maximization algorithm

20

In the EM algorithm approach we use Jensen's inequality to arrive at

$\log p(x|\theta) \geq \int \log p(z,x|\theta) p(z|x,\theta^{(k)}) dz - \int \log p(z|x,\theta) p(z|x,\theta^{(k)})dz$

and define $\theta^{(k+1)}$ by

$\theta^{(k+1)}=\arg \max_{\theta}\int \log p(z,x|\theta) p(z|x,\theta^{(k)}) dz$

Everything I have read on EM just plops this down, but I have always felt uneasy at not having an explanation of why the EM algorithm arises naturally. I understand that the $\log$ likelihood is typically used so that we deal with addition instead of multiplication, but the appearance of $\log$ in the definition of $\theta^{(k+1)}$ feels unmotivated to me. Why should one consider $\log$ and not other monotonic functions? For various reasons I suspect that the "meaning" or "motivation" behind expectation maximization has some kind of explanation in terms of information theory and sufficient statistics. If there were such an explanation, that would be much more satisfying than just an abstract algorithm.

mixture expectation-maximization

— user

3

What is the expectation maximization algorithm?, Nature Biotechnology 26: 897–899 (2008), has a nice picture that illustrates how the algorithm works.

— chl

@chl: I have seen that article. The point I am making is that nowhere does it explain why a non-log approach can't work.

— user782220

10

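The EM algorithm has different interpretations and can arise in different forms in different applications.

It all starts with the likelihood function $p(x \vert \theta)$, or equivalently the log-likelihood function $\log p(x \vert \theta)$, that we would like to maximize. (We generally use the logarithm because it simplifies the calculation: it is strictly monotone, concave, and $\log(ab) = \log a + \log b$.) In an ideal world, the value of $p$ depends only on the model parameter $\theta$, so we can search through the space of $\theta$ and find one that maximizes $p$.

However, in many interesting real-world applications things are more complicated, because not all the variables are observed. Yes, we might directly observe $x$, but some other variables $z$ are unobserved. Because of the missing variables $z$, we are in a kind of chicken-and-egg situation: without $z$ we cannot estimate the parameter $\theta$, and without $\theta$ we cannot infer what the value of $z$ may be.

This is where the EM algorithm comes into play. We start with an initial guess of the model parameters $\theta$ and derive the expected values of the missing variables $z$ (the E step). Once we have the values of $z$, we can maximize the likelihood with respect to the parameters $\theta$ (the M step, corresponding to the $\arg \max$ equation in the problem statement). With this $\theta$ we can derive new expected values of $z$ (another E step), and so on. In other words, in each step we assume that one of the two, $z$ or $\theta$, is known. We repeat this iterative process until the likelihood cannot be increased any further.

This is the EM algorithm in a nutshell. It is well known that the likelihood never decreases during this iterative EM process. But keep in mind that the EM algorithm does not guarantee a global optimum. That is, it might end up at a local optimum of the likelihood function.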
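
To make the alternation concrete, here is a minimal sketch of EM for a one-dimensional mixture of two Gaussians. The data, the starting values and the helper names (`component_densities`, `em_gmm_1d`) are assumptions made up for illustration, not something taken from the question. The sketch records the log-likelihood after every iteration so the "never decreases" property can be checked numerically.

```python
import numpy as np

def component_densities(x, pi, mu, var):
    # dens[n, k] = pi_k * N(x_n | mu_k, var_k)
    diff = x[:, None] - mu
    return pi * np.exp(-0.5 * diff ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def log_likelihood(x, pi, mu, var):
    # log p(x | theta), summed over all data points
    return np.log(component_densities(x, pi, mu, var).sum(axis=1)).sum()

def em_gmm_1d(x, pi, mu, var, n_iter=50):
    history = [log_likelihood(x, pi, mu, var)]
    for _ in range(n_iter):
        # E step: responsibilities r[n, k] = p(z_n = k | x_n, theta)
        dens = component_densities(x, pi, mu, var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form maximizers of the expected complete-data log-likelihood
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        history.append(log_likelihood(x, pi, mu, var))
    return pi, mu, var, np.array(history)

# Synthetic data (illustrative): two well-separated clusters
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

pi, mu, var, history = em_gmm_1d(
    x,
    pi=np.array([0.5, 0.5]),
    mu=np.array([-1.0, 1.0]),
    var=np.array([1.0, 1.0]),
)
print("means:", mu)
print("log-likelihood non-decreasing:", bool(np.all(np.diff(history) >= -1e-8)))
```

With a different starting point the same loop can settle at a different local optimum, which is the caveat mentioned above.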

The appearance of $\log$ in the equation for $\theta^{(k+1)}$ is inevitable, because the function you would like to maximize here is written as a log-likelihood.

— Weiwei

I don't see how this answers the question.

— broncoAbierto

9

Likelihood vs. log-likelihood

As has already been said, the $\log$ is introduced in maximum likelihood simply because it is generally easier to optimize sums than products. The reason we don't consider other monotonic functions is that the logarithm is the unique function with the property of turning products into sums.

Another way to motivate the logarithm is the following: instead of maximizing the probability of the data under our model, we can equivalently try to minimize the Kullback-Leibler divergence between the data distribution, $p_\text{data}(x)$, and the model distribution, $p(x \mid \theta)$,

$D_\text{KL}[p_\text{data}(x) \mid\mid p(x \mid \theta)] = \int p_\text{data}(x) \log \frac{p_\text{data}(x)}{p(x \mid \theta)} \, dx = const - \int p_\text{data}(x)\log p(x \mid \theta) \, dx.$

The first term on the right-hand side is constant in the parameters. If we have $N$ samples from the data distribution (our data points), we can approximate the second term with the average log-likelihood of the data,

$\int p_\text{data}(x)\log p(x \mid \theta) \, dx \approx \frac{1}{N} \sum_n \log p(x_n \mid \theta).$
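
As a small numerical check of this approximation, the sketch below assumes a toy setting that is not part of the answer: the data distribution is $N(0, 1)$ and the model is $N(\theta, 1)$. In that case the integral has the closed form $-\frac{1}{2}\log(2\pi) - \frac{1}{2}(1 + \theta^2)$, so the sample average can be compared against it directly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)  # samples from the assumed p_data = N(0, 1)

def avg_log_lik(theta, x):
    # (1/N) * sum_n log N(x_n | theta, 1)
    return np.mean(-0.5 * np.log(2.0 * np.pi) - 0.5 * (x - theta) ** 2)

def exact_expected_log_lik(theta):
    # E_{x ~ N(0,1)}[log N(x | theta, 1)] = -0.5*log(2*pi) - 0.5*(1 + theta^2)
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * (1.0 + theta ** 2)

thetas = np.linspace(-2.0, 2.0, 41)
approx = np.array([avg_log_lik(t, x) for t in thetas])
exact = exact_expected_log_lik(thetas)

max_gap = np.max(np.abs(approx - exact))
best_theta = thetas[np.argmax(approx)]
print("largest gap between sample average and integral:", max_gap)
print("grid value maximizing the average log-likelihood:", best_theta)
```

Maximizing the average log-likelihood over the grid picks the $\theta$ closest to the mean of the data distribution, which is also the $\theta$ that minimizes the KL divergence in this toy setting.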

An alternative view of EM

I am not sure this is the kind of explanation you are looking for, but I found the following view of expectation maximization much more enlightening than its motivation via Jensen's inequality (you can find a detailed description in Neal & Hinton (1998) or in Chris Bishop's PRML book, Chapter 9.3).

It is not difficult to show that

$\log p(x \mid \theta) = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz + D_\text{KL}[q(z \mid x) \mid\mid p(z \mid x, \theta)]$

for any $q(z \mid x)$. If we call the first term on the right-hand side $F(q, \theta)$, this implies that

$F(q, \theta) = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz = \log p(x \mid \theta) - D_\text{KL}[q(z \mid x) \mid\mid p(z \mid x, \theta)].$

Because the KL divergence is always non-negative, $F(q, \theta)$ is a lower bound on the log-likelihood for every fixed $q$. Now, EM can be viewed as alternately maximizing $F$ with respect to $q$ and $\theta$. In particular, by setting $q(z \mid x) = p(z \mid x, \theta)$ in the E-step, we minimize the KL divergence on the right-hand side and thus maximize $F$.
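
The decomposition is easy to verify numerically when $z$ takes finitely many values, so the integrals become sums. The sketch below uses made-up numbers for $p(x, z \mid \theta)$ at one fixed $x$ and one fixed $\theta$; the names `free_energy` and `kl` are illustrative choices.

```python
import numpy as np

# Hypothetical joint probabilities p(x, z=k | theta) for one observed x, k = 0, 1, 2
joint = np.array([0.10, 0.05, 0.25])

log_px = np.log(joint.sum())      # log p(x | theta), marginalizing over z
posterior = joint / joint.sum()   # p(z | x, theta)

def free_energy(q):
    # F(q, theta) = sum_z q(z) * log( p(x, z | theta) / q(z) )
    return np.sum(q * (np.log(joint) - np.log(q)))

def kl(q, p):
    # D_KL[q || p] for discrete distributions with full support
    return np.sum(q * (np.log(q) - np.log(p)))

q = np.array([0.6, 0.3, 0.1])     # an arbitrary distribution over z

gap = log_px - (free_energy(q) + kl(q, posterior))
bound_at_posterior = free_energy(posterior)

print("log p(x|theta)        :", log_px)
print("F(q, theta)           :", free_energy(q))
print("decomposition residual:", gap)
print("F at q = posterior    :", bound_at_posterior)
```

For the arbitrary `q` the bound is strict, and choosing `q` equal to the posterior closes the gap, which is exactly what the E-step does.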

— Lucas

Thanks for the post! Though the given document doesn't say logarithm is the unique function turning products into sums. It says logarithm is the only function that fulfills all three listed properties at the same time.

— Weiwei

@Weiwei: Right, but the first condition mainly requires that the function is invertible. Of course, f(x) = 0 also implies f(x + y) = f(x)f(y), but this is an uninteresting case. The third condition asks that the derivative at 1 is 1, which is only true for the logarithm to base $e$. Drop this constraint and you get logarithms to different bases, but still logarithms.

— Lucas

4

The paper that I found clarifying with respect to expectation-maximization is Bayesian K-Means as a "Maximization-Expectation" Algorithm (pdf) by Welling and Kurihara.

Suppose we have a probabilistic model $p(x,z,\theta)$ with $x$ observations, $z$ hidden random variables, and a total of $\theta$ parameters. We are given a dataset $D$ and are forced (by higher powers) to establish $p(z,\theta|D)$ .

1. Gibbs sampling

We can approximate $p(z,\theta|D)$ by sampling. Gibbs sampling gives $p(z,\theta|D)$ by alternating:

$\theta \sim p(\theta|z,D) \\ z \sim p(z|\theta,D)$

2. Variational Bayes

Instead, we can try to establish a distribution $q(\theta)$ and $q(z)$ and minimize the difference with the distribution we are after $p(\theta,z|D)$ . The difference between distributions has a convenient fancy name, the KL-divergence. To minimize $KL[q(\theta)q(z)||p(\theta,z|D)]$ we update:

$q(\theta) \propto \exp (E [\log p(\theta,z,D) ]_{q(z)} ) \\ q(z) \propto \exp (E [\log p(\theta,z,D) ]_{q(\theta)} )$

3. Expectation-Maximization

To come up with full-fledged probability distributions for both $z$ and $\theta$ might be considered extreme. Why don't we instead consider a point estimate for one of these and keep the other nice and nuanced. In EM the parameter $\theta$ is established as the one being unworthy of a full distribution, and set to its MAP (Maximum A Posteriori) value, $\theta^*$ .

$\theta^* = \underset{\theta}{\operatorname{argmax}} E [\log p(\theta,z,D) ]_{q(z)} \\ q(z) = p(z|\theta^*,D)$

Here $\theta^* \in \operatorname{argmax}$ would actually be a better notation: the argmax operator can return multiple values. But let's not nitpick. Compared to variational Bayes you see that correcting for the $\log$ by $\exp$ doesn't change the result, so that is not necessary anymore.

4. Maximization-Expectation

There is no reason to treat $z$ as a spoiled child. We can just as well use point estimates $z^*$ for our hidden variables and give the parameters $\theta$ the luxury of a full distribution.

$z^* = \underset{z}{\operatorname{argmax}} E [\log p(\theta,z,D) ]_{q(\theta)} \\ q(\theta) = p(\theta|z^*,D)$

If our hidden variables $z$ are indicator variables, we suddenly have a computationally cheap method to perform inference on the number of clusters. This is in other words: model selection (or automatic relevance detection or imagine another fancy name).

5. Iterated conditional modes

Of course, the poster child of approximate inference is to use point estimates for both the parameters $\theta$ and the hidden variables $z$ .

$\theta^* = \underset{\theta}{\operatorname{argmax}} p(\theta,z^*,D) \\ z^* = \underset{z}{\operatorname{argmax}} p(\theta^*,z,D)$

To see how Maximization-Expectation plays out I highly recommend the article. In my opinion, the strength of this article is however not the application to a $k$ -means alternative, but this lucid and concise exposition of approximation.

— Anne van Rossum

(+1) this is a beautiful summary of all methods.

— kedarps

4

There is a useful optimisation technique underlying the EM algorithm. However, it's usually expressed in the language of probability theory so it's hard to see that at the core is a method that has nothing to do with probability and expectation.

Consider the problem of maximising $g(x)=\sum_i\exp(f_i(x))$ (or equivalently $\log g(x)$ ) with respect to $x$ . If you write down an expression for $g'(x)$ and set it equal to zero you will often end up with a transcendental equation to solve. These can be nasty.

Now suppose that the $f_i$ play well together in the sense that linear combinations of them give you something easy to optimise. For example, if all of the $f_i(x)$ are quadratic in $x$ then a linear combination of the $f_i(x)$ will also be quadratic, and hence easy to optimise.

Given this supposition, it'd be cool if, in order to optimise $\log g(x)=\log \sum_i\exp(f_i(x))$ we could somehow shuffle the $\log$ past the $\sum$ so it could meet the $\exp$ s and eliminate them. Then the $f_i$ could play together. But we can't do that.

Let's do the next best thing. We'll make another function $h$ that is similar to $g$ . And we'll make it out of linear combinations of the $f_i$ .

Let's say $x_0$ is a guess for an optimal value. We'd like to improve this. Let's find another function $h$ that matches $g$ and its derivative at $x_0$ , i.e. $g(x_0)=h(x_0)$ and $g'(x_0)=h'(x_0)$ . If you plot a graph of $h$ in a small neighbourhood of $x_0$ it's going to look similar to $g$ .

You can show that

$g'(x)=\sum_i f_i'(x)\exp(f_i(x)).$

We want something that matches this at $x_0$ . There's a natural choice:

$h(x)=\mbox{constant}+\sum_i f_i(x)\exp(f_i(x_0)).$

You can see they match at $x=x_0$ . We get

$h'(x)=\sum_i f_i'(x)\exp(f_i(x_0)).$

As $x_0$ is a constant we have a simple linear combination of the $f_i$ whose derivative matches that of $g$ at $x_0$ . We just have to choose the constant in $h$ to make $g(x_0)=h(x_0)$ .

So starting with $x_0$ , we form $h(x)$ and optimise that. Because it's similar to $g(x)$ in the neighbourhood of $x_0$ we hope the optimum of $h$ is similar to the optimum of $g$ . Once you have a new estimate, construct the next $h$ and repeat.

I hope this has motivated the choice of $h$ . This is exactly the procedure that takes place in EM.

But there's one more important point. Using Jensen's inequality you can show that $h(x)\le g(x)$ . This means that when you optimise $h(x)$ you always get an $x$ that makes $g$ bigger compared to $g(x_0)$ . So even though $h$ was motivated by its local similarity to $g$ , it's safe to globally maximise $h$ at each iteration. The hope I mentioned above isn't required.

This also gives a clue to when to use EM: when linear combinations of the arguments to the $\exp$ function are easier to optimise. For example when they're quadratic - as happens when working with mixtures of Gaussians. This is particularly relevant to statistics where many of the standard distributions are from exponential families.
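
Here is a minimal sketch of that procedure for the quadratic case, with made-up coefficients. It assumes $f_i(x) = b_i - \frac{1}{2} a_i (x - c_i)^2$ with $a_i > 0$, so that $h$ is a concave quadratic whose maximiser has the closed form $x = \sum_i w_i a_i c_i / \sum_i w_i a_i$ with weights $w_i = \exp(f_i(x_0))$. The names `mm_step` and `g_prime` are illustrative.

```python
import numpy as np

# Hypothetical quadratic pieces f_i(x) = b_i - 0.5 * a_i * (x - c_i)^2, with a_i > 0
a = np.array([1.0, 2.0, 0.5])
c = np.array([-1.0, 0.5, 2.0])
b = np.array([0.0, -0.3, 0.2])

def f(x):
    return b - 0.5 * a * (x - c) ** 2

def g(x):
    # The objective: a sum of exponentials of the f_i
    return np.exp(f(x)).sum()

def g_prime(x):
    # g'(x) = sum_i f_i'(x) * exp(f_i(x)), with f_i'(x) = -a_i * (x - c_i)
    return np.sum(-a * (x - c) * np.exp(f(x)))

def mm_step(x0):
    # Maximise h(x) = constant + sum_i w_i * f_i(x), where w_i = exp(f_i(x0)).
    # Setting h'(x) = 0 gives a weighted average of the centres c_i.
    w = np.exp(f(x0))
    return np.sum(w * a * c) / np.sum(w * a)

x = -3.0
values = [g(x)]
for _ in range(100):
    x = mm_step(x)
    values.append(g(x))
values = np.array(values)

print("final x:", x)
print("g never decreased:", bool(np.all(np.diff(values) >= -1e-12)))
print("g'(x) at the end:", g_prime(x))
```

Each step only needs a weighted average, yet $g$ goes up at every iteration and the iterate ends at a stationary point of $g$, which is the behaviour the Jensen bound promises.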

— Dan Piponi

3

As you said, I will not go into technical details. There are quite a few very nice tutorials. One of my favourites is Andrew Ng's lecture notes. Take a look also at the references here.

EM is naturally motivated in mixture models and, more generally, in models with hidden factors. Take for example the case of Gaussian mixture models (GMM). Here we model the density of the observations as a weighted sum of $K$ Gaussians:

$p(x) = \sum_{i=1}^{K}\pi_{i} \mathcal{N}(x|\mu_{i}, \Sigma_{i})$

where $\pi_{i}$ is the probability that the sample $x$ was caused/generated by the ith component, $\mu_{i}$ is the mean of that component, and $\Sigma_{i}$ is its covariance matrix. The way to understand this expression is the following: each data sample has been generated/caused by one component, but we do not know which one. The approach is then to express the uncertainty in terms of probability ( $\pi_{i}$ represents the chances that the ith component can account for that sample), and take the weighted sum.

As a concrete example, imagine you want to cluster text documents. The idea is to assume that each document belongs to a topic (science, sports, ...) which you do not know beforehand. The possible topics are hidden variables. Then you are given a bunch of documents, and by counting n-grams or whatever features you extract, you want to find those clusters and see which cluster each document belongs to. EM is a procedure which attacks this problem step-wise: the expectation step attempts to improve the assignments of the samples achieved so far, and in the maximization step you improve the parameters of the mixture, in other words, the form of the clusters.

The point is not using monotonic functions but convex functions (or concave ones, as with the logarithm). The reason is Jensen's inequality, which ensures that the estimates of the EM algorithm improve, or at least do not get worse, at every step.

— jpmuc