Estimating KL (Kullback–Leibler) divergence with Monte Carlo



I want to estimate the KL divergence between two continuous distributions $f$ and $g$. However, I cannot write down the density of either $f$ or $g$. I can sample from both $f$ and $g$ via some method (for example, Markov chain Monte Carlo).

The KL divergence from $f$ to $g$ is defined as

$$D_{KL}(f\,||\,g) = \int f(x)\,\log\!\left(\frac{f(x)}{g(x)}\right)dx$$

Since this is an expectation of $\log\!\left(\frac{f(x)}{g(x)}\right)$ with respect to $f$, we can imagine a Monte Carlo estimate:

$$\frac{1}{N}\sum_i^N \log\!\left(\frac{f(x_i)}{g(x_i)}\right)$$

where $i$ indexes $N$ samples drawn from $f$ (i.e. $x_i \sim f(\cdot)$ for $i = 1, \ldots, N$).
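For intuition, when both densities *are* available, this estimator is a one-liner. A minimal sketch, assuming two hypothetical univariate Gaussians $f = N(0,1)$ and $g = N(1,2^2)$ (chosen only because their KL has a closed form to compare against):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical example: f = N(0, 1), g = N(1, 2^2), densities known here
f, g = norm(0, 1), norm(1, 2)

N = 100_000
x = f.rvs(N, random_state=rng)                 # x_i ~ f
kl_mc = np.mean(f.logpdf(x) - g.logpdf(x))     # (1/N) sum log(f(x_i)/g(x_i))

# Closed form for two univariate Gaussians, for comparison
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_mc, kl_exact)   # both close to 0.443
```

The question below is precisely about what to do when the `logpdf` calls in this sketch are unavailable.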

However, since I don't know $f(\cdot)$ and $g(\cdot)$, I can't even use this Monte Carlo estimate. What is the standard way of estimating the KL divergence in this situation?

EDIT: I do NOT know the unnormalized density for either f() or g()


Have you considered using the ECDFs?
Toby

This will work, but it can be arbitrarily slow for hard choices of f and g (close, or with close tails). If you decide to ignore samples away from the tails, then you might have more luck with upper bounding the ROC.
Christian Chapman

Answers:



I assume you can evaluate $f$ and $g$ up to a normalizing constant. Denote $f(x) = f_u(x)/c_f$ and $g(x) = g_u(x)/c_g$.

A consistent estimator that may be used is

$$\widehat{D_{KL}}(f\,||\,g) = \left[\frac{1}{n}\sum_j \frac{f_u(x_j)}{\pi_f(x_j)}\right]^{-1} \frac{1}{N}\sum_i^N \left[\log\!\left(\frac{f_u(z_i)}{g_u(z_i)}\right)\frac{f_u(z_i)}{\pi_r(z_i)}\right] - \log \hat{r}$$
where
$$\hat{r} = \frac{\frac{1}{n}\sum_j f_u(x_j)/\pi_f(x_j)}{\frac{1}{n}\sum_j g_u(y_j)/\pi_g(y_j)} \tag{1}$$
is an importance sampling estimator for the ratio $c_f/c_g$. Here you use $\pi_f$ and $\pi_g$ as instrumental densities for $f_u$ and $g_u$ respectively, and $\pi_r$ to target the log ratio of unnormalized densities.

So let $\{x_i\} \sim \pi_f$, $\{y_i\} \sim \pi_g$, and $\{z_i\} \sim \pi_r$. The numerator of (1) converges to $c_f$. The denominator converges to $c_g$. The ratio is consistent by the continuous mapping theorem. The log of the ratio is consistent by continuous mapping again.

Regarding the other part of the estimator,

$$\frac{1}{N}\sum_i^N \left[\log\!\left(\frac{f_u(z_i)}{g_u(z_i)}\right)\frac{f_u(z_i)}{\pi_r(z_i)}\right] \overset{a.s.}{\to} c_f\, E_f\!\left[\log\!\left(\frac{f_u(x)}{g_u(x)}\right)\right]$$
by the law of large numbers.

My motivation is the following:

$$\begin{aligned}
D_{KL}(f\,||\,g) &= \int f(x)\log\!\left(\frac{f(x)}{g(x)}\right)dx \\
&= \int f(x)\left\{\log\!\left[\frac{f_u(x)}{g_u(x)}\right] + \log\!\left[\frac{c_g}{c_f}\right]\right\}dx \\
&= E_f\!\left[\log\frac{f_u(x)}{g_u(x)}\right] + \log\!\left[\frac{c_g}{c_f}\right] \\
&= c_f^{-1}\, E_{\pi_r}\!\left[\log\frac{f_u(x)}{g_u(x)}\,\frac{f_u(x)}{\pi_r(x)}\right] + \log\!\left[\frac{c_g}{c_f}\right].
\end{aligned}$$
So I just break it up into tractable pieces.
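As a sanity check of those pieces, here is a minimal sketch of the estimator. The Gaussian choices are hypothetical and not from the answer: $f_u(x)=e^{-x^2/2}$ (so $c_f=\sqrt{2\pi}$), $g_u(x)=e^{-(x-1)^2/8}$ (so $c_g=\sqrt{8\pi}$ and $c_f/c_g=1/2$), and, for simplicity, a single wide Gaussian serves as all three instrumental densities $\pi_f$, $\pi_g$, $\pi_r$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical unnormalized densities: f = N(0,1), g = N(1, 2^2)
f_u = lambda x: np.exp(-x**2 / 2)            # c_f = sqrt(2*pi)
g_u = lambda x: np.exp(-(x - 1)**2 / 8)      # c_g = sqrt(8*pi)

# One wide instrumental density used for pi_f, pi_g, and pi_r
pi = norm(0, 3)

n = 200_000
x = pi.rvs(n, random_state=rng)   # {x_j} for the c_f estimate
y = pi.rvs(n, random_state=rng)   # {y_j} for the c_g estimate
z = pi.rvs(n, random_state=rng)   # {z_i} for the expectation term

c_f_hat = np.mean(f_u(x) / pi.pdf(x))
c_g_hat = np.mean(g_u(y) / pi.pdf(y))
r_hat = c_f_hat / c_g_hat                    # estimates c_f / c_g = 1/2

log_ratio = np.log(f_u(z) / g_u(z))
term = np.mean(log_ratio * f_u(z) / pi.pdf(z)) / c_f_hat

kl_hat = term - np.log(r_hat)
print(kl_hat)   # close to the true KL of 0.443 for these Gaussians
```

Note that the single wide instrumental density is a convenience for a 1-D toy problem; in practice each of $\pi_f$, $\pi_g$, $\pi_r$ should be tailored to its integrand, per the variance caveat in the comment below.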

For more ideas on how to simulate the likelihood ratio, I found a paper that has a few: https://projecteuclid.org/download/pdf_1/euclid.aos/1031594732


(+1) It's worth noting here that importance sampling can have extremely high variance (even infinite variance) if the target distribution has fatter tails than the distribution you're sampling from and/or the number of dimensions is at all large.
David J. Harris

@DavidJ.Harris very very true
Taylor


Here I assume that you can only sample from the models; an unnormalized density function is not available.

You write that

$$D_{KL}(f\,||\,g) = \int f(x)\,\log\bigg(\underbrace{\frac{f(x)}{g(x)}}_{=:\,r}\bigg)\,dx,$$

where I have defined the ratio of probabilities to be $r$. Alex Smola writes, although in a different context, that you can estimate these ratios "easily" by just training a classifier. Let us assume you have obtained a classifier $p(f|x)$, which can tell you the probability that an observation $x$ has been generated by $f$. Note that $p(g|x) = 1 - p(f|x)$. Then:

$$r = \frac{p(x|f)}{p(x|g)} = \frac{p(f|x)\,p(x)\,p(g)}{p(g|x)\,p(x)\,p(f)} = \frac{p(f|x)}{p(g|x)},$$

where the first step is due to Bayes' theorem and the last follows from the assumption that $p(g) = p(f)$.

Getting such a classifier can be quite easy for two reasons.

First, you can do stochastic updates. That means that if you are using a gradient-based optimizer, as is typical for logistic regression or neural networks, you can just draw samples from each of $f$ and $g$ and make an update.

Second, as you have virtually unlimited data–you can just sample f and g to death–you don't have to worry about overfitting or the like.
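A minimal sketch of this approach, under assumptions not in the answer: the samplers are hypothetical Gaussians $f = N(0,1)$ and $g = N(1,2^2)$ (only their samples are used), and the classifier is a scikit-learn logistic regression on quadratic features, which happens to contain the exact Gaussian log-ratio:

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical samplers: f = N(0,1), g = N(1, 2^2); only samples are used
N = 100_000
xf = norm(0, 1).rvs(N, random_state=rng)
xg = norm(1, 2).rvs(N, random_state=rng)

# Balanced training set so that p(f) = p(g); the features [x, x^2] make
# the Gaussian log-ratio exactly representable by logistic regression
X = np.concatenate([xf, xg])
features = np.column_stack([X, X**2])
labels = np.concatenate([np.ones(N), np.zeros(N)])   # 1 = "came from f"

clf = LogisticRegression(C=1e6, max_iter=2000).fit(features, labels)

# KL(f||g) = E_f[log r(x)] with r(x) = p(f|x) / p(g|x)
pf = clf.predict_proba(np.column_stack([xf, xf**2]))[:, 1]
kl_hat = np.mean(np.log(pf / (1 - pf)))
print(kl_hat)   # close to the true KL of 0.443 for these Gaussians
```

For less convenient distributions one would swap in a more flexible classifier (e.g. a small neural network), at the cost of the representability guarantee above.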



Besides the probabilistic classifier method mentioned by @bayerj, you can also use the lower bound of the KL divergence derived in [1-2]:

$$KL[f\,\|\,g] \geq \sup_{T}\left\{ E_{x\sim f}[T(x)] - E_{x\sim g}[\exp(T(x)-1)] \right\},$$
where $T: \mathcal{X} \to \mathbb{R}$ is an arbitrary function. Under some mild conditions, the bound is tight for
$$T(x) = 1 + \ln\!\left[\frac{f(x)}{g(x)}\right]$$

To estimate the KL divergence between $f$ and $g$, we maximize the lower bound w.r.t. the function $T(x)$.
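A minimal sketch of that maximization, under assumptions not in the answer: hypothetical Gaussian samplers $f = N(0,1)$ and $g = N(1,2^2)$, and a parametric critic $T(x) = a + bx + cx^2$ (for Gaussians the optimal $T = 1 + \ln(f/g)$ is quadratic, so this family contains the maximizer) trained by full-batch gradient ascent:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Hypothetical samplers: f = N(0,1), g = N(1, 2^2); only samples are used
N = 50_000
xf = norm(0, 1).rvs(N, random_state=rng)
xg = norm(1, 2).rvs(N, random_state=rng)

# Critic T(x) = a + b*x + c*x^2, parameterized by theta = (a, b, c)
feats_f = np.stack([np.ones_like(xf), xf, xf**2])   # shape (3, N)
feats_g = np.stack([np.ones_like(xg), xg, xg**2])
theta = np.zeros(3)

# The objective E_f[T] - E_g[exp(T - 1)] is concave in theta (T is linear
# in theta and -exp is concave), so plain gradient ascent converges
lr = 0.02
for _ in range(5000):
    w = np.exp(theta @ feats_g - 1.0)               # exp(T(y_j) - 1)
    grad = feats_f.mean(axis=1) - (feats_g * w).mean(axis=1)
    theta += lr * grad

kl_lb = (theta @ feats_f).mean() - np.exp(theta @ feats_g - 1.0).mean()
print(kl_lb)   # close to the true KL of 0.443 for these Gaussians
```

In the general setting of [2], $T$ is a neural network trained the same way; the quadratic critic here is just the smallest family that makes the bound tight for this toy problem.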

References:

[1] Nguyen, X., Wainwright, M.J. and Jordan, M.I., 2010. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), pp.5847-5861.

[2] Nowozin, S., Cseke, B. and Tomioka, R., 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems (pp. 271-279).

Licensed under cc by-sa 3.0 with attribution required.