Estimating KL (Kullback–Leibler) divergence with Monte Carlo



I want to estimate the KL divergence between two continuous distributions $f$ and $g$. However, I cannot write down the density of either $f$ or $g$. I can sample from both $f$ and $g$ via some method (for example, Markov chain Monte Carlo).

The KL divergence from $f$ to $g$ is defined as

$$D_{KL}(f\,||\,g) = \int f(x)\,\log\!\left(\frac{f(x)}{g(x)}\right)dx$$

Since this is an expectation of $\log\!\left(\frac{f(x)}{g(x)}\right)$ with respect to $f$, we can imagine a Monte Carlo estimate:

$$\frac{1}{N}\sum_i^N \log\!\left(\frac{f(x_i)}{g(x_i)}\right)$$

where $i$ indexes $N$ samples drawn from $f$ (i.e. $x_i \sim f(\cdot)$ for $i = 1, \ldots, N$).
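For intuition, when both densities *are* available, this estimator is a one-liner. A minimal sketch, assuming two hypothetical univariate Gaussians $f = N(0,1)$ and $g = N(1,2^2)$ (chosen only because their KL has a closed form to compare against):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical example: f = N(0, 1), g = N(1, 2^2), densities known here
f, g = norm(0, 1), norm(1, 2)

N = 100_000
x = f.rvs(N, random_state=rng)                 # x_i ~ f
kl_mc = np.mean(f.logpdf(x) - g.logpdf(x))     # (1/N) sum log(f(x_i)/g(x_i))

# Closed form for two univariate Gaussians, for comparison
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_mc, kl_exact)   # both close to 0.443
```

The question below is precisely about what to do when the `logpdf` calls in this sketch are unavailable.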

However, since I don't know $f(\cdot)$ and $g(\cdot)$, I can't even use this Monte Carlo estimate. What is the standard way of estimating the KL divergence in this situation?

EDIT: I do NOT know the unnormalized density for either f() or g()


Have you considered using the ECDFs?
Toby

This will work, but it can be arbitrarily slow for hard choices of f and g (close, or with close tails). If you decide to ignore samples away from the tails, then you might have more luck with upper bounding the ROC.
Christian Chapman

Answers:



I assume you can evaluate $f$ and $g$ up to a normalizing constant. Denote $f(x) = f_u(x)/c_f$ and $g(x) = g_u(x)/c_g$.

A consistent estimator that may be used is

$$\widehat{D_{KL}}(f\,||\,g) = \left[\frac{1}{n}\sum_j \frac{f_u(x_j)}{\pi_f(x_j)}\right]^{-1} \frac{1}{N}\sum_i^N \left[\log\!\left(\frac{f_u(z_i)}{g_u(z_i)}\right)\frac{f_u(z_i)}{\pi_r(z_i)}\right] - \log \hat{r}$$
where
$$\hat{r} = \frac{\frac{1}{n}\sum_j f_u(x_j)/\pi_f(x_j)}{\frac{1}{n}\sum_j g_u(y_j)/\pi_g(y_j)} \tag{1}$$
is an importance sampling estimator for the ratio $c_f/c_g$. Here you use $\pi_f$ and $\pi_g$ as instrumental densities for $f_u$ and $g_u$ respectively, and $\pi_r$ to target the log ratio of unnormalized densities.

So let $\{x_i\} \sim \pi_f$, $\{y_i\} \sim \pi_g$, and $\{z_i\} \sim \pi_r$. The numerator of (1) converges to $c_f$. The denominator converges to $c_g$. The ratio is consistent by the continuous mapping theorem. The log of the ratio is consistent by continuous mapping again.

Regarding the other part of the estimator,

$$\frac{1}{N}\sum_i^N \left[\log\!\left(\frac{f_u(z_i)}{g_u(z_i)}\right)\frac{f_u(z_i)}{\pi_r(z_i)}\right] \overset{a.s.}{\to} c_f\, E_f\!\left[\log\!\left(\frac{f_u(x)}{g_u(x)}\right)\right]$$
by the law of large numbers.

My motivation is the following:

$$\begin{aligned}
D_{KL}(f\,||\,g) &= \int f(x)\log\!\left(\frac{f(x)}{g(x)}\right)dx \\
&= \int f(x)\left\{\log\!\left[\frac{f_u(x)}{g_u(x)}\right] + \log\!\left[\frac{c_g}{c_f}\right]\right\}dx \\
&= E_f\!\left[\log\frac{f_u(x)}{g_u(x)}\right] + \log\!\left[\frac{c_g}{c_f}\right] \\
&= c_f^{-1}\, E_{\pi_r}\!\left[\log\frac{f_u(x)}{g_u(x)}\,\frac{f_u(x)}{\pi_r(x)}\right] + \log\!\left[\frac{c_g}{c_f}\right].
\end{aligned}$$
So I just break it up into tractable pieces.
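As a sanity check of those pieces, here is a minimal sketch of the estimator. The Gaussian choices are hypothetical and not from the answer: $f_u(x)=e^{-x^2/2}$ (so $c_f=\sqrt{2\pi}$), $g_u(x)=e^{-(x-1)^2/8}$ (so $c_g=\sqrt{8\pi}$ and $c_f/c_g=1/2$), and, for simplicity, a single wide Gaussian serves as all three instrumental densities $\pi_f$, $\pi_g$, $\pi_r$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical unnormalized densities: f = N(0,1), g = N(1, 2^2)
f_u = lambda x: np.exp(-x**2 / 2)            # c_f = sqrt(2*pi)
g_u = lambda x: np.exp(-(x - 1)**2 / 8)      # c_g = sqrt(8*pi)

# One wide instrumental density used for pi_f, pi_g, and pi_r
pi = norm(0, 3)

n = 200_000
x = pi.rvs(n, random_state=rng)   # {x_j} for the c_f estimate
y = pi.rvs(n, random_state=rng)   # {y_j} for the c_g estimate
z = pi.rvs(n, random_state=rng)   # {z_i} for the expectation term

c_f_hat = np.mean(f_u(x) / pi.pdf(x))
c_g_hat = np.mean(g_u(y) / pi.pdf(y))
r_hat = c_f_hat / c_g_hat                    # estimates c_f / c_g = 1/2

log_ratio = np.log(f_u(z) / g_u(z))
term = np.mean(log_ratio * f_u(z) / pi.pdf(z)) / c_f_hat

kl_hat = term - np.log(r_hat)
print(kl_hat)   # close to the true KL of 0.443 for these Gaussians
```

Note that the single wide instrumental density is a convenience for a 1-D toy problem; in practice each of $\pi_f$, $\pi_g$, $\pi_r$ should be tailored to its integrand, per the variance caveat in the comment below.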

For more ideas on how to simulate the likelihood ratio, I found a paper that has a few: https://projecteuclid.org/download/pdf_1/euclid.aos/1031594732


(+1) It's worth noting here that importance sampling can have extremely high variance (even infinite variance) if the target distribution has fatter tails than the distribution you're sampling from and/or the number of dimensions is at all large.
David J. Harris

@DavidJ.Harris very very true
Taylor


Here I assume that you can only sample from the models; an unnormalized density function is not available.

You write that

$$D_{KL}(f\,||\,g) = \int f(x)\,\log\bigg(\underbrace{\frac{f(x)}{g(x)}}_{=:\,r}\bigg)\,dx,$$

where I have defined the ratio of probabilities to be $r$. Alex Smola writes, although in a different context, that you can estimate these ratios "easily" by just training a classifier. Let us assume you have obtained a classifier $p(f|x)$, which can tell you the probability that an observation $x$ has been generated by $f$. Note that $p(g|x) = 1 - p(f|x)$. Then:

$$r = \frac{p(x|f)}{p(x|g)} = \frac{p(f|x)\,p(x)\,p(g)}{p(g|x)\,p(x)\,p(f)} = \frac{p(f|x)}{p(g|x)},$$

where the first step is due to Bayes' theorem and the last follows from the assumption that $p(g) = p(f)$.

Getting such a classifier can be quite easy for two reasons.

First, you can do stochastic updates. That means that if you are using a gradient-based optimizer, as is typical for logistic regression or neural networks, you can just draw samples from each of $f$ and $g$ and make an update.

Second, as you have virtually unlimited data–you can just sample f and g to death–you don't have to worry about overfitting or the like.
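A minimal sketch of this approach, under assumptions not in the answer: the samplers are hypothetical Gaussians $f = N(0,1)$ and $g = N(1,2^2)$ (only their samples are used), and the classifier is a scikit-learn logistic regression on quadratic features, which happens to contain the exact Gaussian log-ratio:

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical samplers: f = N(0,1), g = N(1, 2^2); only samples are used
N = 100_000
xf = norm(0, 1).rvs(N, random_state=rng)
xg = norm(1, 2).rvs(N, random_state=rng)

# Balanced training set so that p(f) = p(g); the features [x, x^2] make
# the Gaussian log-ratio exactly representable by logistic regression
X = np.concatenate([xf, xg])
features = np.column_stack([X, X**2])
labels = np.concatenate([np.ones(N), np.zeros(N)])   # 1 = "came from f"

clf = LogisticRegression(C=1e6, max_iter=2000).fit(features, labels)

# KL(f||g) = E_f[log r(x)] with r(x) = p(f|x) / p(g|x)
pf = clf.predict_proba(np.column_stack([xf, xf**2]))[:, 1]
kl_hat = np.mean(np.log(pf / (1 - pf)))
print(kl_hat)   # close to the true KL of 0.443 for these Gaussians
```

For less convenient distributions one would swap in a more flexible classifier (e.g. a small neural network), at the cost of the representability guarantee above.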



Besides the probabilistic classifier method mentioned by @bayerj, you can also use the lower bound of the KL divergence derived in [1-2]:

$$KL[f\,\|\,g] \geq \sup_{T}\left\{ E_{x\sim f}[T(x)] - E_{x\sim g}[\exp(T(x)-1)] \right\},$$
where $T: \mathcal{X} \to \mathbb{R}$ is an arbitrary function. Under some mild conditions, the bound is tight for
$$T(x) = 1 + \ln\!\left[\frac{f(x)}{g(x)}\right]$$

To estimate the KL divergence between $f$ and $g$, we maximize the lower bound w.r.t. the function $T(x)$.
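A minimal sketch of that maximization, under assumptions not in the answer: hypothetical Gaussian samplers $f = N(0,1)$ and $g = N(1,2^2)$, and a parametric critic $T(x) = a + bx + cx^2$ (for Gaussians the optimal $T = 1 + \ln(f/g)$ is quadratic, so this family contains the maximizer) trained by full-batch gradient ascent:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Hypothetical samplers: f = N(0,1), g = N(1, 2^2); only samples are used
N = 50_000
xf = norm(0, 1).rvs(N, random_state=rng)
xg = norm(1, 2).rvs(N, random_state=rng)

# Critic T(x) = a + b*x + c*x^2, parameterized by theta = (a, b, c)
feats_f = np.stack([np.ones_like(xf), xf, xf**2])   # shape (3, N)
feats_g = np.stack([np.ones_like(xg), xg, xg**2])
theta = np.zeros(3)

# The objective E_f[T] - E_g[exp(T - 1)] is concave in theta (T is linear
# in theta and -exp is concave), so plain gradient ascent converges
lr = 0.02
for _ in range(5000):
    w = np.exp(theta @ feats_g - 1.0)               # exp(T(y_j) - 1)
    grad = feats_f.mean(axis=1) - (feats_g * w).mean(axis=1)
    theta += lr * grad

kl_lb = (theta @ feats_f).mean() - np.exp(theta @ feats_g - 1.0).mean()
print(kl_lb)   # close to the true KL of 0.443 for these Gaussians
```

In the general setting of [2], $T$ is a neural network trained the same way; the quadratic critic here is just the smallest family that makes the bound tight for this toy problem.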

References:

[1] Nguyen, X., Wainwright, M.J. and Jordan, M.I., 2010. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), pp.5847-5861.

[2] Nowozin, S., Cseke, B. and Tomioka, R., 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems (pp. 271-279).

Licensed under cc by-sa 3.0 with attribution required.