What kind of information is Fisher information?


29

Suppose we have a random variable $X \sim f(x\mid\theta)$. If $\theta_0$ were the true parameter, the likelihood function should be maximized and its derivative should equal zero. This is the basic principle behind the maximum likelihood estimator.

As I understand it, Fisher information is defined as

$$I(\theta) = \operatorname{E}_\theta\left[\left(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right]$$

Thus, if $\theta_0$ is the true parameter, $I(\theta) = 0$. But if $\theta_0$ is not the true parameter, then we will have a larger amount of Fisher information.

My questions

  1. Does Fisher information measure the "error" of a given MLE? In other words, doesn't the existence of positive Fisher information imply that the MLE is not ideal?
  2. How does this definition of "information" differ from the one used by Shannon? Why do we call it "information"?

Why do you write it Eθ? The expectation is over values of X distributed as if they came from your distribution with parameter θ.
Neil G

3
Also I(θ) is not zero at the true parameter.
Neil G

E(S) is 0 (that is, the expectation of the score function), but as Neil G wrote, the Fisher information, V(S), is (usually) not zero.
Tal Galili

Answers:


15

Let us look at the loglikelihood function
$$\ell(\theta) = \log f(x;\theta)$$
as a function of $\theta$ for $\theta \in \Theta$, the parameter space. Assuming some regularity conditions we do not discuss here, we have $\operatorname{E}_\theta \frac{\partial}{\partial\theta}\ell(\theta) = \operatorname{E}_\theta \dot{\ell}(\theta) = 0$ (we will write derivatives with respect to the parameter as dots, as here). The variance is the Fisher information
$$I(\theta) = \operatorname{E}_\theta \big(\dot{\ell}(\theta)\big)^2 = -\operatorname{E}_\theta \ddot{\ell}(\theta),$$
the last formula showing that it is the (negative) curvature of the loglikelihood function. One often finds the maximum likelihood estimator (mle) of $\theta$ by solving the likelihood equation $\dot{\ell}(\theta) = 0$. When the Fisher information, as the variance of the score $\dot{\ell}(\theta)$, is large, then the solution to that equation will be very sensitive to the data, giving a hope for high precision of the mle. That is confirmed at least asymptotically, the asymptotic variance of the mle being the inverse of the Fisher information.
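
For concreteness, here is a minimal simulation sketch of that last claim; the normal-mean model, $\theta_0 = 2$, and the sample size are arbitrary illustrative choices. For $X_1,\ldots,X_n$ iid $N(\theta_0, 1)$ the mle is the sample mean, the per-observation Fisher information is $I(\theta) = 1$, so the variance of the mle should be close to $1/(n\,I(\theta)) = 1/n$.

```python
# Illustrative check: variance of the MLE vs. inverse Fisher information
# for X_1, ..., X_n iid N(theta0, 1); all numbers are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
theta0, n, n_reps = 2.0, 50, 20000

# Each row is one sample of size n; the MLE of theta is the row mean.
mles = rng.normal(loc=theta0, scale=1.0, size=(n_reps, n)).mean(axis=1)

print("empirical variance of the MLE:", mles.var())           # close to 0.02
print("inverse total Fisher information 1/(n*I):", 1.0 / n)   # 0.02, since I(theta) = 1
```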

How can we interpret this? $\ell(\theta)$ is the likelihood information about the parameter $\theta$ from the sample. This can really only be interpreted in a relative sense, like when we use it to compare the plausibilities of two distinct possible parameter values via the likelihood ratio test $\ell(\theta_0) - \ell(\theta_1)$. The rate of change of the loglikelihood is the score function $\dot{\ell}(\theta)$, which tells us how fast the likelihood changes, and its variance $I(\theta)$ how much this varies from sample to sample, at a given parameter value, say $\theta_0$. The equation (which is really surprising!)

$$I(\theta) = -\operatorname{E}_\theta \ddot{\ell}(\theta)$$
tells us there is a relationship (equality) between the variability in the information (likelihood) for a given parameter value, $\theta_0$, and the curvature of the likelihood function for that parameter value. This is a surprising relationship between the variability (variance) of the statistic $\dot{\ell}(\theta)\mid_{\theta=\theta_0}$ and the expected change in likelihood when we vary the parameter $\theta$ in some interval around $\theta_0$ (for the same data). This is really both strange, surprising and powerful!
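
As a quick Monte Carlo check of this equality, here is a sketch assuming a one-observation exponential model with rate $\theta_0 = 2$ (an arbitrary illustrative choice): with $\ell(\theta) = \log\theta - \theta x$ we have $\dot{\ell}(\theta) = 1/\theta - x$ and $\ddot{\ell}(\theta) = -1/\theta^2$, so both $\operatorname{Var}_\theta\big(\dot{\ell}(\theta)\big)$ and $-\operatorname{E}_\theta\ddot{\ell}(\theta)$ should come out to $1/\theta^2$.

```python
# Illustrative check of I(theta) = Var(score) = -E[l''(theta)] for an
# Exponential(rate = theta0) observation; theta0 is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0

# loglik   l(theta)   = log(theta) - theta * x
# score    l'(theta)  = 1/theta - x
# curvature l''(theta) = -1/theta**2  (non-random for this model)
x = rng.exponential(scale=1.0 / theta0, size=1_000_000)
score = 1.0 / theta0 - x

print("Var(score) at theta0: ", score.var())        # ~ 1/theta0**2 = 0.25
print("-E[l''(theta0)]:      ", 1.0 / theta0**2)    # exactly 0.25 here
```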

So what is the likelihood function? We usually think of the statistical model $\{f(x;\theta), \theta \in \Theta\}$ as a family of probability distributions for data $x$, indexed by the parameter $\theta$, some element of the parameter space $\Theta$. We think of this model as being true if there exists some value $\theta_0 \in \Theta$ such that the data $x$ actually have the probability distribution $f(x;\theta_0)$. So we get a statistical model by imbedding the true data-generating probability distribution $f(x;\theta_0)$ in a family of probability distributions. But, it is clear that such an imbedding can be done in many different ways, and each such imbedding will be a "true" model, and they will give different likelihood functions. And, without such an imbedding, there is no likelihood function. It seems that we really do need some help, some principles for how to choose an imbedding wisely!

So, what does this mean? It means that the choice of likelihood function tells us how we would expect the data to change, if the truth changed a little bit. But this cannot really be verified by the data, as the data only gives information about the true model function $f(x;\theta_0)$ which actually generated the data, and nothing about all the other elements in the chosen model. This way we see that the choice of likelihood function is similar to the choice of a prior in Bayesian analysis; it injects non-data information into the analysis. Let us look at this in a simple (somewhat artificial) example, and look at the effect of imbedding $f(x;\theta_0)$ in a model in different ways.

Let us assume that $X_1, \ldots, X_n$ are iid as $N(\mu = 10, \sigma^2 = 1)$. So, that is the true, data-generating distribution. Now, let us embed this in a model in two different ways, model A and model B.

$$A: X_1,\ldots,X_n \ \text{iid} \ N(\mu, \sigma^2 = 1), \ \mu \in \mathbb{R}$$
$$B: X_1,\ldots,X_n \ \text{iid} \ N(\mu, \mu/10), \ \mu > 0$$
You can check that the two models coincide for $\mu = 10$.

The loglikelihood functions become

$$\ell_A(\mu) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_i (x_i - \mu)^2$$
$$\ell_B(\mu) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\mu/10) - \frac{10}{2}\sum_i \frac{(x_i - \mu)^2}{\mu}$$

The score functions (loglikelihood derivatives) are

$$\dot{\ell}_A(\mu) = n(\bar{x} - \mu)$$
$$\dot{\ell}_B(\mu) = -\frac{n}{2\mu} + \frac{10}{2}\sum_i \frac{x_i^2}{\mu^2} - 5n$$
and the curvatures
$$\ddot{\ell}_A(\mu) = -n$$
$$\ddot{\ell}_B(\mu) = \frac{n}{2\mu^2} - \frac{10}{2}\sum_i \frac{2 x_i^2}{\mu^3}$$
so, the Fisher information does really depend on the imbedding. Now, we calculate the Fisher information at the true value $\mu = 10$,
$$I_A(\mu = 10) = n, \qquad I_B(\mu = 10) = n\left(-\frac{1}{200} + \frac{2020}{2000}\right) > n$$
so the Fisher information about the parameter is somewhat larger in model B.
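
These two values can also be checked by simulation. Here is a small sketch (sample size, seed, and replication count are arbitrary choices) that estimates each Fisher information as the variance of the corresponding score under the true $N(10, 1)$ distribution, where it should give $I_A(10) = n$ and $I_B(10) = 1.005\,n$.

```python
# Illustrative Monte Carlo check of I_A(10) = n and I_B(10) = 1.005*n,
# using Fisher information = variance of the score at the true parameter.
import numpy as np

rng = np.random.default_rng(2)
mu0, n, n_reps = 10.0, 5, 200_000

# n_reps samples of size n from the true data-generating distribution N(10, 1).
x = rng.normal(loc=mu0, scale=1.0, size=(n_reps, n))

score_A = (x - mu0).sum(axis=1)                                       # l'_A(mu0) = n(xbar - mu0)
score_B = -n / (2 * mu0) + (10 / 2) * (x**2 / mu0**2).sum(axis=1) - 5 * n

print("Var(score_A):", score_A.var(), "  expected n =", n)
print("Var(score_B):", score_B.var(), "  expected 1.005*n =", 1.005 * n)
```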

This illustrates that, in some sense, the Fisher information tells us how fast the information from the data about the parameter would have changed if the governing parameter changed in the way postulated by the imbedding in a model family. The explanation of the higher information in model B is that our model family B postulates that if the expectation had increased, then the variance would have increased too. So that, under model B, the sample variance will also carry information about $\mu$, which it will not do under model A.

Also, this example illustrates that we really do need some theory for helping us in how to construct model families.


1
Great explanation. Why do you say $\operatorname{E}_\theta \dot{\ell}(\theta) = 0$? It's a function of $\theta$ - isn't it 0 only when evaluated at the true parameter $\theta_0$?
ihadanny

1
Yes, what you say is true, @idadanny It is zero when evaluated at the true parameter value.
kjetil b halvorsen

Thanks again @kjetil - so just one more question: is the surprising relationship between the variance of the score and the curvature of the likelihood true for every θ? or only in the neighborhood of the true parameter θ0?
ihadanny

Again, that relationship is true for the true parameter value. But for that to be of much help, there must be continuity, so that it is approximately true in some neighborhood, since we will use it at the estimated value $\hat{\theta}$, not only at the true (unknown) value.
kjetil b halvorsen

so, the relationship holds for the true parameter θ0, it almost holds for θmle since we assume that it's in the neighborhood of θ0, but for a general θ1 it does not hold, right?
ihadanny

31

Let's think in terms of the negative log-likelihood function $\ell$. The negative score is its gradient with respect to the parameter value. At the true parameter, the score is zero. Otherwise, it gives the direction towards the minimum (or, in the case of non-convex $\ell$, a saddle point or local minimum or maximum).

The Fisher information measures the curvature of $\ell$ around $\theta$ if the data follows $\theta$. In other words, it tells you how much wiggling the parameter would affect your log-likelihood.
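
Here is a small sketch of the "wiggling" idea, assuming a normal-mean model with known $\sigma$ (a purely illustrative choice): the per-observation Fisher information for $\mu$ is $1/\sigma^2$, and moving the parameter by a small $\delta$ should raise the expected negative log-likelihood by about $I\,\delta^2/2$, so a larger Fisher information means a sharper, more curved log-likelihood.

```python
# Illustrative check: the expected increase in the per-observation negative
# log-likelihood for a small parameter wiggle delta is roughly I * delta^2 / 2,
# with I = 1/sigma^2 for the N(mu, sigma^2) mean; sigma values are arbitrary.
import numpy as np

rng = np.random.default_rng(3)
mu0, delta = 0.0, 0.1

for sigma in (1.0, 3.0):
    x = rng.normal(loc=mu0, scale=sigma, size=1_000_000)
    # Per-observation negative log-likelihood for the mean, constants dropped.
    nll = lambda mu: 0.5 * ((x - mu) ** 2).mean() / sigma**2
    bump = nll(mu0 + delta) - nll(mu0)
    print(f"sigma={sigma}: NLL increase for a wiggle of {delta} ~ {bump:.6f}, "
          f"I*delta^2/2 = {0.5 * delta**2 / sigma**2:.6f}")
```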

Consider that you had a big model with millions of parameters. And you had a small thumb drive on which to store your model. How should you prioritize how many bits of each parameter to store? The right answer is to allocate bits according to the Fisher information (Rissanen wrote about this). If the Fisher information of a parameter is zero, that parameter doesn't matter.

We call it "information" because the Fisher information measures how much this parameter tells us about the data.


A colloquial way to think about it is this: Suppose the parameters are driving a car, and the data is in the back seat correcting the driver. The annoyingness of the data is the Fisher information. If the data lets the driver drive, the Fisher information is zero; if the data is constantly making corrections, it's big. In this sense, the Fisher information is the amount of information going from the data to the parameters.

Consider what happens if you make the steering wheel more sensitive. This is equivalent to a reparametrization. In that case, the data doesn't want to be so loud for fear of the car oversteering. This kind of reparametrization decreases the Fisher information.


20

Complementary to @NeilG's nice answer (+1) and to address your specific questions:

  1. I would say it counts the "precision" rather than the "error" itself.

Remember that the negative of the Hessian of the log-likelihood evaluated at the ML estimates is the observed Fisher information. The estimated standard errors are the square roots of the diagonal elements of the inverse of the observed Fisher information matrix. Stemming from this, the Fisher information is the trace of the Fisher information matrix. Given that the Fisher information matrix $I$ is a Hermitian positive-semidefinite matrix, its diagonal entries $I_{j,j}$ are real and non-negative; as a direct consequence, its trace $\mathrm{tr}(I)$ must be positive. This means that you can have only "non-ideal" estimators according to your assertion. So no, a positive Fisher information is not related to how ideal your MLE is.
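
For example, here is a minimal numerical sketch of that recipe under an assumed one-parameter exponential model (the distribution, sample size, and finite-difference step are illustrative choices): approximate the observed Fisher information as the negative second derivative of the log-likelihood at the MLE and take the square root of its inverse as the estimated standard error. For this model the closed forms are $\hat\lambda = 1/\bar{x}$, observed information $n/\hat\lambda^2$, and SE $\hat\lambda/\sqrt{n}$, so the two computations can be compared.

```python
# Illustrative sketch: standard error from the observed Fisher information
# for an Exponential(rate = lambda) sample; all numbers are arbitrary choices.
import numpy as np

rng = np.random.default_rng(4)
lam_true, n = 1.5, 400
x = rng.exponential(scale=1.0 / lam_true, size=n)

def loglik(lam):
    # Log-likelihood of an Exponential(rate = lam) sample.
    return n * np.log(lam) - lam * x.sum()

lam_hat = 1.0 / x.mean()            # MLE, available in closed form for this model

# Observed Fisher information = negative second derivative of the log-likelihood
# at the MLE, approximated here by a central finite difference.
h = 1e-4
obs_info = -(loglik(lam_hat + h) - 2 * loglik(lam_hat) + loglik(lam_hat - h)) / h**2
se_numeric = np.sqrt(1.0 / obs_info)

print("MLE:", lam_hat)
print("SE from observed information:", se_numeric)
print("closed-form SE lam_hat/sqrt(n):", lam_hat / np.sqrt(n))
```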

  2. The definition differs in the way we interpret the notion of information in both cases. Having said that, the two measurements are closely related.

The inverse of the Fisher information is the minimum variance of an unbiased estimator (Cramér–Rao bound). In that sense the information matrix indicates how much information about the estimated coefficients is contained in the data. In contrast, the Shannon entropy was taken from thermodynamics. It relates the information content of a particular value of a variable as $-p\log_2(p)$, where $p$ is the probability of the variable taking on that value. Both are measurements of how "informative" a variable is. In the first case, though, you judge this information in terms of precision, while in the second case in terms of disorder; different sides, same coin! :D
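
A small numerical illustration of both quantities, assuming a Bernoulli($p$) model with $p = 0.3$ (an arbitrary choice): the per-observation Fisher information is $1/(p(1-p))$, the Cramér–Rao bound for the sample proportion from $n$ observations is $p(1-p)/n$ and is attained, while the Shannon entropy $-p\log_2 p - (1-p)\log_2(1-p)$ measures the unpredictability of the outcome itself.

```python
# Illustrative comparison: Fisher information / Cramer-Rao bound vs. Shannon
# entropy for a Bernoulli(p) model; p, n, and n_reps are arbitrary choices.
import numpy as np

rng = np.random.default_rng(5)
p, n, n_reps = 0.3, 100, 100_000

# Sample proportion from n Bernoulli(p) trials, repeated n_reps times.
p_hats = rng.binomial(n, p, size=n_reps) / n

fisher_info = 1.0 / (p * (1.0 - p))                        # per-observation Fisher information
crb = 1.0 / (n * fisher_info)                              # Cramer-Rao bound = p(1-p)/n
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)       # Shannon entropy in bits

print("empirical Var(p_hat):", p_hats.var())               # ~ 0.0021, attains the bound
print("Cramer-Rao bound:    ", crb)                        # 0.0021
print("Shannon entropy H(p):", entropy, "bits")            # ~ 0.881
```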

To recap: The inverse of the Fisher information matrix $I$ evaluated at the ML estimator values is the asymptotic or approximate covariance matrix. As these ML estimator values are found at a local minimum (of the negative log-likelihood), graphically the Fisher information shows how deep that minimum is and how much wiggle room you have around it. I found this paper by Lutwak et al. on Extensions of Fisher information and Stam’s inequality an informative read on this matter. The Wikipedia articles on the Fisher Information Metric and on Jensen–Shannon divergence are also good to get you started.

Licensed under cc by-sa 3.0 with attribution required.