$$\ell(\theta) = \log f(x;\theta)$$
as a function of $\theta$ for $\theta \in \Theta$, the parameter space.
Assuming some regularity conditions we do not discuss here, we have
$$\operatorname{E}_\theta \frac{\partial}{\partial\theta} \ell(\theta) = \operatorname{E}_\theta \dot{\ell}(\theta) = 0$$
(we will write derivatives with respect to the parameter with dots, as here). The variance of this derivative, the score function $\dot{\ell}(\theta)$, is the Fisher information
$$I(\theta) = \operatorname{E}_\theta \big(\dot{\ell}(\theta)\big)^2 = -\operatorname{E}_\theta \ddot{\ell}(\theta),$$
the last formula showing that it is the negative expected curvature of the loglikelihood function. One often finds the maximum likelihood estimator (mle) of $\theta$ by solving the likelihood equation $\dot{\ell}(\theta) = 0$. When the Fisher information, being the variance of the score $\dot{\ell}(\theta)$, is large, the solution of that equation will be very sensitive to the data, giving hope for high precision of the mle. That is confirmed at least asymptotically: the asymptotic variance of the mle is the inverse of the Fisher information.
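To make the identity above concrete, here is a minimal Monte Carlo sketch in Python (assuming only numpy; the Poisson($\theta$) model is just an illustrative choice, not the example used later): the simulated mean of the score is near zero, and its variance matches the negative mean curvature, i.e. the Fisher information $n/\theta$.

```python
import numpy as np

# Monte Carlo check of E[score] = 0 and Var[score] = -E[curvature] = I(theta)
# for an iid Poisson(theta) sample (an illustrative model, not the one used later).
rng = np.random.default_rng(0)
theta0, n, reps = 3.0, 25, 200_000

x = rng.poisson(theta0, size=(reps, n))      # many simulated samples of size n
score = x.sum(axis=1) / theta0 - n           # score at theta0: sum(x)/theta - n
curvature = -x.sum(axis=1) / theta0**2       # second derivative: -sum(x)/theta^2

print(score.mean())        # close to 0
print(score.var())         # close to I(theta0) = n/theta0 = 8.33...
print(-curvature.mean())   # also close to n/theta0, matching the variance above
```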
How can we interpret this? $\ell(\theta)$ is the likelihood information about the parameter $\theta$ from the sample. This can really only be interpreted in a relative sense, as when we use it to compare the plausibilities of two distinct parameter values via the loglikelihood difference $\ell(\theta_0) - \ell(\theta_1)$ in the likelihood ratio test. The score function $\dot{\ell}(\theta)$, the rate of change of the loglikelihood, tells us how fast the loglikelihood changes, and its variance $I(\theta)$ tells us how much this varies from sample to sample at a given parameter value, say $\theta_0$. The equation (which is really surprising!)
$$I(\theta) = -\operatorname{E}_\theta \ddot{\ell}(\theta)$$
tells us there is a relationship (an equality, in fact) between the variability of the information (loglikelihood) for a given parameter value, $\theta_0$, and the curvature of the likelihood function for that parameter value. It is a surprising relationship between the variability (variance) of the statistic $\dot{\ell}(\theta)\mid_{\theta=\theta_0}$ and the expected change in the likelihood when we vary the parameter $\theta$ in some interval around $\theta_0$ (for the same data). This is really both strange, surprising and powerful!
So what is the likelihood function? We usually think of the statistical model $\{ f(x;\theta), \theta \in \Theta \}$ as a family of probability distributions for the data $x$, indexed by the parameter $\theta$, some element of the parameter space $\Theta$. We think of this model as being true if there exists some value $\theta_0 \in \Theta$ such that the data $x$ actually have the probability distribution $f(x;\theta_0)$. So we get a statistical model by imbedding the true data-generating probability distribution $f(x;\theta_0)$ in a family of probability distributions. But it is clear that such an imbedding can be done in many different ways, each such imbedding will be a "true" model, and the different imbeddings will give different likelihood functions. And without such an imbedding, there is no likelihood function. It seems that we really do need some help, some principles for how to choose an imbedding wisely!
So, what does this mean? It means that the choice of likelihood function tells us how we would expect the data to change if the truth changed a little bit. But this cannot really be verified by the data, as the data only give information about the true model function $f(x;\theta_0)$ which actually generated them, and nothing about all the other elements in the chosen model. In this way we see that the choice of likelihood function is similar to the choice of a prior in Bayesian analysis: it injects non-data information into the analysis. Let us look at this in a simple (somewhat artificial) example, and study the effect of imbedding $f(x;\theta_0)$ in a model in different ways.
Let us assume that $X_1, \dotsc, X_n$ are iid $N(\mu=10, \sigma^2=1)$. So that is the true, data-generating distribution. Now let us imbed this in a model in two different ways, model A and model B:
$$\begin{aligned}
A &\colon X_1, \dotsc, X_n \ \text{iid} \ N(\mu, \sigma^2 = 1), \quad \mu \in \mathbb{R} \\
B &\colon X_1, \dotsc, X_n \ \text{iid} \ N(\mu, \mu/10), \quad \mu > 0
\end{aligned}$$
You can check that the two models coincide for $\mu = 10$.
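As a quick sanity check of that claim, here is a tiny sketch (assuming numpy and scipy are available) comparing the two densities at $\mu = 10$, where model B has variance $\mu/10 = 1$:

```python
import numpy as np
from scipy.stats import norm

# At mu = 10, model A is N(10, 1) and model B is N(10, 10/10) = N(10, 1):
# the same distribution, so both imbeddings contain the true data-generating law.
x = np.linspace(5, 15, 11)
pdf_A = norm.pdf(x, loc=10, scale=1.0)              # model A density at mu = 10
pdf_B = norm.pdf(x, loc=10, scale=np.sqrt(10/10))   # model B density at mu = 10
print(np.allclose(pdf_A, pdf_B))                    # True
```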
The loglikelihood functions become
$$\begin{aligned}
\ell_A(\mu) &= -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_i (x_i - \mu)^2 \\
\ell_B(\mu) &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\mu/10) - \frac{10}{2}\sum_i \frac{(x_i - \mu)^2}{\mu}
\end{aligned}$$
The score functions (the loglikelihood derivatives) are
$$\begin{aligned}
\dot{\ell}_A(\mu) &= n(\bar{x} - \mu) \\
\dot{\ell}_B(\mu) &= -\frac{n}{2\mu} + \frac{10}{2}\sum_i \left(\frac{x_i}{\mu}\right)^2 - 5n
\end{aligned}$$
and the curvatures
$$\begin{aligned}
\ddot{\ell}_A(\mu) &= -n \\
\ddot{\ell}_B(\mu) &= \frac{n}{2\mu^2} - \frac{10}{2}\sum_i \frac{2 x_i^2}{\mu^3}
\end{aligned}$$
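This kind of algebra is easy to get wrong, so here is a sketch that checks the analytic score and curvature of model B against numerical difference quotients of its loglikelihood on simulated data (the function names are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
x = rng.normal(10, 1, size=n)         # one sample from the true N(10, 1) law

def loglik_B(mu):                     # loglikelihood of model B, as above
    return -n/2*np.log(2*np.pi) - n/2*np.log(mu/10) - 10/2*np.sum((x - mu)**2)/mu

def score_B(mu):                      # analytic first derivative from above
    return -n/(2*mu) + 10/2*np.sum((x/mu)**2) - 5*n

def curv_B(mu):                       # analytic second derivative from above
    return n/(2*mu**2) - 10/2*np.sum(2*x**2/mu**3)

mu, h = 9.0, 1e-4                     # an arbitrary parameter value to check at
num_score = (loglik_B(mu + h) - loglik_B(mu - h)) / (2*h)
num_curv  = (loglik_B(mu + h) - 2*loglik_B(mu) + loglik_B(mu - h)) / h**2
print(num_score, score_B(mu))         # the two should agree closely
print(num_curv,  curv_B(mu))          # likewise
```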
so the Fisher information really does depend on the imbedding. Now we calculate the Fisher information at the true value $\mu = 10$:
$$I_A(\mu=10) = n, \qquad I_B(\mu=10) = n \cdot \left( \frac{2020}{2000} - \frac{1}{200} \right) = \frac{201}{200}\, n > n$$
(using that $\operatorname{E}_{\mu=10} X_i^2 = \operatorname{Var} X_i + (\operatorname{E} X_i)^2 = 1 + 100 = 101$), so the Fisher information about the parameter is somewhat larger in model B.
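These values can also be checked by simulation under the true $N(10,1)$ distribution: the variance of the score at $\mu = 10$, and the negative mean curvature, both come out near $\frac{201}{200} n$ for model B, against $n$ for model A. A minimal sketch (again assuming only numpy):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, mu = 20, 200_000, 10.0
x = rng.normal(10, 1, size=(reps, n))   # data from the true N(10, 1) distribution

score_A = n * (x.mean(axis=1) - mu)
score_B = -n/(2*mu) + 10/2 * np.sum((x/mu)**2, axis=1) - 5*n
curv_B  = n/(2*mu**2) - 10/2 * np.sum(2*x**2/mu**3, axis=1)

print(score_A.var())     # ~ I_A(10) = n = 20
print(score_B.var())     # ~ I_B(10) = 201/200 * n = 20.1
print(-curv_B.mean())    # ~ 20.1 as well, illustrating I = -E[curvature] again
```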
This illustrates that, in some sense, the Fisher information tells us how fast the information from the data about the parameter would have changed if the governing parameter had changed in the way postulated by the imbedding in a model family. The explanation of the higher information in model B is that model family B postulates that if the expectation had increased, then the variance would have increased too. So, under model B, the sample variance also carries information about $\mu$, which it does not under model A; a small simulation sketch follows below.
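One way to see the practical effect is to compare the two mle's by simulation. Under model A the mle is $\bar{x}$; under model B, solving the likelihood equation $\dot{\ell}_B(\mu) = 0$ leads (after multiplying by $\mu^2/n$) to the quadratic $5\mu^2 + \mu/2 - 5\,\overline{x^2} = 0$, whose positive root is the mle. That closed form is my own rearrangement, used here only for the sketch, and the difference in precision is small (about half a percent), so many replications are needed to see it:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 20, 200_000
x = rng.normal(10, 1, size=(reps, n))   # samples from the true N(10, 1) law

mle_A = x.mean(axis=1)                  # mle under model A: the sample mean
m2 = np.mean(x**2, axis=1)              # sample second moment, used by model B
mle_B = (-0.5 + np.sqrt(0.25 + 100*m2)) / 10   # positive root of the quadratic

print(n * mle_A.var())   # ~ 1       = 1/I_A per observation
print(n * mle_B.var())   # ~ 0.995 ~ 200/201, slightly smaller (asymptotically)
```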
Also, this example illustrates that we really do need some theory to help us construct model families.