Let us try to understand why MLEs are 'good': intuitively, as we collect more and more data, we should be able to uncover the true unknown parameter. We will repeatedly use two basic facts from probability.
Law of Large Numbers (LLN):
If the distribution of the i.i.d. sample $X_1, \ldots, X_n$ is such that $X_1$ has a finite expectation, i.e. $\mathbb{E}|X_1| < \infty$, then the sample average
$$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n}$$
converges to its expectation $\mathbb{E}X_1$ in probability, which means that for any arbitrarily small $\varepsilon > 0$,
$$\mathbb{P}\bigl(|\bar{X}_n - \mathbb{E}X_1| > \varepsilon\bigr) \to 0 \quad \text{as } n \to \infty.$$
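As a quick numerical illustration, the following Python sketch draws i.i.d. samples of increasing size and shows the sample average approaching the expectation; the Exponential distribution with mean $1$ is an arbitrary illustrative choice, not one used elsewhere in these notes.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: X_i ~ Exponential(1), so E[X_1] = 1.
for n in [10, 100, 10_000, 1_000_000]:
    sample = rng.exponential(scale=1.0, size=n)
    print(f"n = {n:>9}: sample average = {sample.mean():.4f}  (expectation = 1.0)")
\end{verbatim}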
Note. Whenever we use the LLN below, we will simply say that the average converges to its expectation, without mentioning in which sense. More mathematically inclined readers are welcome to carry out these steps more rigorously, especially when we use the LLN in combination with the Central Limit Theorem.
Central Limit Theorem (CLT):
If the distribution of the i.i.d. sample $X_1, \ldots, X_n$ is such that $X_1$ has finite expectation and variance, i.e. $\mathbb{E}|X_1| < \infty$ and $\sigma^2 = \mathrm{Var}(X_1) < \infty$, then
$$\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1)$$
converges in distribution to a normal distribution with zero mean and variance $\sigma^2$, which means that for any interval $[a, b]$,
$$\mathbb{P}\bigl(\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1) \in [a, b]\bigr) \to \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}\, dx.$$
In other words, the random variable $\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1)$ will behave like a random variable drawn from the distribution $N(0, \sigma^2)$ when $n$ gets large.
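The CLT can also be checked by simulation. The Python sketch below (again with the illustrative Exponential(1) distribution, for which $\sigma^2 = 1$) repeats the experiment many times and compares the empirical probability that $\sqrt{n}\,(\bar{X}_n - \mathbb{E}X_1)$ falls in $[-1, 1]$ with the corresponding normal probability.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: X_i ~ Exponential(1), so E[X_1] = 1 and Var(X_1) = 1.
n, repetitions = 500, 10_000
samples = rng.exponential(scale=1.0, size=(repetitions, n))
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0)  # sqrt(n) * (sample average - expectation)

a, b = -1.0, 1.0
empirical = np.mean((z >= a) & (z <= b))
print(f"empirical P(a <= z <= b) = {empirical:.3f}")  # close to 0.683 for N(0, 1)
\end{verbatim}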
We will show that the MLE usually satisfies the following two properties, called consistency and asymptotic normality.
1. Consistency. We say that an estimate $\hat{\theta}$ is consistent if $\hat{\theta} \to \theta_0$ in probability as $n \to \infty$, where $\theta_0$ is the 'true' unknown parameter of the distribution of the sample.
2. Asymptotic Normality. We say that $\hat{\theta}$ is asymptotically normal if
$$\sqrt{n}\,(\hat{\theta} - \theta_0) \to N\bigl(0, \sigma^2_{\theta_0}\bigr) \quad \text{in distribution},$$
where $\sigma^2_{\theta_0}$ is called the asymptotic variance of the estimate $\hat{\theta}$. Asymptotic normality says that the estimator not only converges to the unknown parameter but also converges fast enough, at the rate $1/\sqrt{n}$.
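As a standard illustration of these two definitions (not specific to the MLE), the sample mean $\hat{\mu} = \bar{X}_n$ as an estimate of $\mu = \mathbb{E}X_1$ is consistent by the LLN and asymptotically normal by the CLT:
$$\hat{\mu} = \bar{X}_n \to \mu \ \text{ in probability}, \qquad \sqrt{n}\,(\hat{\mu} - \mu) \to N(0, \sigma^2),$$
so in this case the asymptotic variance is $\sigma^2_{\theta_0} = \mathrm{Var}(X_1)$.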
Consistency of MLE:
Suppose that the data $X_1, \ldots, X_n$ is generated from a distribution with unknown parameter $\theta_0$, and let $\hat{\theta}$ be the MLE. Why does $\hat{\theta}$ converge to the unknown parameter $\theta_0$? This is not immediately obvious, and we will give a sketch of why this happens.
First of all, the MLE $\hat{\theta}$ is the maximizer of
$$L_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(X_i|\theta),$$
which is the log-likelihood function normalized by $\frac{1}{n}$ (the normalization does not affect the maximization). Notice that the function $L_n(\theta)$ depends on the data. Let us consider the function $l(X|\theta) = \log f(X|\theta)$ and define
$$L(\theta) = \mathbb{E}_{\theta_0}\, l(X|\theta),$$
where $\mathbb{E}_{\theta_0}$ denotes the expectation with respect to the true unknown parameter $\theta_0$ of the sample $X_1, \ldots, X_n$.
If we deal with continuous distributions then
$$L(\theta) = \int \log f(x|\theta)\, f(x|\theta_0)\, dx.$$
By the law of large numbers, for any $\theta$,
$$L_n(\theta) \to \mathbb{E}_{\theta_0}\, l(X|\theta) = L(\theta).$$
Note that $L(\theta)$ does not depend on the sample; it depends only on $\theta$. We will need the following.
Lemma. We have that for any $\theta$, $L(\theta) \le L(\theta_0)$. Moreover, the inequality is strict, $L(\theta) < L(\theta_0)$, unless $\mathbb{P}_{\theta_0}\bigl(f(X|\theta) = f(X|\theta_0)\bigr) = 1$, which means that $\mathbb{P}_{\theta} = \mathbb{P}_{\theta_0}$.
Proof. Let us consider the difference
$$L(\theta) - L(\theta_0) = \mathbb{E}_{\theta_0}\bigl(\log f(X|\theta) - \log f(X|\theta_0)\bigr) = \mathbb{E}_{\theta_0}\, \log \frac{f(X|\theta)}{f(X|\theta_0)}.$$
Since $\log t \le t - 1$, we can write
$$\mathbb{E}_{\theta_0}\, \log \frac{f(X|\theta)}{f(X|\theta_0)} \le \mathbb{E}_{\theta_0}\Bigl(\frac{f(X|\theta)}{f(X|\theta_0)} - 1\Bigr) = \int \frac{f(x|\theta)}{f(x|\theta_0)}\, f(x|\theta_0)\, dx - \int f(x|\theta_0)\, dx = \int f(x|\theta)\, dx - \int f(x|\theta_0)\, dx = 1 - 1 = 0.$$
Both integrals are equal to $1$ because we are integrating probability density functions. This proves that $L(\theta) - L(\theta_0) \le 0$. The second statement of the Lemma is also clear, since the inequality $\log t \le t - 1$ is strict unless $t = 1$, i.e. unless $f(X|\theta) = f(X|\theta_0)$. We will use this Lemma to sketch the consistency of the MLE.
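The Lemma can also be checked numerically. In the following Python sketch, the Exponential family $f(x|\theta) = \theta e^{-\theta x}$ with $\theta_0 = 2$ is an illustrative choice (not a family used in these notes); $L(\theta)$ is estimated by Monte Carlo and is indeed largest at $\theta_0$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Illustrative family: f(x|theta) = theta * exp(-theta * x), true parameter theta_0 = 2.
theta_0 = 2.0
x = rng.exponential(scale=1.0 / theta_0, size=1_000_000)  # sample from f(.|theta_0)

def L(theta):
    # L(theta) = E_{theta_0} log f(X|theta), estimated by averaging over the sample
    return np.mean(np.log(theta) - theta * x)

for theta in [0.5, 1.0, 2.0, 3.0, 5.0]:
    print(f"theta = {theta}: L(theta) is approximately {L(theta):.4f}")
# The largest value appears at theta = theta_0 = 2, in agreement with the Lemma.
\end{verbatim}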
Theorem. Under some regularity conditions on the family of distributions, the MLE $\hat{\theta}$ is consistent, i.e. $\hat{\theta} \to \theta_0$ as $n \to \infty$.
Proof. We have the following facts:
1) $\hat{\theta}$ is the maximizer of $L_n(\theta)$ (by definition of the MLE);
2) $\theta_0$ is the maximizer of $L(\theta)$ (by the Lemma);
3) for any $\theta$ we have $L_n(\theta) \to L(\theta)$ by the LLN.
Therefore, since the functions $L_n$ and $L$ become close to each other, their points of maximum should also become close, which is exactly the statement $\hat{\theta} \to \theta_0$.
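A small simulation makes this argument concrete. In the Python sketch below (using the same illustrative Exponential family with $\theta_0 = 2$), the maximizer of $L_n(\theta)$ over a grid approaches $\theta_0$ as $n$ grows.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Illustrative family: log f(x|theta) = log(theta) - theta * x, true parameter theta_0 = 2.
theta_0 = 2.0
theta_grid = np.linspace(0.1, 6.0, 1000)

for n in [10, 100, 10_000]:
    x = rng.exponential(scale=1.0 / theta_0, size=n)
    # L_n(theta) = (1/n) * sum_i log f(X_i|theta) = log(theta) - theta * mean(X)
    L_n = np.log(theta_grid) - theta_grid * x.mean()
    mle = theta_grid[np.argmax(L_n)]
    print(f"n = {n:>6}: maximizer of L_n is approximately {mle:.3f}  (theta_0 = 2.0)")
\end{verbatim}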
Asymptotic normality of MLE, Fisher information.
We want to show the asymptotic normality of the MLE, i.e. to show that
$$\sqrt{n}\,(\hat{\theta} - \theta_0) \to N\bigl(0, \sigma^2_{\mathrm{MLE}}\bigr)$$
for some $\sigma^2_{\mathrm{MLE}}$, and to compute this variance.
This asymptotic variance in some sense measures the quality of the MLE. First, we need to introduce the notion of Fisher information.
Let us recall that above we defined the function $l(X|\theta) = \log f(X|\theta)$. To simplify the notation we will denote by $l'(X|\theta)$, $l''(X|\theta)$, etc. the derivatives of $l(X|\theta)$ with respect to $\theta$.
Definition. (Fisher Information) Fisher information of a random variable $X$ with distribution $\mathbb{P}_{\theta_0}$ from the family $\{\mathbb{P}_\theta : \theta \in \Theta\}$ is defined by
$$I(\theta_0) = \mathbb{E}_{\theta_0}\bigl(l'(X|\theta_0)\bigr)^2 = \mathbb{E}_{\theta_0}\Bigl(\frac{\partial}{\partial\theta}\log f(X|\theta)\Big|_{\theta=\theta_0}\Bigr)^2.$$
Under the usual regularity conditions the Fisher information can equivalently be computed as
$$I(\theta_0) = -\mathbb{E}_{\theta_0}\, l''(X|\theta_0),$$
and this second formula is often more convenient; it is the one used in the example below.
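The two expressions for the Fisher information can be compared numerically. In the Python sketch below, the illustrative Exponential family is used again; there $l'(x|\theta) = 1/\theta - x$ and $l''(x|\theta) = -1/\theta^2$, so $I(\theta) = 1/\theta^2$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Illustrative family: f(x|theta) = theta * exp(-theta * x), so I(theta) = 1 / theta^2.
theta_0 = 2.0
x = rng.exponential(scale=1.0 / theta_0, size=1_000_000)

I_score = np.mean((1.0 / theta_0 - x) ** 2)   # E_{theta_0} (l'(X|theta_0))^2, Monte Carlo
I_second = 1.0 / theta_0 ** 2                 # -E_{theta_0} l''(X|theta_0), constant here
print(I_score, I_second)                      # both approximately 0.25
\end{verbatim}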
Theorem. (Asymptotic normality of MLE) We have,
$$\sqrt{n}\,(\hat{\theta} - \theta_0) \to N\Bigl(0, \frac{1}{I(\theta_0)}\Bigr).$$
Example. The family of Bernoulli distributions $B(p)$ has p.f.
$$f(x|p) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\},$$
and taking the logarithm,
$$\log f(x|p) = x \log p + (1-x) \log(1-p).$$
The first and second derivatives with respect to the parameter $p$ are
$$\frac{\partial}{\partial p} \log f(x|p) = \frac{x}{p} - \frac{1-x}{1-p}, \qquad \frac{\partial^2}{\partial p^2} \log f(x|p) = -\frac{x}{p^2} - \frac{1-x}{(1-p)^2}.$$
Then the Fisher information can be computed as
$$I(p) = -\mathbb{E}\,\frac{\partial^2}{\partial p^2} \log f(X|p) = \frac{\mathbb{E}X}{p^2} + \frac{1 - \mathbb{E}X}{(1-p)^2} = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}.$$
The MLE of $p$ is $\hat{p} = \bar{X}$ and the asymptotic normality result states that
$$\sqrt{n}\,(\hat{p} - p_0) \to N\bigl(0, p_0(1 - p_0)\bigr),$$
which, of course, also follows directly from the CLT.
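As a sanity check of this example, the following Python sketch estimates the variance of $\sqrt{n}\,(\hat{p} - p_0)$ by simulation (the values $p_0 = 0.3$ and $n = 1000$ are arbitrary illustrative choices) and compares it with $p_0(1 - p_0) = 1/I(p_0)$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

p_0, n, repetitions = 0.3, 1_000, 10_000
samples = rng.binomial(1, p_0, size=(repetitions, n))
p_hat = samples.mean(axis=1)            # MLE of p in each repetition
z = np.sqrt(n) * (p_hat - p_0)

print(f"empirical variance of sqrt(n) * (p_hat - p_0): {z.var():.4f}")
print(f"p_0 * (1 - p_0)                              : {p_0 * (1 - p_0):.4f}")
\end{verbatim}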