A Summary of the Maximum Likelihood Estimator

Why learn MLE?

The drawback of the least squares estimator

A general method of building a predictive model starts with least squares estimation. We then work on the residuals: find confidence intervals for the parameters and test how well the model fits the data, both of which rely on the assumption that the residuals (or noises) are normally distributed. Unfortunately, that assumption is not guaranteed. Often the residual plot looks like some other distribution rather than the normal.

At that point you could add one more factor to your model to absorb the non-normal noise and compute the LSE again, but you may well run into the same problem. Alternatively, if you can recognize the distribution from the plot (or you otherwise know the pdf of the noise), you can directly compute the MLE of your model's parameters, and your work is genuinely finished. A quick diagnostic for the first step is sketched below.
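To illustrate the diagnostic step, here is a minimal sketch (with hypothetical simulated data: a linear trend plus heavy-tailed Student-t noise, chosen purely for illustration) that fits a line by least squares and then tests the residuals for normality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data: a linear trend plus heavy-tailed (non-normal) noise.
x = rng.uniform(0, 10, 300)
y = 1.0 + 2.0 * x + rng.standard_t(df=2, size=300)

# Ordinary least squares fit, then a normality test on the residuals.
slope, intercept, *_ = stats.linregress(x, y)
residuals = y - (intercept + slope * x)
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.4f}")  # a small p-value flags non-normal residuals
```

A small p-value (or a residual histogram with visibly heavy tails or skew) signals that the normality assumption behind the usual least squares inference fails, which is exactly the situation where the MLE with the correct noise pdf helps.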

The procedure for calculating the MLE

Given

$\{Y_i\}$: training data of the response variable; $\{X_i\}$: training data of the predictor variable;

$f(X_i; \theta)$: the predictive model; $\epsilon_i$: the residual of the model, which has pdf $g$;

$Y_i = f(X_i; \theta) + \epsilon_i, \quad i = 1, \dots, n.$

Then the likelihood function of the sample is:

$L(\theta) = \prod_{i=1}^{n} g\big(Y_i - f(X_i; \theta)\big).$

And its log-likelihood function is:

$\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln g\big(Y_i - f(X_i; \theta)\big).$

Then the MLE of $\theta$ is the value $\hat{\theta}$ that maximizes $\ell(\theta)$, obtained by solving the system of equations:

$\dfrac{\partial \ell(\theta)}{\partial \theta_j} = 0, \quad j = 1, \dots, p.$

And $\hat{\theta}$ should satisfy the Hessian criterion: the Hessian matrix of $\ell$ at $\hat{\theta}$ must be negative definite, so that the stationary point is indeed a maximum. In practice the system rarely has a closed-form solution, so the maximization is done numerically, as in the sketch below.
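To make the procedure concrete, here is a minimal numerical sketch, assuming a linear model $f(x; \theta) = \theta_0 + \theta_1 x$ and Laplace-distributed noise (both hypothetical choices for illustration), that maximizes the log-likelihood with scipy:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulated data for a hypothetical linear model with Laplace (non-normal) noise.
n = 500
x = rng.uniform(0, 10, n)
y = 2.0 + 3.0 * x + rng.laplace(scale=1.5, size=n)

def neg_log_likelihood(params, x, y):
    """Negative log-likelihood of Y = t0 + t1*X + eps, eps ~ Laplace(0, b)."""
    t0, t1, log_b = params        # optimize log(b) so the scale stays positive
    b = np.exp(log_b)
    resid = y - (t0 + t1 * x)
    # Laplace log-pdf: -ln(2b) - |resid|/b; negate and sum over the sample
    return np.sum(np.log(2 * b) + np.abs(resid) / b)

# Minimizing the negative log-likelihood is the same as maximizing the likelihood.
result = minimize(neg_log_likelihood, x0=[0.0, 1.0, 0.0],
                  args=(x, y), method="Nelder-Mead")
t0_hat, t1_hat, log_b_hat = result.x
print(t0_hat, t1_hat, np.exp(log_b_hat))  # should land near 2.0, 3.0, 1.5
```

Nelder-Mead is used because the Laplace log-likelihood contains $|\cdot|$ and is not differentiable everywhere; with a smooth noise pdf, a gradient-based method would work as well.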

Why is the MLE fit for Big Data analysis?

Regularity conditions:

1. The first three derivatives of the log-density $\ln g\big(Y_i - f(X_i; \theta)\big)$ with respect to $\theta$ should be continuous and finite.

2. The support of the density (the region over which it integrates to one) should not depend on the parameters.

With these assumptions, the MLE is well suited to Big Data analysis because of its asymptotic properties:

The consistency property

Let $\hat{\theta}$ be the MLE of $\theta$; then $\hat{\theta} \xrightarrow{p} \theta$ as $n \to \infty$, i.e., the MLE converges in probability to the true parameter value.
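As a quick sanity check of consistency, here is a minimal simulation, assuming Exponential($\lambda$) data (a hypothetical choice whose rate MLE has the closed form $\hat{\lambda} = 1/\bar{Y}$), showing the estimate tightening around the true value as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
true_rate = 2.0  # Exponential(rate) data; the MLE of the rate is 1 / sample mean

for n in (100, 10_000, 1_000_000):
    sample = rng.exponential(scale=1 / true_rate, size=n)
    rate_hat = 1 / sample.mean()  # closed-form MLE of the rate
    print(n, rate_hat)            # drifts toward the true value 2.0 as n grows
```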

Asymptotic normality

Let $\hat{\theta}_j$ be the MLE of $\theta_j$, one of the entries of $\theta$; then:

$\sqrt{n}\,(\hat{\theta}_j - \theta_j)$ converges in distribution to a normal distribution $N\big(0, I(\theta_j)^{-1}\big)$, where $I(\theta_j)$ is the Fisher information of $\theta_j$ and its formula is:

$I(\theta_j) = E\!\left[\left(\dfrac{\partial}{\partial \theta_j} \ln g\big(Y - f(X; \theta)\big)\right)^{\!2}\right] = -E\!\left[\dfrac{\partial^2}{\partial \theta_j^2} \ln g\big(Y - f(X; \theta)\big)\right].$
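As a concrete check of this formula, take the simplest case, assuming $Y_i \sim N(\mu, \sigma^2)$ with $\sigma$ known, so the model is just the constant $f(X_i; \theta) = \mu$:

$\ln g(Y - \mu) = -\tfrac{1}{2}\ln(2\pi\sigma^2) - \dfrac{(Y - \mu)^2}{2\sigma^2}, \qquad I(\mu) = -E\!\left[\dfrac{\partial^2}{\partial \mu^2}\ln g(Y - \mu)\right] = \dfrac{1}{\sigma^2}.$

So $\sqrt{n}\,(\hat{\mu} - \mu)$ converges to $N(0, \sigma^2)$, recovering the familiar fact that the sample mean $\hat{\mu} = \bar{Y}$ has variance $\sigma^2/n$; this variance also attains the Cramér-Rao bound mentioned below.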

Asymptotic efficiency

By asymptotically efficient, we mean an estimator that has the two properties above and whose asymptotic variance (the variance of its asymptotic normal distribution) equals the Cramér-Rao lower bound, i.e., no consistent estimator can have a smaller asymptotic variance.

Invariance property

This property is often stated ambiguously. It actually means that the MLE is preserved under transformations: if $\hat{\theta}$ is the MLE of $\theta$, then $h(\hat{\theta})$ is the MLE of $h(\theta)$ for any function $h$. For example, if $\hat{\sigma}^2$ is the MLE of the variance, then $\sqrt{\hat{\sigma}^2}$ is the MLE of the standard deviation.

Conclusion

The main feature of Big Data is that the sample size is extraordinarily large, which means $n$ is large enough for the asymptotic properties to apply. By consistency, the discrepancy of the MLE from the true value is then very small. If we also find that the asymptotic variance is acceptably small, the MLE is a good estimator for the model.

For the proofs of the first two properties, you can refer to the following link: