May 30, 2023
We have for an estimator $\hat\theta$ of a parameter $\theta$ its mean squared error (MSE):
$$
\mathrm{MSE}(\hat\theta) = \mathbb{E}\big[(\hat\theta - \theta)^2\big] = \operatorname{Var}(\hat\theta) + b^2,
$$
with bias $b = \mathbb{E}[\hat\theta] - \theta$. This is, of course, slightly counterintuitive: the random spread of the estimator never cancels against the bias, even when individual fluctuations happen to push in the opposite direction; the two contributions always add.
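For completeness, the decomposition follows by adding and subtracting $\mathbb{E}[\hat\theta]$ inside the square (a standard one-line derivation, restated here):
$$
\mathbb{E}\big[(\hat\theta - \theta)^2\big]
= \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big]
+ 2\,\underbrace{\mathbb{E}\big[\hat\theta - \mathbb{E}[\hat\theta]\big]}_{=\,0}\big(\mathbb{E}[\hat\theta] - \theta\big)
+ \big(\mathbb{E}[\hat\theta] - \theta\big)^2
= \operatorname{Var}(\hat\theta) + b^2,
$$
the cross term vanishing because $\hat\theta$ deviates from its own mean by zero in expectation.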
As we will see later, if the MSE vanishes as the number of samples increases, then the estimator is necessarily consistent.
Roughly, estimators are described by two orthogonal pairs of adjectives: biased versus unbiased (whether $\mathbb{E}[\hat\theta] = \theta$), and consistent versus inconsistent. That is, with more samples, a consistent estimator gets arbitrarily close to the true value:
$$
\lim_{n \to \infty} P\big(|\hat\theta_n - \theta| > \epsilon\big) = 0 \quad \text{for every } \epsilon > 0.
$$
This limit is the definition of convergence in probability (convergence to a constant), as opposed to the weaker convergence in distribution, where the limit is another random variable.
Usually, we may determine bias simply by evaluating $\mathbb{E}[\hat\theta]$ in terms of the parameter we care about. Alternatively, as you will see later, properties of the MMSE estimator may make it easier to reason about the bias of other estimators as well.
Thus we have four examples of different types of estimators. For the population, choose a distribution with known mean $\mu$. The following examples estimate $\mu$ based on a set of samples $X_1, \dots, X_n$ drawn from it.
Biased and inconsistent: The mean of the population is surely $\mu$. We simply choose the constant estimator $\hat\mu = c$ for some $c \neq \mu$, which is a point distribution at the point $c$.
Unbiased and consistent (a good estimator): We simply choose the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.
Unbiased and inconsistent: An unbiased estimator is easy to construct: a distribution with 50% of its mass at $\mu - a$ and 50% at $\mu + a$, for some fixed $a > 0$, will do (its expectation is exactly $\mu$). For any $\epsilon < a$, the probability that the estimate deviates from $\mu$ by more than $\epsilon$ is exactly $1$ regardless of the sample size, so it is inconsistent.
Biased and consistent: This is perhaps the most troubling combination of adjectives. Let us consider an estimator for $\theta$, the upper bound of a population with the continuous distribution $U(0, \theta)$. We take
$$
\hat\theta_n = \max_{1 \le i \le n} X_i = X_{(n)}.
$$
Notice this is simply the distribution of the $n$-th order statistic of $n$ samples. We recall from order statistics that
$$
f_{X_{(n)}}(x) = \frac{n\,x^{n-1}}{\theta^{n}}, \qquad 0 \le x \le \theta.
$$
In fact, this distribution is identical to the $\mathrm{Beta}(n, 1)$ distribution, scaled up $\theta$ times to cover the entire support $[0, \theta]$ of the order statistic.
Then, the mean of the estimator distribution is proportional to the mean of the $\mathrm{Beta}(n, 1)$ distribution, at
$$
\mathbb{E}[\hat\theta_n] = \frac{n}{n+1}\,\theta.
$$
Of course, this is biased: intuitively, the estimator always approaches the true parameter from below. Alternatively, the true parameter must obviously be at least as large as the $n$-th order statistic, so taking exactly that order statistic as the estimate is almost certainly an underestimate, so long as there is any probability mass anywhere below $\theta$.
This approach comes arbitrarily close to the true parameter, however, and to see that we need two facts: first, $\hat\theta_n$ never exceeds $\theta$, so $P(\hat\theta_n > \theta) = 0$; second, for any $\epsilon > 0$,
$$
P\big(\hat\theta_n < \theta - \epsilon\big) = \left(\frac{\theta - \epsilon}{\theta}\right)^{n} \xrightarrow{\;n \to \infty\;} 0.
$$
Hence, it is true that $\hat\theta_n$ converges in probability to $\theta$, and indeed, this estimator is consistent.
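To see this behaviour numerically, here is a small Python sketch of my own (the constants $\theta = 1$, $\epsilon = 0.05$, the sample sizes, and the trial count are arbitrary illustrative choices, not from the text above) that measures the empirical bias and the tail probability of the max estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, eps, trials = 1.0, 0.05, 20_000

for n in (5, 50, 500):
    # n i.i.d. Uniform(0, theta) draws per trial; the estimator is the sample maximum
    samples = rng.uniform(0.0, theta, size=(trials, n))
    est = samples.max(axis=1)

    bias = est.mean() - theta                  # should be close to -theta/(n+1)
    tail = np.mean(np.abs(est - theta) > eps)  # empirical P(|est - theta| > eps)
    print(f"n={n:4d}  bias={bias:+.4f}  P(|err|>eps)={tail:.3f}")
```

The bias stays negative for every $n$ (the estimator is biased at each sample size), while the tail probability falls toward $0$, which is precisely the biased-but-consistent behaviour described above.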
Intuitively, unbiasedness is quite strong: it requires that the estimator be unbiased for any number of samples, while consistency only requires convergence as $n \to \infty$.
Furthermore, notably, the convergence of the MSE of the estimator to $0$ as $n \to \infty$ also proves consistency. This is because the bias and the variance must both go to $0$ (the speed is irrelevant). This is nearly a tautology.[2]
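To spell out why a vanishing MSE forces consistency (a standard argument via Markov's inequality applied to the squared error):
$$
P\big(|\hat\theta_n - \theta| \ge \epsilon\big)
= P\big((\hat\theta_n - \theta)^2 \ge \epsilon^2\big)
\le \frac{\mathbb{E}\big[(\hat\theta_n - \theta)^2\big]}{\epsilon^2}
= \frac{\mathrm{MSE}(\hat\theta_n)}{\epsilon^2},
$$
so if the MSE tends to $0$, the probability on the left does too, for every $\epsilon > 0$.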
In my studies I have seen no more than four types of estimators commonly employed. Consider observed data $X = (X_1, \dots, X_n)$ and an estimator $\hat\theta$ of a parameter $\theta$ for the following discussions.
We have for some observations $x$ the likelihood $P(x \mid \theta)$ of generating such data based on some parameter $\theta$. Then, the following function (over various values of $\theta$)
$$
L(\theta) = P(x \mid \theta)
$$
must have a maximum at $\hat\theta_{\mathrm{ML}} = \arg\max_\theta L(\theta)$, which maximizes the likelihood of seeing $x$ under that parameter value. This is the maximum likelihood (ML) estimator.
To compute the ML estimator, one usually takes the likelihood function and maximizes it via derivatives. To make this easier, sometimes the log-likelihood $\ell(\theta) = \log L(\theta)$ is used, as $\log$ is monotonic.
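As a small worked instance of this recipe (my own illustration, using an exponential population that does not appear elsewhere in these notes): for i.i.d. samples $x_1, \dots, x_n$ from an exponential distribution with rate $\lambda$,
$$
\ell(\lambda) = \log \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = n \log \lambda - \lambda \sum_{i=1}^{n} x_i,
\qquad
\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
\;\Longrightarrow\;
\hat\lambda_{\mathrm{ML}} = \frac{1}{\bar{x}},
$$
and the second derivative $-n/\lambda^2 < 0$ confirms this critical point is a maximum.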
The MoM estimators are straightforward. We have, of course, the $k$-th moments of the population $X$:
$$
\mu_k = \mathbb{E}\big[X^k\big],
$$
but we also have the moments of the random sample:
$$
m_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k.
$$
We may equate the two moments with each other, for some values of $k$, and solve for the parameter in question. Notably, we begin with $k = 1$ and use increasingly larger values of $k$ as we need to estimate more parameters.
Alternatively, we may utilize the moments around the mean, for $k \ge 2$:
$$
\mathbb{E}\big[(X - \mu_1)^k\big] \quad \text{equated with} \quad \frac{1}{n} \sum_{i=1}^{n} \big(X_i - \bar{X}_n\big)^k,
$$
which yields different estimators. The properties of these estimators need to be analyzed like any other.
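For instance (an illustration of my own choosing, using the $U(0, \theta)$ population from the earlier example): the first population moment is $\mathbb{E}[X] = \theta/2$, so equating it with the first sample moment gives
$$
\frac{\theta}{2} = \frac{1}{n} \sum_{i=1}^{n} X_i
\;\Longrightarrow\;
\hat\theta_{\mathrm{MoM}} = 2\,\bar{X}_n,
$$
an unbiased estimator that is generally different from the ML estimator $X_{(n)}$ discussed below, and can even be smaller than the largest observed sample.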
The MAP estimator is closely related to the ML estimator. Notably, for data $x$, it is the value $\hat\theta_{\mathrm{MAP}}$ which maximizes the posterior
$$
P(\theta \mid x) = \frac{P(x \mid \theta)\,P(\theta)}{P(x)},
$$
which is the same value of $\theta$ which maximizes
$$
P(x \mid \theta)\,P(\theta),
$$
since $P(x)$ does not depend on $\theta$. Should the a priori distribution $P(\theta)$ be uniform, then the ML estimator, which maximizes $P(x \mid \theta)$, is necessarily the same as the MAP estimator. Thus, MAP can be seen as a generalization of ML.
For example, in estimating the bias $\theta$ of a coin given $n$ flips of which $k$ resulted in heads, the ML estimator is obviously $\hat\theta_{\mathrm{ML}} = k/n$, and because we have no prior information, the reasonable conjugate prior to choose here is a Beta distribution, specifically $\mathrm{Beta}(1,1)$, which is exactly the uniform prior. As such, $\hat\theta_{\mathrm{MAP}} = k/n$ as well.
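Spelling the coin computation out (a standard conjugate-prior calculation, added here as an illustration): with $k$ heads in $n$ flips and a $\mathrm{Beta}(a, b)$ prior on $\theta$, the posterior is
$$
P(\theta \mid k) \propto \theta^{k}(1-\theta)^{n-k} \cdot \theta^{a-1}(1-\theta)^{b-1}
= \theta^{k+a-1}(1-\theta)^{n-k+b-1},
$$
that is, $\mathrm{Beta}(k+a,\, n-k+b)$, with mode $\frac{k+a-1}{n+a+b-2}$. For the uniform prior $a = b = 1$, the mode is $k/n$, matching the ML estimator as claimed.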
Consider again the posterior distribution $P(\theta \mid x)$, defined over the support of $\theta$: instead of choosing the mode of this distribution, which gives the MAP estimator, we choose the mean. This gives the MMSE estimator:
$$
\hat\theta_{\mathrm{MMSE}} = \mathbb{E}[\theta \mid X = x].
$$
Very conveniently, this estimator minimizes the mean squared error from the real parameter $\theta$:
$$
\hat\theta_{\mathrm{MMSE}}(x) = \arg\min_{g(x)}\ \mathbb{E}\big[(\theta - g(X))^2\big].
$$
This may be verified directly via calculus and the use of the law of iterated expectations (LIE) near the end: for a fixed observation $X = x$,
$$
\frac{d}{dg}\,\mathbb{E}\big[(\theta - g)^2 \mid X = x\big] = -2\,\mathbb{E}[\theta \mid X = x] + 2g = 0
\;\Longrightarrow\; g(x) = \mathbb{E}[\theta \mid X = x],
$$
and since, by the LIE,
$$
\mathbb{E}\big[(\theta - g(X))^2\big] = \mathbb{E}\Big[\,\mathbb{E}\big[(\theta - g(X))^2 \mid X\big]\Big],
$$
minimizing the inner conditional expectation at every $x$ minimizes the overall mean squared error.
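Continuing the same coin illustration: with the uniform $\mathrm{Beta}(1,1)$ prior, the posterior is $\mathrm{Beta}(k+1,\, n-k+1)$, so instead of its mode we now take its mean,
$$
\hat\theta_{\mathrm{MMSE}} = \mathbb{E}[\theta \mid k] = \frac{k+1}{n+2},
$$
which differs from the MAP/ML value $k/n$, most visibly for small $n$ (this is Laplace's rule of succession).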
We have, for some population $U(0, \theta)$, data generated as samples $X_1, \dots, X_n$. The estimator $\hat\theta_n = X_{(n)} = \max_i X_i$ is trivially the ML estimator. That is, the likelihood of achieving samples $x_1, \dots, x_n$ is
$$
L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\,\mathbf{1}[x_i \le \theta] = \theta^{-n}\,\mathbf{1}\big[\theta \ge \max_i x_i\big],
$$
which is maximized when $\theta$ is minimized; but, considering the lower bound on $\theta$ required for the likelihood to be nonzero, we must have $\hat\theta_{\mathrm{ML}} = \max_i x_i$.
It is straightforward that this is a biased estimator, since the $n$-th order statistic of a set of $n$ numbers chosen from a uniform population on $[0, \theta]$ has mean $\frac{n}{n+1}\theta$. In fact, this means that $\frac{n+1}{n} X_{(n)}$ should be an unbiased estimator. We now claim that this is also the MMSE estimator.
Before moving forward, let us also establish its consistency. We see that $X_{(n)}/\theta$ is Beta-distributed, with $\alpha = n$ and $\beta = 1$. We have that the variance of a Beta distribution is
$$
\operatorname{Var} = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{n}{(n+1)^2(n+2)},
$$
which tends to $0$ as $n$ tends to $\infty$. Thus, in the limit, the distribution of $\hat\theta_n$ concentrates around its mean $\frac{n}{n+1}\theta \to \theta$, and the estimator is consistent.
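To tie this back to the MSE criterion from earlier (a short computation of my own, using the same $\mathrm{Beta}(n,1)$ facts):
$$
\mathrm{MSE}(\hat\theta_n) = \operatorname{Var}(\hat\theta_n) + b^2
= \frac{n\,\theta^2}{(n+1)^2(n+2)} + \frac{\theta^2}{(n+1)^2}
= \frac{2\theta^2}{(n+1)(n+2)}
\;\xrightarrow{\;n \to \infty\;}\; 0,
$$
so the consistency of $X_{(n)}$ also follows directly from the vanishing MSE.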