Four common estimators and properties

May 30, 2023

MSE

We have for estimator $\hat\Theta$ its mean squared error (MSE):

$$\begin{aligned} MSE(\hat\Theta)&=E[(\hat\Theta-\Theta)^2]\\ &=E[\hat\Theta^2]-2\Theta E[\hat\Theta]+\Theta^2\\ &=E[\hat\Theta^2]-E[\hat\Theta]^2+E[\hat\Theta]^2-2\Theta E[\hat\Theta]+\Theta^2\\ &=Var(\hat\Theta)+B(\hat\Theta)^2 \end{aligned}$$

with bias $B(\hat\Theta)=E[\hat\Theta]-\Theta$. This is, of course, slightly counterintuitive: since both terms enter as squares, the variance can never cancel out the bias; the two sources of error only add.
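As a quick numerical check (a minimal Python sketch; the shrunken sample mean below is an arbitrary estimator invented here just to have nonzero bias), the empirical MSE splits into variance plus squared bias exactly as in the decomposition above:

```python
import random

random.seed(1)
theta = 0.5  # true mean of the U[0, 1] population

def estimate(n=10):
    # A deliberately biased estimator: the sample mean shrunk by 0.9.
    return 0.9 * sum(random.random() for _ in range(n)) / n

draws = [estimate() for _ in range(200_000)]
mean = sum(draws) / len(draws)
mse = sum((d - theta) ** 2 for d in draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
bias_sq = (mean - theta) ** 2
# mse equals var + bias_sq, up to floating-point rounding.
```

The identity holds exactly for empirical moments too, since the cross term vanishes around the empirical mean.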

As we will see later, having the MSE vanish as the number of samples is increased means that the estimator is necessarily consistent.

Bias and consistency

Roughly, estimators are described by two orthogonal adjectives:

  1. Unbiased: the bias of the estimator, $B(\hat\Theta)$, equals $0$.
  2. Consistent: for any $\epsilon>0$, we have

    $$\lim_{n\to\infty}P(|\hat\Theta_n-\Theta|\geq\epsilon)=0.$$

    That is, with more samples, the estimator gets arbitrarily close to the true value. This limit is the definition of convergence in probability to a constant (as opposed to the weaker convergence in distribution, where the limit may be another RV).

Usually, we may determine bias simply by evaluating $E[\hat\Theta]$ in terms of the parameter we care about. Alternatively, you will see later that properties of the MMSE may make it easier to reason about bias for other estimators as well.

Thus we have four examples of different types of estimators. For the population we choose distribution $U[0,1]$. The following examples estimate the mean of the population based on a set of samples $X_i$ drawn from it.

Biased and inconsistent: The mean of the population is surely $\Theta=1/2$. We simply choose estimator $\hat\Theta=0$, which is a point distribution at $x=0$.

Unbiased and consistent (a good estimator): We simply choose the sample mean $\bar X=(X_1+\cdots+X_n)/n$.

Unbiased and inconsistent: An unbiased estimator is easy to construct: take $\hat\Theta$ to be $0$ with probability 50% and $1$ with probability 50%, independent of the samples. Its mean is $1/2$, so it is unbiased; but for any $\epsilon<0.5$, the probability that the estimate lands at least $\epsilon$ away from $\Theta$ is exactly $1$ for every $n$, so it is inconsistent.

Biased and consistent: This is perhaps the most troubling combination of adjectives. Let us consider an estimator for $\Theta$, the upper bound of a population with continuous distribution $U[0,\Theta]$. We take

$$\hat\Theta=\max(X_1,\ldots,X_n).$$

Notice this is simply the distribution of the $n$-th order statistic of $n$ samples. We recall from order statistics that

$$P(\hat\Theta=x)\propto {n\choose n-1}\,f(x)F(x)^{n-1}(1-F(x))^0\propto \mathrm{Beta}(n,1)(x/\Theta).$$

In fact, this distribution is identical to the Beta distribution, scaled up $\Theta$ times to cover the entire support of the order statistic.

Then, the mean of the estimator distribution is proportional to the mean of the Beta distribution, at $\frac{n}{n+1}\Theta$. Of course, this is biased: intuitively, the estimator always approaches the true parameter from the left. Alternatively, the true parameter $\Theta$ must obviously be at least the $n$-th order statistic, so estimating exactly that order statistic is almost certainly an underestimate, should there be any probability mass anywhere above it.

This approach comes arbitrarily close to the true parameter, however, and for that we need two facts:

  1. The mean, $E[\hat\Theta]=\frac{n}{n+1}\Theta$, gets arbitrarily close to $\Theta$.
  2. The variance, the same as that of the Beta distribution up to a constant factor, is $Var(\hat\Theta)\propto Var(\mathrm{Beta}(n,1))=\frac{n}{(n+1)^2(n+2)}$. Obviously, this goes to $0$. The variance may also be derived directly via the formula.

Hence, it is true that $\hat\Theta$ converges in probability to $\Theta$, and indeed, this estimator is consistent.
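A small simulation illustrates both properties at once (a Python sketch; the sample sizes and run count are arbitrary choices): the averaged estimate sits below $\Theta$ for every $n$, yet climbs toward it as $n$ grows.

```python
import random

random.seed(2)
theta = 1.0

def max_estimate(n):
    # The n-th order statistic: the maximum of n draws from U[0, theta].
    return max(random.uniform(0, theta) for _ in range(n))

# Average many runs to approximate E[max] = n/(n+1) * theta for each n.
means = {n: sum(max_estimate(n) for _ in range(2_000)) / 2_000
         for n in (1, 10, 100, 1000)}
# means increases toward theta = 1.0 while always staying below it.
```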


Intuitively, unbiasedness is quite strong: it requires that the estimator be unbiased for every number of samples $n$, while consistency only requires convergence as $n\to\infty$.

Furthermore, notably, the convergence of the MSE of the estimator to $0$ as $n\to\infty$ also proves consistency. This is because both the bias and the variance must then go to $0$ (the rate is irrelevant). This is nearly a tautology.[2]

Methods of estimation[3]

In my studies I have seen no more than four types of estimators commonly employed. Consider observed data $X$ and estimator $\hat\Theta$ for the following discussions.

Maximum likelihood (ML)

We have for some observations $X$ the likelihood of generating such $X$ based on some variable $\Theta$. Then, the following function (over various values of $\Theta$)

$$P(X|\Theta)$$

must have a maximum at some $\Theta=\hat\Theta$, the value which maximizes the likelihood of seeing $X$. This is the maximum likelihood estimator.

To compute the ML estimator, usually one takes the likelihood function $L(\Theta)$ and maximizes it via derivatives. To make this easier, sometimes the log-likelihood is used, as $\log$ is monotonic.
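As a sketch (Python, standard library only; the grid search here is a stand-in for the usual derivative computation), numerically maximizing the Bernoulli log-likelihood for coin flips recovers the familiar closed form $\hat p = k/n$:

```python
import math

def log_likelihood(p, heads, flips):
    # Bernoulli log-likelihood: k*log(p) + (n - k)*log(1 - p).
    return heads * math.log(p) + (flips - heads) * math.log(1 - p)

def ml_estimate(heads, flips, grid=100_000):
    # Maximize over a fine grid of candidate p values, avoiding the
    # endpoints where the log-likelihood diverges to -infinity.
    candidates = (i / grid for i in range(1, grid))
    return max(candidates, key=lambda p: log_likelihood(p, heads, flips))

p_hat = ml_estimate(heads=10, flips=14)  # close to 10/14
```

In practice one would set the derivative of the log-likelihood to zero instead; the grid search just makes the maximization concrete.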

Method of moments (MoM)[7]

The MoM estimators are straightforward. We have, of course, the $s$-th moments of the population $X$:

$$E[X^s]$$

but we also have the moments of the random sample:

$$\frac{1}{n}\sum_{i=1}^{n} x_i^s.$$

We may equate the two moments with each other, for some values of $s$, and solve for the parameter in question. Notably, we begin with $s=1$ and use successively larger values of $s$ as we need to estimate more parameters.

Alternatively, we may utilize the moments around the mean for $s\geq 2$:

$$E[(X-E[X])^s]$$

for different estimators. The properties of these estimators need to be analyzed like any other.
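To make the recipe concrete (a Python sketch; the $U[0,\Theta]$ population anticipates the example at the end of this note), equating the first population moment $E[X]=\Theta/2$ with the first sample moment and solving gives $\hat\Theta_{MoM}=2\bar x$:

```python
import random

def mom_upper_bound(samples):
    # For U[0, theta], E[X] = theta / 2; equate this with the sample mean
    # (the first sample moment) and solve for theta.
    sample_mean = sum(samples) / len(samples)
    return 2 * sample_mean

random.seed(0)
theta = 5.0
samples = [random.uniform(0, theta) for _ in range(100_000)]
theta_hat = mom_upper_bound(samples)  # close to 5.0
```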

Maximum a posteriori (MAP)

The MAP estimator is closely related to the ML estimator. Notably, for data $X$, it is the value $\hat\Theta$ which maximizes

$$P(\Theta|X)=P(X|\Theta)P(\Theta)/P(X)$$

which is the same value of $\hat\Theta$ which maximizes

$$P(\Theta|X)\propto P(X|\Theta)P(\Theta).$$

Should the a priori distribution $P(\Theta)$ be uniform, the ML estimator, which maximizes $P(X|\Theta)$, is necessarily the same as the MAP estimator. Thus, MAP can be seen as a generalization of ML.

For example, in estimating the bias $p$ of a coin given $14$ flips of which $10$ resulted in heads, the ML estimator is obviously $\hat p_{ML}=10/14$. The reasonable conjugate prior to choose here is the Beta distribution, and with no prior information we take the uniform $\mathrm{Beta}(1,1)$; as such, $\hat p_{MAP}=10/14$ as well.
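A minimal sketch (Python; the closed-form posterior mode for a Beta prior is a standard fact, though the function name is my own): with a $\mathrm{Beta}(\alpha,\beta)$ prior, the posterior is $\mathrm{Beta}(\alpha+k,\beta+n-k)$, whose mode gives the MAP estimate.

```python
def map_bernoulli(heads, flips, alpha=1.0, beta=1.0):
    # Posterior is Beta(alpha + k, beta + n - k); its mode (valid when both
    # posterior parameters exceed 1, or for the uniform prior with 0 < k < n)
    # is (alpha + k - 1) / (alpha + beta + n - 2).
    return (heads + alpha - 1) / (flips + alpha + beta - 2)

map_bernoulli(10, 14)            # uniform Beta(1, 1) prior: 10/14, same as ML
map_bernoulli(10, 14, 2.0, 2.0)  # a Beta(2, 2) prior pulls the estimate toward 1/2
```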

Minimum mean squared error (MMSE)[4]/Expected a posteriori (EAP)[5]

Consider again the distribution $\Theta|X$, defined over the support of $\Theta$: instead of choosing the mode of this distribution, which gives the MAP estimator, we choose the mean. This gives the MMSE estimator:

$$\hat\Theta_{MMSE}=E[\Theta|X].$$

Very conveniently, this estimator $\hat\Theta$ minimizes the mean squared error from the true parameter $\Theta$:

$$\hat\Theta=\arg\min_\theta E[(\Theta-\theta)^2].$$

This may be verified directly via calculus and the use of the law of iterated expectations (LIE) near the end:

$$E[X]=E[E[X|Y]].$$
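For a fixed $X$, the calculus step can be written out explicitly (a short expansion of the conditional MSE):

```latex
\frac{d}{d\theta}E[(\Theta-\theta)^2 \mid X]
  = \frac{d}{d\theta}\left(E[\Theta^2\mid X] - 2\theta E[\Theta\mid X] + \theta^2\right)
  = -2E[\Theta\mid X] + 2\theta,
```

which vanishes exactly at $\theta=E[\Theta\mid X]$; the law of iterated expectations then lifts this conditional minimizer to the unconditional mean squared error.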

Example: estimator for uniform RV

We have for some population generated from $\mathbf{X}=U[0,\Theta]$ the samples $\mathbf{x}=x_1,\ldots,x_n$. The estimator $\hat\Theta=\max(x_i)=x_{(n)}$ is trivially the ML estimator. That is, the likelihood of achieving samples $\mathbf{x}$ is

$$L(\mathbf{X}|\Theta)=(1/\Theta)^n\text{ when }\Theta\geq X_{(n)},\text{ and }0\text{ otherwise},$$

which has derivative $dL/d\Theta=-n\Theta^{-n-1}<0$, so the likelihood is strictly decreasing in $\Theta$ and never attains a stationary point. Considering the lower bound on $\Theta$, the likelihood is maximized at the smallest admissible value: $\hat\Theta_{ML}=x_{(n)}$.

It is straightforward that this is a biased estimator, since the $n$-th order statistic of $n$ samples from $U[0,\Theta]$ has mean $\frac{n}{n+1}\Theta$. In fact, this means that $\frac{n+1}{n}x_{(n)}$ should be an unbiased estimator. We now claim that this is also the MMSE estimator.
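A quick simulation supports the unbiasedness claim (a Python sketch; the particular $\Theta$, $n$, and run count are arbitrary choices):

```python
import random

random.seed(3)
theta, n, runs = 2.0, 20, 100_000

def corrected_max():
    samples = [random.uniform(0, theta) for _ in range(n)]
    # Scale the maximum up by (n+1)/n to undo its E[max] = n/(n+1)*theta bias.
    return (n + 1) / n * max(samples)

avg = sum(corrected_max() for _ in range(runs)) / runs
# avg sits very close to theta = 2.0; the raw maximum would average n/(n+1)*theta.
```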

Before moving forward, let us also make a claim to its consistency. We see that $x_{(n)}/\Theta$ is Beta-distributed, with $\alpha=n$ and $\beta=1$. We have that the variance of a Beta distribution is

$$\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

which tends to $0$ as $n$ tends to $\infty$. Thus, in

$$MSE(\hat\Theta)=B(\hat\Theta)^2+Var(\hat\Theta),$$

both terms vanish as $n\to\infty$, so the MSE goes to $0$ and, as noted earlier, the estimator is consistent.
