Maximum Likelihood Estimation

Estimating Parameters of Distributions

We almost never know the true distribution of a data sample.

  • We might hypothesize a family of distributions that captures broad characteristics of the data (location, scale, and shape).
  • However, there may be a set of one or more parameters of the distribution that we don’t know.
  • Typically we use the data to estimate the unknown parameters, \(\smash{\boldsymbol{\theta}}\).

Joint CDF

Suppose we have a collection of random variables \(\smash{{\bf Y}_T = (Y_1, \ldots, Y_T)'}\).

  • We view a data sample of size \(\smash{T}\) as one realization of each random variable: \(\smash{{\bf y}_T = (y_1, \ldots, y_T)'}\).
  • The joint cumulative distribution function (CDF) of \(\smash{{\bf Y}_T}\) is
\[F_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta}) = P(Y_1 \leq y_1, \ldots, Y_T \leq y_T).\]

Joint Density

  • The joint probability density of \(\smash{{\bf Y}_T}\) is
\[\begin{align*} f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta}) & = \frac{\partial^T F_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta})}{\partial y_1 \ldots \partial y_T}, \end{align*}\]

since

\[\begin{align*} F_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta})& = \int_{-\infty}^{y_1}\ldots \int_{-\infty}^{y_T} f_{{\bf Y}_T}({\bf a}_T|\boldsymbol{\theta}) \,\, da_1 \ldots da_T. \end{align*}\]
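
As a quick one-dimensional sanity check (using a standard normal density purely for illustration), numerically integrating the density recovers the CDF:

# Verify F(y) = integral of the density up to y for a standard normal
y = 0.7
integrate(dnorm, -Inf, y)$value  # numerical integral of the density
pnorm(y)                         # closed-form normal CDF; should match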

Independence

When \(\smash{Y_1, \ldots, Y_T}\) are independent of each other and have identical distributions:

  • We say that they are independent and identically distributed, or i.i.d.
  • When \(\smash{Y_1, \ldots, Y_T}\) are i.i.d., they have the same marginal densities:
\[f_{Y_1}(y|\boldsymbol{\theta}) = \ldots = f_{Y_T}(y|\boldsymbol{\theta}).\]

Joint Density Under Independence

Further, when \(\smash{Y_1, \ldots, Y_T}\) are i.i.d.,

\[\begin{align*} f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta}) & = f_{Y_1}(y_1|\boldsymbol{\theta}) \cdot f_{Y_2}(y_2|\boldsymbol{\theta}) \cdots f_{Y_T}(y_T|\boldsymbol{\theta}) = \prod_{i=1}^T f_{Y_i}(y_i|\boldsymbol{\theta}). \end{align*}\]
  • This is analogous to the computation of joint probabilities.
  • For independent events \(\smash{A}\), \(\smash{B}\) and \(\smash{C}\),
\[\begin{align*} P(A \cap B \cap C) = P(A)P(B)P(C). \end{align*}\]
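
As a quick numerical illustration, here is a minimal R sketch (with arbitrary assumed values for \(\smash{\mu}\) and \(\smash{\sigma}\)) showing that the joint density of an i.i.d. normal sample is just the product of its marginal densities:

# Joint density of an i.i.d. normal sample as a product of marginals
# (mu, sig, and the simulated data are assumed values for illustration)
set.seed(42)
mu = 0
sig = 1
y = rnorm(5, mu, sig)
prod(dnorm(y, mu, sig))                # product of marginal densities
exp(sum(dnorm(y, mu, sig, log=TRUE)))  # same value via summed log densities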

Bayes’ Rule

Recall Bayes’ Rule:

\[\begin{split}\begin{gather} P(A|B) = \frac{P(A \cap B)}{P(B)} \\ \Rightarrow P(A \cap B) = P(A|B)P(B). \end{gather}\end{split}\]

Iterating:

\[\begin{split}\begin{align*} P(A \cap B \cap C) & = P((A \cap B) \cap C) \\ & = P(A \cap B | C) P(C) \\ & = P(A|B,C)P(B|C)P(C). \end{align*}\end{split}\]
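
As a small numerical check, the sketch below verifies this chain-rule factorization for an arbitrary joint distribution over three binary events (the joint probabilities are randomly generated for illustration):

# Verify P(A n B n C) = P(A|B,C) P(B|C) P(C) for an arbitrary joint pmf
set.seed(1)
p = array(runif(8), dim=c(2,2,2))   # joint pmf over three binary variables
p = p / sum(p)
# Let A, B, C be the events that dimensions 1, 2, 3 take their first value
pABC = p[1,1,1]
pBC = sum(p[,1,1])                  # P(B n C): sum over A
pC = sum(p[,,1])                    # P(C): sum over A and B
chain = (pABC/pBC) * (pBC/pC) * pC  # P(A|B,C) P(B|C) P(C)
all.equal(chain, pABC)              # TRUE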

Bayes’ Rule

The previous iteration generalizes to:

\[\begin{split}\begin{align*} P\left(\bigcap_{i=1}^n A_i\right) & = P(A_n|A_{n-1},\ldots,A_1) \\ & \hspace{0.5in} \times P(A_{n-1}|A_{n-2},\ldots,A_1) \cdots P(A_2|A_1) P(A_1) \\ & = P(A_1) \prod_{i=2}^n P(A_i|{\bf A}_{i-1}) \end{align*}\end{split}\]

where \(\smash{{\bf A}_i = (A_1,\ldots,A_i)'}\).

General Joint Density

Suppose \(\smash{Y_1, \ldots, Y_T}\) depend on each other sequentially.

  • We use Bayes’ Rule to obtain the joint density:
\[f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta}) = f_{Y_1}(y_1|\boldsymbol{\theta}) \prod_{t=2}^T f_{Y_t|{\bf Y}_{t-1}}(y_t|{\bf y}_{t-1},\boldsymbol{\theta}).\]
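
To make this concrete, here is a hedged sketch for a hypothetical sequentially dependent process in which \(\smash{Y_1 \sim \mathcal{N}(0,1)}\) and each subsequent \(\smash{Y_t}\) is normal around the previous observation; the joint log density is built term by term from the conditional densities:

# Joint log density for sequentially dependent data (hypothetical example):
# Y_1 ~ N(0,1) and Y_t | Y_{t-1} ~ N(y_{t-1}, 1) for t >= 2
set.seed(7)
T = 50
y = cumsum(rnorm(T))  # one realization of this process
logJoint = dnorm(y[1], 0, 1, log=TRUE) +
  sum(dnorm(y[2:T], mean=y[1:(T-1)], sd=1, log=TRUE))
logJoint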

Maximum Likelihood Estimation

One of the most important and powerful methods of parameter estimation is maximum likelihood estimation. It requires

  • A data sample: \(\smash{{\bf y}_T = (y_1, \ldots, y_T)'}\).
  • A joint probability density:
\[f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta}) = f_{Y_1}(y_1|\boldsymbol{\theta}) \prod_{t=2}^T f_{Y_t|{\bf Y}_{t-1}}(y_t|{\bf y}_{t-1},\boldsymbol{\theta}).\]

Likelihood

\(\smash{f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta})}\) is loosely interpreted as the probability of observing data sample \(\smash{{\bf y}_T}\), given a functional form for the density of \(\smash{Y_1, \ldots, Y_T}\) and given a set of parameters \(\smash{\boldsymbol{\theta}}\).

  • We can reverse the notion and think of \(\smash{{\bf y}_T}\) as being fixed and \(\smash{\boldsymbol{\theta}}\) as an unknown variable.
  • In this case we write \(\smash{\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T) = f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta})}\).
  • We refer to \(\smash{\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T)}\) as the likelihood.
  • Fixing \(\smash{{\bf y}_T}\), maximum likelihood estimation chooses the value of \(\smash{\boldsymbol{\theta}}\) that maximizes \(\smash{\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T) = f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta})}\).

Likelihood Maximization

Given \(\smash{\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)'}\), we maximize \(\smash{\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T)}\) by

  • Differentiating with respect to each \(\smash{\theta_i}\), \(\smash{i = 1, \ldots, p}\).
  • Setting the resulting derivatives equal to zero.
  • Solving for the values \(\smash{\hat{\theta}_i}\), \(\smash{i = 1, \ldots, p}\), that make all of the derivatives zero.
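
When these derivative equations have no closed-form solution, the maximization is done numerically. Below is a minimal sketch using R's general-purpose optimizer optim on a simulated gamma sample (the data and starting values are assumptions for illustration); the parameters are optimized on the log scale to keep them positive:

# Numerical likelihood maximization via optim (sketch with simulated data)
set.seed(123)
y = rgamma(200, shape=3, rate=2)
negLogLik = function(par) {
  shape = exp(par[1])  # log parametrization keeps both parameters positive
  rate = exp(par[2])
  -sum(dgamma(y, shape=shape, rate=rate, log=TRUE))  # optim minimizes
}
fit = optim(c(0, 0), negLogLik)
exp(fit$par)  # MLEs of shape and rate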

Log Likelihood

It is often easier to work with the logarithm of the likelihood function.

  • By the properties of logarithms
\[\begin{split}\begin{align*} \ell(\boldsymbol{\theta}|{\bf y}_T) & = \log\left(\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T)\right) \\ & = \log\left(f_{Y_1}(y_1|\boldsymbol{\theta}) \prod_{t=2}^T f_{Y_t|{\bf Y}_{t-1}}(y_t|{\bf y}_{t-1},\boldsymbol{\theta}) \right) \\ & = \log\left(f_{Y_1}(y_1|\boldsymbol{\theta})\right) + \sum_{t=2}^T \log\left(f_{Y_t|{\bf Y}_{t-1}}(y_t|{\bf y}_{t-1},\boldsymbol{\theta})\right) \end{align*}\end{split}\]

Log Likelihood

  • Maximizing \(\smash{\ell(\boldsymbol{\theta}|{\bf y}_T)}\) is the same as maximizing \(\smash{\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T)}\) since \(\smash{\log}\) is a monotonic transformation.
  • A derivative of \(\smash{\mathcal{L}}\) will involve many product-rule terms, whereas a derivative of \(\smash{\ell}\) will simply be a sum of derivatives.
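
There is also a purely numerical reason to prefer \(\smash{\ell}\): a product of many densities underflows to zero in floating point, while the sum of log densities stays well behaved. A quick sketch:

# The raw likelihood underflows for moderately large samples;
# the log likelihood remains finite
set.seed(99)
y = rnorm(1000)
prod(dnorm(y))           # 0 (numerical underflow)
sum(dnorm(y, log=TRUE))  # finite log likelihood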

MLE Example

Suppose we have a dataset \(\smash{{\bf y}_n = (y_1, \ldots, y_n)'}\), where

\[Y_1, \ldots, Y_n \stackrel{i.i.d.}{\sim} \mathcal{N}(\mu,\sigma^2).\]
  • We will assume \(\smash{\mu}\) is unknown and \(\smash{\sigma}\) is known.
  • So, \(\smash{\boldsymbol{\theta} = \mu}\) (it is a single value, rather than a vector).

MLE Example

  • The likelihood is
\[\begin{split}\begin{align*} \mathcal{L}(\mu|{\bf y}_n) & = f_{{\bf Y}_n}({\bf y}_n|\mu) \\ & = \prod_{i=1}^n f_{Y_i}(y_i|\mu) \\ & = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left\{-\frac{1}{2} \frac{(y_i - \mu)^2}{\sigma^2} \right\} \\ & = \frac{1}{(2\pi \sigma^2)^{n/2}} \exp \left\{-\frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right\}. \end{align*}\end{split}\]

MLE Example

The log likelihood is

\[\ell(\mu|{\bf y}_n) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu)^2.\]

MLE Example

  • The MLE, \(\smash{\hat{\mu}}\), is the value that sets \(\smash{\frac{d}{d \mu} \ell(\mu|{\bf y}_n) = 0}\):
\[\begin{split}\begin{gather} \frac{d}{d \mu} \ell(\mu|{\bf y}_n) \bigg|_{\hat{\mu}} = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \hat{\mu}) = 0 \\ \Rightarrow \sum_{i=1}^n (y_i - \hat{\mu}) = 0 \\ \Rightarrow \sum_{i=1}^n \hat{\mu} = \sum_{i=1}^n y_i \\ \Rightarrow \hat{\mu} = \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i. \end{gather}\end{split}\]
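
As a quick numerical check of this result (a sketch with simulated data and an assumed known \(\smash{\sigma}\)), maximizing the log likelihood over \(\smash{\mu}\) with a one-dimensional optimizer recovers the sample mean:

# Verify numerically that the MLE of mu is the sample mean (sigma known)
set.seed(314)
sig = 2
y = rnorm(100, mean=5, sd=sig)
logLik = function(mu) sum(dnorm(y, mean=mu, sd=sig, log=TRUE))
optimize(logLik, interval=c(-100, 100), maximum=TRUE)$maximum
mean(y)  # should agree up to optimizer tolerance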

MLE Example: \(\smash{n=1}\), Unknown \(\smash{\mu}\)

Suppose we have only one observation: \(\smash{y_1}\).

  • If we specialize the previous result:
\[\begin{align*} \hat{\mu} = y_1. \end{align*}\]
  • The density \(\smash{f_{Y_1}(y_1|\mu)}\) gives the probability of observing some data value \(\smash{y_1}\), conditional on some known parameter \(\smash{\mu}\).
  • This is a normal distribution with mean \(\smash{\mu}\) and variance \(\smash{\sigma^2}\).

MLE Example: \(\smash{n=1}\), Unknown \(\smash{\mu}\)

  • The likelihood \(\smash{\mathcal{L}(\mu|y_1)}\) is loosely interpreted as the plausibility of \(\smash{\mu}\), given the observed data value \(\smash{y_1}\).
  • As a function of \(\smash{\mu}\), it is a normal density with mean \(\smash{y_1}\) and variance \(\smash{\sigma^2}\).

MLE Example: \(\smash{n=1}\)

[Figure: the true normal density with mean \(\smash{\mu}\), with two possible data values \(\smash{y_1}\) marked (densExample.png)]

MLE Example: \(\smash{n=1}\)

#######################################################################################
# Plot likelihood for normal data with unknown mean
#######################################################################################

# Generate the true normal density
mu = 10
sig = 15
xGrid = seq(mu-5*sig, mu+5*sig, length=10000)
trueDens = dnorm(xGrid, mu, sig)

# A couple of possible data values
y11 = 30
y12 = -17.7

# Plot the true normal distribution with possible data values
plot(xGrid, trueDens, type='l', xaxs='i', yaxs='i', xaxt='n', xlab='', ylab='',
     yaxt='n', bty='L')
axis(1, labels=c(expression(mu), expression(y[1]), expression(y[1])),
     at=c(mu, y11, y12))
abline(v=mu)
segments(y11, 0, y11, dnorm(y11, mu, sig), lty=2)
segments(y12, 0, y12, dnorm(y12, mu, sig), lty=2)
dev.copy(png, file="densExample.png", height=600, width=1000)
dev.off()

MLE Example: \(\smash{n=1}\), Unknown \(\smash{\mu}\)

[Figure: densities for three candidate means with the observation \(\smash{y_1}\) fixed (top), and the resulting likelihood as a function of \(\smash{\mu}\) (bottom) (likeExample.png)]

MLE Example: \(\smash{n=1}\), Unknown \(\smash{\mu}\)

# Plot several densities for fixed data observation
mu1 = -33
mu2 = 27
dens1 = dnorm(xGrid, mu1, sig)
dens2 = dnorm(xGrid, mu2, sig)
par(mfrow=c(2,1))
plot(xGrid, trueDens, type='l', xaxs='i', yaxs='i', xaxt='n', xlab='', ylab='',
     yaxt='n', bty='L')
axis(1, labels=c(expression(mu^1), expression(mu^2), expression(mu^3), expression(y[1])),
     at=c(mu, mu1, mu2, y12))
lines(xGrid, dens1)
lines(xGrid, dens2)
abline(v=mu)
abline(v=mu1)
abline(v=mu2)
segments(y12, 0, y12, dnorm(y12, mu, sig), lty=2)
segments(y12, 0, y12, dnorm(y12, mu1, sig), lty=3)
segments(y12, 0, y12, dnorm(y12, mu2, sig), lty=4)

# Plot the resulting likelihood
like = dnorm(xGrid, y12, sig)
plot(xGrid, like, type='l', xaxs='i', yaxs='i', xaxt='n', xlab='', ylab='',
     yaxt='n', bty='L')
axis(1, labels=c(expression(mu^1), expression(mu^2), expression(mu^3), expression(y[1])),
     at=c(mu, mu1, mu2, y12))
abline(v=y12)
segments(mu, 0, mu, dnorm(mu, y12, sig), lty=2)
segments(mu1, 0, mu1, dnorm(mu1, y12, sig), lty=2)
segments(mu2, 0, mu2, dnorm(mu2, y12, sig), lty=2)
dev.copy(png, file="likeExample.png", height=600, width=400)
dev.off()

MLE Example: \(\smash{n=1}\), Unknown \(\smash{\sigma}\)

Let’s continue with the assumption of one data observation, \(\smash{y_1}\).

  • If \(\smash{\mu}\) is known but \(\smash{\sigma}\) is unknown, the density of the data, \(\smash{y_1}\), is still normal.
  • However, the likelihood is
\[\mathcal{L}(\sigma^2|y_1) = \frac{\alpha}{\sqrt{\sigma^2}} \exp\left\{-\frac{\beta}{\sigma^2}\right\},\]
where
\[\alpha = \frac{1}{\sqrt{2\pi}}, \qquad \beta = \frac{(y_1-\mu)^2}{2}.\]
  • As a function of \(\smash{\sigma^2}\), the likelihood is not normal; it has the kernel of an inverse gamma distribution.

MLE Example: \(\smash{n=1}\), Unknown \(\smash{\sigma}\)

[Figure: normal densities for three candidate variances with the observation \(\smash{y_1}\) fixed (top), and the resulting inverse-gamma-shaped likelihood as a function of \(\smash{\sigma^2}\) (bottom) (likeExample2.png)]

MLE Example: \(\smash{n=1}\), Unknown \(\smash{\sigma}\)

#######################################################################################
# Plot likelihood for normal data with unknown sd
#######################################################################################

# Plot several densities for fixed data observation
sig1 = 50
sig2 = 25
dens1 = dnorm(xGrid, mu, sig1)
dens2 = dnorm(xGrid, mu, sig2)
par(mfrow=c(2,1))
yMax = max(max(trueDens), max(dens1), max(dens2))
plot(xGrid, trueDens, type='l', xaxs='i', yaxs='i', xaxt='n', xlab='', ylab='',
     yaxt='n', bty='L', ylim=c(0,yMax))
axis(1, labels=c(expression(mu), expression(y[1])), at=c(mu, y12))
lines(xGrid, dens1)
lines(xGrid, dens2)
abline(v=mu)
segments(y12, 0, y12, dnorm(y12, mu, sig), lty=2)
segments(y12, 0, y12, dnorm(y12, mu, sig1), lty=3)
segments(y12, 0, y12, dnorm(y12, mu, sig2), lty=4)
xDist = max(xGrid)-mu
text(c(mu+0.29*xDist, mu+0.4*xDist, mu+0.7*xDist), c(0.66, 0.4, 0.23)*yMax,
     labels=c(expression(sigma[1]^2), expression(sigma[2]^2), expression(sigma[3]^2)))

# Plot the resulting likelihood (which has an inverse gamma kernel in sigma^2)
# Note: 'alpha' here is the inverse gamma shape parameter, not the constant
# alpha = 1/sqrt(2*pi) from the text; that constant is 'scale' below
alpha = -0.5
beta = ((y12 - mu)^2)/2
scale = 1/sqrt(2*pi)
sigGrid = seq(0.01,3000,length=10000)
like = scale*(sigGrid^(-alpha-1))*exp(-beta/sigGrid)
likeTrue = scale*(sig^(-2*alpha-2))*exp(-beta/sig^2)
like1 = scale*(sig1^(-2*alpha-2))*exp(-beta/sig1^2)
like2 = scale*(sig2^(-2*alpha-2))*exp(-beta/sig2^2)
plot(sigGrid, like, type='l', xaxs='i', yaxs='i', xaxt='n', xlab='', ylab='',
     yaxt='n', bty='L')
axis(1, labels=c(expression(sigma[1]^2), expression(sigma[2]^2), expression(sigma[3]^2)),
     at=c(sig^2, sig1^2, sig2^2))
segments(sig^2, min(like), sig^2, likeTrue, lty=2)
segments(sig1^2, min(like), sig1^2, like1, lty=2)
segments(sig2^2, min(like), sig2^2, like2, lty=2)
dev.copy(png, file="likeExample2.png", height=600, width=400)
dev.off()

MLE Accuracy

Maximum likelihood estimation results in estimates of true unknown parameters.

  • What is the probability that our estimates are exactly equal to the true population parameters? Essentially zero.
  • Our estimates are imprecise and contain error.
  • We would like to quantify the precision of our estimates with standard errors.
  • We will use the Fisher Information to compute standard errors.

Fisher Information

Let \(\smash{\mathcal{H}(\boldsymbol{\theta}|{\bf y}_T) = \frac{\partial^2}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}'} \ell(\boldsymbol{\theta}|{\bf y}_T)}\), the matrix of second derivatives (Hessian) of the log likelihood.

  • The Fisher Information is
\[\mathcal{I}(\boldsymbol{\theta}) = - E\left[ \mathcal{H}(\boldsymbol{\theta}|{\bf y}_T)\right].\]
  • The observed Fisher Information is
\[\widetilde{\mathcal{I}}(\boldsymbol{\theta}) = - \mathcal{H}(\boldsymbol{\theta}|{\bf y}_T).\]

Fisher Information

  • Observed Fisher Information does not take an expectation, which may be difficult to compute.
  • Since \(\smash{\ell(\boldsymbol{\theta}|{\bf y}_T)}\) is often a sum of many terms, \(\smash{\widetilde{\mathcal{I}}(\boldsymbol{\theta})}\) will converge to \(\smash{\mathcal{I}(\boldsymbol{\theta})}\) for large samples, so the two can be used interchangeably in practice (see the numerical sketch below).
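
In practice, the observed information is often computed numerically as the Hessian of the negative log likelihood at the optimum, e.g. via optim(..., hessian=TRUE). Here is a sketch for the normal-mean example (simulated data, known \(\smash{\sigma}\)), where the analytic answer derived at the end of this section is \(\smash{n/\sigma^2}\):

# Observed Fisher Information via the numerical Hessian of the negative
# log likelihood (sketch with simulated data and known sigma)
set.seed(2718)
n = 200
sig = 3
y = rnorm(n, mean=1, sd=sig)
negLogLik = function(mu) -sum(dnorm(y, mean=mu, sd=sig, log=TRUE))
fit = optim(0, negLogLik, method="BFGS", hessian=TRUE)
fit$hessian  # observed information; close to the analytic value n/sig^2
n/sig^2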

MLE Central Limit Theorem

Under certain conditions, a central limit theorem holds for the MLE, \(\smash{\hat{\boldsymbol{\theta}}}\).

  • For large samples \(\smash{{\bf y}_T}\), approximately,
\[\hat{\boldsymbol{\theta}} \sim \mathcal{N}(\boldsymbol{\theta}, \mathcal{I}(\boldsymbol{\theta})^{-1}).\]
  • That is, for large samples, \(\smash{\hat{\boldsymbol{\theta}}}\) is approximately normally distributed regardless of the distribution of the data, \(\smash{{\bf y}_T}\), as the simulation sketch below illustrates.
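
For instance (a simulation sketch with assumed parameter values), the MLE of an exponential rate, \(\smash{\hat{\lambda} = 1/\bar{y}}\), is approximately normal across repeated samples even though the data are far from normal:

# CLT for the MLE: exponential data, where the MLE of the rate is 1/ybar
set.seed(8)
rateHats = replicate(5000, 1/mean(rexp(200, rate=2)))
hist(rateHats, breaks=50)  # approximately normal, centered near the true rate 2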

MLE Central Limit Theorem

  • \(\smash{\hat{\boldsymbol{\theta}}}\) is approximately normally distributed for large samples even if the likelihood \(\smash{\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T)}\) is not normal in shape.
  • Hence, for large samples,
\[Var(\hat{\theta}_i) = \left[\mathcal{I}(\boldsymbol{\theta})^{-1}\right]_{ii} \qquad \Rightarrow \qquad Std(\hat{\theta}_i) = \sqrt{\left[\mathcal{I}(\boldsymbol{\theta})^{-1}\right]_{ii}}.\]

MLE Standard Errors

Since we don’t know the true \(\smash{\boldsymbol{\theta}}\), we approximate

\[Std(\hat{\theta}_i) \approx \sqrt{\left[\mathcal{I}(\hat{\boldsymbol{\theta}})^{-1}\right]_{ii}}.\]
  • Alternatively, to avoid computing the expectation, we could use the approximation
\[Std(\hat{\theta}_i) \approx \sqrt{\left[\widetilde{\mathcal{I}}(\hat{\boldsymbol{\theta}})^{-1}\right]_{ii}}.\]

MLE Standard Errors

  • In reality, we never have an infinite sample size.
  • For finite samples, these values are approximations of the standard errors of the components of \(\smash{\hat{\boldsymbol{\theta}}}\).

MLE Variance Example

Let’s return to the example where \(\smash{Y_1, \ldots, Y_n \stackrel{i.i.d.}{\sim} \mathcal{N}(\mu, \sigma^2)}\), with known \(\smash{\sigma}\).

  • The log likelihood is
\[\ell(\mu|{\bf y}_n) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu)^2.\]
  • The resulting derivatives are
\[\frac{\partial \ell(\mu|{\bf y}_n)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \mu), \qquad \frac{\partial^2 \ell(\mu|{\bf y}_n)}{\partial \mu^2} = -\frac{n}{\sigma^2}.\]

MLE Variance Example

In this case the Fisher Information is identical to the observed Fisher Information:

\[\mathcal{I}(\mu) = -E\left[-\frac{n}{\sigma^2}\right] = \frac{n}{\sigma^2} = \widetilde{\mathcal{I}}(\mu).\]
  • Since \(\smash{\mathcal{I}(\mu)}\) doesn’t depend on \(\smash{\mu}\), we don’t need to resort to an approximation with \(\smash{\hat{\mu} = \bar{y}}\).
  • The result is
\[Std(\hat{\mu}) = \frac{1}{\sqrt{\mathcal{I}(\mu)}} = \frac{\sigma}{\sqrt{n}}.\]
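
As a final sanity check, here is a simulation sketch (with assumed values for \(\smash{n}\), \(\smash{\mu}\), and \(\smash{\sigma}\)) confirming that the sampling variability of \(\smash{\hat{\mu} = \bar{y}}\) matches \(\smash{\sigma/\sqrt{n}}\):

# Simulation check: the standard error of mu-hat should be sig/sqrt(n)
set.seed(1234)
n = 50
mu = 10
sig = 15
muHats = replicate(10000, mean(rnorm(n, mu, sig)))
sd(muHats)    # Monte Carlo standard deviation of the MLE
sig/sqrt(n)   # theoretical standard error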