==============================================================================
Maximum Likelihood Estimation
==============================================================================

Estimating Parameters of Distributions
==============================================================================

We almost never know the true distribution of a data sample.

- We might hypothesize a family of distributions that captures broad characteristics of the data (location, scale, and shape).

- However, there may be a set of one or more parameters of the distribution that we don't know.

- Typically we use the data to estimate the unknown parameters, :math:`\smash{\boldsymbol{\theta}}`.

Joint CDF
==============================================================================

Suppose we have a collection of random variables :math:`\smash{{\bf Y}_T = (Y_1, \ldots, Y_T)'}`.

- We view a data sample of size :math:`\smash{T}` as one realization of each random variable: :math:`\smash{{\bf y}_T = (y_1, \ldots, y_T)'}`.

- The *joint cumulative distribution function* (CDF) of :math:`\smash{{\bf Y}_T}` is

.. math::

   F_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta}) = P(Y_1 \leq y_1, \ldots, Y_T \leq y_T).

Joint Density
==============================================================================

- The *joint probability density* of :math:`\smash{{\bf Y}_T}` is

.. math::

   \begin{align*}
   f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta}) & = \frac{\partial^T F_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta})}{\partial y_1 \ldots \partial y_T},
   \end{align*}

since

.. math::

   \begin{align*}
   F_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta}) & = \int_{-\infty}^{y_1}\ldots \int_{-\infty}^{y_T} f_{{\bf Y}_T}({\bf a}_T|\boldsymbol{\theta}) \,\, da_1 \ldots da_T.
   \end{align*}

Independence
==============================================================================

When :math:`\smash{Y_1, \ldots, Y_T}` are independent of each other and have identical distributions:

- We say that they are *independent and identically distributed*, or i.i.d.

- When :math:`\smash{Y_1, \ldots, Y_T}` are i.i.d., they have the same marginal densities:

.. math::

   f_{Y_1}(y|\boldsymbol{\theta}) = \ldots = f_{Y_T}(y|\boldsymbol{\theta}).

Joint Density Under Independence
==============================================================================

Further, when :math:`\smash{Y_1, \ldots, Y_T}` are i.i.d.,

.. math::

   \begin{align*}
   f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta}) & = f_{Y_1}(y_1|\boldsymbol{\theta}) \cdot f_{Y_2}(y_2|\boldsymbol{\theta}) \cdots f_{Y_T}(y_T|\boldsymbol{\theta}) = \prod_{i=1}^T f_{Y_i}(y_i|\boldsymbol{\theta}).
   \end{align*}

- This is analogous to the computation of joint probabilities.

- For independent events :math:`\smash{A}`, :math:`\smash{B}` and :math:`\smash{C}`,

.. math::

   \begin{align*}
   P(A \cap B \cap C) = P(A)P(B)P(C).
   \end{align*}
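For example, we can verify the factorization numerically in R (a quick sketch; the parameter values and data points below are purely illustrative):

.. code::

   # Numerical check: the joint density of an i.i.d. normal sample equals the
   # product of the marginal densities (illustrative values only)
   mu = 10; sig = 15                                # assumed parameter values
   y  = c(30, -17.7, 4.2)                           # a small hypothetical sample

   joint1 = prod(dnorm(y, mu, sig))                 # product of marginal densities
   joint2 = exp(sum(dnorm(y, mu, sig, log=TRUE)))   # same quantity via log densities
   all.equal(joint1, joint2)                        # TRUE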
Bayes' Rule
==============================================================================

Recall Bayes' Rule:

.. math::

   \begin{gather}
   P(A|B) = \frac{P(A \cap B)}{P(B)} \\
   \Rightarrow P(A \cap B) = P(A|B)P(B).
   \end{gather}

Iterating:

.. math::

   \begin{align*}
   P(A \cap B \cap C) & = P((A \cap B) \cap C) \\
   & = P(A \cap B | C) P(C) \\
   & = P(A|B,C)P(B|C)P(C).
   \end{align*}

Bayes' Rule
==============================================================================

The previous iteration generalizes to

.. math::

   \begin{align*}
   P\left(\bigcap_{i=1}^n A_i\right) & = P(A_n|A_{n-1},\ldots,A_1) \\
   & \hspace{0.5in} \times P(A_{n-1}|A_{n-2},\ldots,A_1) \cdots P(A_2|A_1) P(A_1) \\
   & = P(A_1) \prod_{i=2}^n P(A_i|{\bf A}_{i-1}),
   \end{align*}

where :math:`\smash{{\bf A}_i = (A_1,\ldots,A_i)'}`.

General Joint Density
==============================================================================

Suppose :math:`\smash{Y_1, \ldots, Y_T}` depend on each other sequentially.

- We use Bayes' Rule to obtain the joint density:

.. math::

   f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta}) = f_{Y_1}(y_1|\boldsymbol{\theta}) \prod_{t=2}^T f_{Y_t|{\bf Y}_{t-1}}(y_t|{\bf y}_{t-1},\boldsymbol{\theta}).

Maximum Likelihood Estimation
==============================================================================

One of the most important and powerful methods of parameter estimation is *maximum likelihood estimation*. It requires

- A data sample: :math:`\smash{{\bf y}_T = (y_1, \ldots, y_T)'}`.

- A joint probability density:

.. math::

   f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta}) = f_{Y_1}(y_1|\boldsymbol{\theta}) \prod_{t=2}^T f_{Y_t|{\bf Y}_{t-1}}(y_t|{\bf y}_{t-1},\boldsymbol{\theta}).

Likelihood
==============================================================================

:math:`\smash{f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta})}` is loosely interpreted as the probability of observing data sample :math:`\smash{{\bf y}_T}`, given a functional form for the density of :math:`\smash{Y_1, \ldots, Y_T}` and given a set of parameters :math:`\smash{\boldsymbol{\theta}}`.

- We can reverse the notion and think of :math:`\smash{{\bf y}_T}` as being fixed and :math:`\smash{\boldsymbol{\theta}}` as an unknown variable.

- In this case we write :math:`\smash{\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T) = f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta})}`.

- We refer to :math:`\smash{\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T)}` as the *likelihood*.

- Fixing :math:`\smash{{\bf y}_T}`, maximum likelihood estimation chooses the value of :math:`\smash{\boldsymbol{\theta}}` that maximizes :math:`\smash{\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T) = f_{{\bf Y}_T}({\bf y}_T|\boldsymbol{\theta})}`.

Likelihood Maximization
==============================================================================

Given :math:`\smash{\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)'}`, we maximize :math:`\smash{\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T)}` by

- Differentiating with respect to each :math:`\smash{\theta_i}`, :math:`\smash{i = 1, \ldots, p}`.

- Setting the resulting derivatives equal to zero.

- Solving for the values :math:`\smash{\hat{\theta}_i}`, :math:`\smash{i = 1, \ldots, p}`, that make all of the derivatives zero.
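When the derivatives are hard to solve analytically, the same maximization can be carried out numerically. A minimal sketch in R, using a simulated (purely illustrative) normal sample with known :math:`\smash{\sigma}`:

.. code::

   # Numerical likelihood maximization (sketch; simulated data, known sd)
   set.seed(123)
   sig = 15
   y = rnorm(50, mean=10, sd=sig)             # hypothetical i.i.d. sample

   # Likelihood of mu, holding the data fixed
   likeFun = function(m) prod(dnorm(y, mean=m, sd=sig))

   # Maximize over a wide interval of candidate values for mu
   optimize(likeFun, interval=c(-100, 100), maximum=TRUE)$maximum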
Log Likelihood
==============================================================================

It is often easier to work with the logarithm of the likelihood function.

- By the properties of logarithms,

.. math::

   \begin{align*}
   \ell(\boldsymbol{\theta}|{\bf y}_T) & = \log\left(\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T)\right) \\
   & = \log\left(f_{Y_1}(y_1|\boldsymbol{\theta}) \prod_{t=2}^T f_{Y_t|{\bf Y}_{t-1}}(y_t|{\bf y}_{t-1},\boldsymbol{\theta}) \right) \\
   & = \log\left(f_{Y_1}(y_1|\boldsymbol{\theta})\right) + \sum_{t=2}^T \log\left(f_{Y_t|{\bf Y}_{t-1}}(y_t|{\bf y}_{t-1},\boldsymbol{\theta})\right).
   \end{align*}

Log Likelihood
==============================================================================

- Maximizing :math:`\smash{\ell(\boldsymbol{\theta}|{\bf y}_T)}` is the same as maximizing :math:`\smash{\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T)}`, since :math:`\smash{\log}` is a monotonic transformation.

- A derivative of :math:`\smash{\mathcal{L}}` will involve many product-rule terms, whereas a derivative of :math:`\smash{\ell}` is simply a sum of derivatives.

MLE Example
==============================================================================

Suppose we have a dataset :math:`\smash{{\bf y}_n = (y_1, \ldots, y_n)'}`, where

.. math::

   Y_1, \ldots, Y_n \stackrel{i.i.d.}{\sim} \mathcal{N}(\mu,\sigma^2).

- We will assume :math:`\smash{\mu}` is *unknown* and :math:`\smash{\sigma}` is *known*.

- So, :math:`\smash{\boldsymbol{\theta} = \mu}` (it is a single value, rather than a vector).

MLE Example
==============================================================================

- The likelihood is

.. math::

   \begin{align*}
   \mathcal{L}(\mu|{\bf y}_n) & = f_{{\bf Y}_n}({\bf y}_n|\mu) \\
   & = \prod_{i=1}^n f_{Y_i}(y_i|\mu) \\
   & = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left\{-\frac{1}{2} \frac{(y_i - \mu)^2}{\sigma^2} \right\} \\
   & = \frac{1}{(2\pi \sigma^2)^{n/2}} \exp \left\{-\frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right\}.
   \end{align*}

MLE Example
==============================================================================

The log likelihood is

.. math::

   \ell(\mu|{\bf y}_n) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu)^2.

MLE Example
==============================================================================

- The MLE, :math:`\smash{\hat{\mu}}`, is the value that sets :math:`\smash{\frac{d}{d \mu} \ell(\mu|{\bf y}_n) = 0}`:

.. math::

   \begin{gather}
   \frac{d}{d \mu} \ell(\mu|{\bf y}_n) \bigg|_{\hat{\mu}} = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \hat{\mu}) = 0 \\
   \Rightarrow \sum_{i=1}^n (y_i - \hat{\mu}) = 0 \\
   \Rightarrow n \hat{\mu} = \sum_{i=1}^n y_i \\
   \Rightarrow \hat{\mu} = \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i.
   \end{gather}
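A quick numerical check (again with simulated, illustrative data) confirms that the maximizer of the log likelihood coincides with the sample mean:

.. code::

   # Check that the MLE of mu equals the sample mean (illustrative simulation)
   set.seed(42)
   sig = 15                                      # known standard deviation
   y = rnorm(100, mean=10, sd=sig)               # hypothetical i.i.d. sample

   # Log likelihood of mu (additive constants don't affect the argmax)
   logLik = function(m) sum(dnorm(y, mean=m, sd=sig, log=TRUE))

   muHat = optimize(logLik, interval=c(-100, 100), maximum=TRUE)$maximum
   c(muHat, mean(y))                             # the two values should agree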
MLE Example: :math:`\smash{n=1}`, Unknown :math:`\smash{\mu}`
==============================================================================

Suppose we have only one observation: :math:`\smash{y_1}`.

- If we specialize the previous result:

.. math::

   \begin{align}
   \hat{\mu} = y_1.
   \end{align}

- The density :math:`\smash{f_{Y_1}(y_1|\mu)}` gives the probability of observing some data value :math:`\smash{y_1}`, conditional on some *known* parameter :math:`\smash{\mu}`.

- This is a normal distribution with mean :math:`\smash{\mu}` and variance :math:`\smash{\sigma^2}`.

MLE Example: :math:`\smash{n=1}`, Unknown :math:`\smash{\mu}`
==============================================================================

- The likelihood :math:`\smash{\mathcal{L}(\mu|y_1)}` is loosely interpreted as the probability of :math:`\smash{\mu}`, conditional on the observed data value :math:`\smash{y_1}`.

- This is a normal distribution with mean :math:`\smash{y_1}` and variance :math:`\smash{\sigma^2}`.

MLE Example: :math:`\smash{n=1}`
==============================================================================

.. image:: ../_static/MLE/densExample.png
   :width: 7.5in
   :align: center

MLE Example: :math:`\smash{n=1}`
==============================================================================

.. code::

   #######################################################################################
   # Plot likelihood for normal data with unknown mean
   #######################################################################################

   # Generate the true normal density
   mu = 10; sig = 15;
   xGrid = seq(mu-5*sig, mu+5*sig, length=10000)
   trueDens = dnorm(xGrid, mu, sig)

   # A couple of possible data values
   y11 = 30; y12 = -17.7;

   # Plot the true normal distribution with possible data values
   plot(xGrid, trueDens, type='l', xaxs='i', yaxs='i', xaxt='n', xlab='', ylab='',
        yaxt='n', bty='L')
   axis(1, labels=c(expression(mu), expression(y[1]), expression(y[1])), at=c(mu, y11, y12))
   abline(v=mu)
   segments(y11, 0, y11, dnorm(y11, mu, sig), lty=2)
   segments(y12, 0, y12, dnorm(y12, mu, sig), lty=2)

   dev.copy(png, file="densExample.png", height=600, width=1000)
   dev.off()

MLE Example: :math:`\smash{n=1}`, Unknown :math:`\smash{\mu}`
==============================================================================

.. image:: ../_static/MLE/likeExample.png
   :align: center

MLE Example: :math:`\smash{n=1}`, Unknown :math:`\smash{\mu}`
==============================================================================

.. code::

   # Plot several densities for a fixed data observation (reuses mu, sig, xGrid,
   # trueDens and y12 from the previous chunk)
   mu1 = -33
   mu2 = 27
   dens1 = dnorm(xGrid, mu1, sig)
   dens2 = dnorm(xGrid, mu2, sig)

   par(mfrow=c(2,1))

   # Top panel: densities under the true mu and two candidate values
   plot(xGrid, trueDens, type='l', xaxs='i', yaxs='i', xaxt='n', xlab='', ylab='',
        yaxt='n', bty='L')
   axis(1, labels=c(expression(mu^1), expression(mu^2), expression(mu^3), expression(y[1])),
        at=c(mu, mu1, mu2, y12))
   lines(xGrid, dens1)
   lines(xGrid, dens2)
   abline(v=mu)
   abline(v=mu1)
   abline(v=mu2)
   segments(y12, 0, y12, dnorm(y12, mu, sig), lty=2)
   segments(y12, 0, y12, dnorm(y12, mu1, sig), lty=3)
   segments(y12, 0, y12, dnorm(y12, mu2, sig), lty=4)

   # Bottom panel: the resulting likelihood of mu, given y12
   like = dnorm(xGrid, y12, sig)
   plot(xGrid, like, type='l', xaxs='i', yaxs='i', xaxt='n', xlab='', ylab='',
        yaxt='n', bty='L')
   axis(1, labels=c(expression(mu^1), expression(mu^2), expression(mu^3), expression(y[1])),
        at=c(mu, mu1, mu2, y12))
   abline(v=y12)
   segments(mu, 0, mu, dnorm(mu, y12, sig), lty=2)
   segments(mu1, 0, mu1, dnorm(mu1, y12, sig), lty=2)
   segments(mu2, 0, mu2, dnorm(mu2, y12, sig), lty=2)

   dev.copy(png, file="likeExample.png", height=600, width=400)
   dev.off()

MLE Example: :math:`\smash{n=1}`, Unknown :math:`\smash{\sigma}`
==============================================================================

Let's continue with the assumption of one data observation, :math:`\smash{y_1}`.

- If :math:`\smash{\mu}` is known but :math:`\smash{\sigma}` is unknown, the density of the data, :math:`\smash{y_1}`, is still normal.

- However, the likelihood is

.. math::

   \mathcal{L}(\sigma^2|y_1) = \frac{\alpha}{\sigma} \exp\left\{-\frac{\beta}{\sigma^2}\right\},

where

.. math::

   \alpha = \frac{1}{\sqrt{2\pi}}, \qquad \beta = \frac{(y_1-\mu)^2}{2}.

- The likelihood for :math:`\smash{\sigma^2}` is *not* normal, but *inverse gamma*.
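As a sketch (using an arbitrary observation and a known mean), we can verify numerically that this likelihood is maximized at :math:`\smash{\hat{\sigma}^2 = (y_1 - \mu)^2}`:

.. code::

   # Sketch: for a single observation and known mu, the likelihood of sigma^2
   # is maximized at (y1 - mu)^2 (illustrative values)
   mu = 10
   y1 = -17.7

   logLikSig2 = function(s2) dnorm(y1, mean=mu, sd=sqrt(s2), log=TRUE)

   sig2Hat = optimize(logLikSig2, interval=c(0.01, 5000), maximum=TRUE)$maximum
   c(sig2Hat, (y1 - mu)^2)    # both approximately 767.29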
MLE Example: :math:`\smash{n=1}`, Unknown :math:`\smash{\sigma}`
==============================================================================

.. image:: ../_static/MLE/likeExample2.png
   :width: 5in
   :align: center

MLE Example: :math:`\smash{n=1}`, Unknown :math:`\smash{\sigma}`
==============================================================================

.. code::

   #######################################################################################
   # Plot likelihood for normal data with unknown sd
   #######################################################################################

   # Plot several densities for a fixed data observation (reuses mu, sig, xGrid,
   # trueDens and y12 from the earlier chunks)
   sig1 = 50
   sig2 = 25
   dens1 = dnorm(xGrid, mu, sig1)
   dens2 = dnorm(xGrid, mu, sig2)

   par(mfrow=c(2,1))

   yMax = max(max(trueDens), max(dens1), max(dens2))
   plot(xGrid, trueDens, type='l', xaxs='i', yaxs='i', xaxt='n', xlab='', ylab='',
        yaxt='n', bty='L', ylim=c(0,yMax))
   axis(1, labels=c(expression(mu), expression(y[1])), at=c(mu, y12))
   lines(xGrid, dens1)
   lines(xGrid, dens2)
   abline(v=mu)
   segments(y12, 0, y12, dnorm(y12, mu, sig), lty=2)
   segments(y12, 0, y12, dnorm(y12, mu, sig1), lty=3)
   segments(y12, 0, y12, dnorm(y12, mu, sig2), lty=4)
   xDist = max(xGrid)-mu
   text(c(mu+0.29*xDist, mu+0.4*xDist, mu+0.7*xDist), c(0.66, 0.4, 0.23)*yMax,
        labels=c(expression(sigma[1]^2), expression(sigma[2]^2), expression(sigma[3]^2)))

   # Plot the resulting likelihood (which is an inverse gamma)
   alpha = -0.5
   beta = ((y12 - mu)^2)/2
   scale = 1/sqrt(2*pi)
   sigGrid = seq(0.01, 3000, length=10000)        # grid of sigma^2 values
   like = scale*(sigGrid^(-alpha-1))*exp(-beta/sigGrid)
   likeTrue = scale*(sig^(-2*alpha-2))*exp(-beta/sig^2)
   like1 = scale*(sig1^(-2*alpha-2))*exp(-beta/sig1^2)
   like2 = scale*(sig2^(-2*alpha-2))*exp(-beta/sig2^2)
   plot(sigGrid, like, type='l', xaxs='i', yaxs='i', xaxt='n', xlab='', ylab='',
        yaxt='n', bty='L')
   axis(1, labels=c(expression(sigma[1]^2), expression(sigma[2]^2), expression(sigma[3]^2)),
        at=c(sig^2, sig1^2, sig2^2))
   segments(sig^2, min(like), sig^2, likeTrue, lty=2)
   segments(sig1^2, min(like), sig1^2, like1, lty=2)
   segments(sig2^2, min(like), sig2^2, like2, lty=2)

   dev.copy(png, file="likeExample2.png", height=600, width=400)
   dev.off()

MLE Accuracy
==============================================================================

Maximum likelihood estimation results in estimates of true unknown parameters.

- What is the probability that our estimates are identical to the true population parameters?

- Our estimates are imprecise and contain error.

- We would like to quantify the precision of our estimates with standard errors.

- We will use the *Fisher Information* to compute standard errors.

Fisher Information
==============================================================================

Let :math:`\smash{\mathcal{H}(\boldsymbol{\theta}|{\bf y}_T) = \frac{d^2}{d \boldsymbol{\theta}^2} \ell(\boldsymbol{\theta}|{\bf y}_T)}`, the matrix of second derivatives of the log likelihood.

- The Fisher Information is

.. math::

   \mathcal{I}(\boldsymbol{\theta}) = - E\left[ \mathcal{H}(\boldsymbol{\theta}|{\bf y}_T)\right].

- The observed Fisher Information is

.. math::

   \widetilde{\mathcal{I}}(\boldsymbol{\theta}) = - \mathcal{H}(\boldsymbol{\theta}|{\bf y}_T).

Fisher Information
==============================================================================

- The observed Fisher Information does not require taking an expectation, which may be difficult to compute.

- Since :math:`\smash{\ell(\boldsymbol{\theta}|{\bf y}_T)}` is often a sum of many terms, :math:`\smash{\widetilde{\mathcal{I}}(\boldsymbol{\theta})}` will converge to :math:`\smash{\mathcal{I}(\boldsymbol{\theta})}` for large samples.
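In practice the observed Fisher Information is often obtained numerically as the Hessian of the negative log likelihood at the MLE. A rough sketch in R (simulated, illustrative data, with the parameters collected as :math:`\smash{(\mu, \log \sigma)'}` so that :math:`\smash{\sigma}` stays positive):

.. code::

   # Sketch: observed Fisher Information as the numerical Hessian of the
   # negative log likelihood (illustrative simulated data)
   set.seed(7)
   y = rnorm(200, mean=10, sd=15)

   # Negative log likelihood in terms of theta = (mu, log sigma)'
   negLogLik = function(theta) -sum(dnorm(y, mean=theta[1], sd=exp(theta[2]), log=TRUE))

   fit = optim(c(mean(y), log(sd(y))), negLogLik, method="BFGS", hessian=TRUE)

   fit$par                          # MLEs of mu and log sigma
   fit$hessian                      # observed Fisher Information (transformed parameters)
   sqrt(diag(solve(fit$hessian)))   # approximate standard errors (see the next slides)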
MLE Central Limit Theorem
==============================================================================

Under certain conditions, a central limit theorem holds for the MLE, :math:`\smash{\hat{\boldsymbol{\theta}}}`.

- For infinitely large samples :math:`\smash{{\bf y}_T}`,

.. math::

   \hat{\boldsymbol{\theta}} \sim \mathcal{N}(\boldsymbol{\theta}, \mathcal{I}(\boldsymbol{\theta})^{-1}).

- For large samples, :math:`\smash{\hat{\boldsymbol{\theta}}}` is normally distributed *regardless* of the distribution of the data, :math:`\smash{{\bf y}_T}`.

MLE Central Limit Theorem
==============================================================================

- :math:`\smash{\hat{\boldsymbol{\theta}}}` is also normally distributed for large samples even if :math:`\smash{\mathcal{L}(\boldsymbol{\theta}|{\bf y}_T)}` is some other distribution.

- Hence, for large samples,

.. math::

   Var(\hat{\theta}_i) = \left[\mathcal{I}(\boldsymbol{\theta})^{-1}\right]_{ii} \qquad \Rightarrow \qquad Std(\hat{\theta}_i) = \sqrt{\left[\mathcal{I}(\boldsymbol{\theta})^{-1}\right]_{ii}}.

MLE Standard Errors
==============================================================================

Since we don't know the true :math:`\smash{\boldsymbol{\theta}}`, we approximate

.. math::

   Std(\hat{\theta}_i) \approx \sqrt{\left[\mathcal{I}(\hat{\boldsymbol{\theta}})^{-1}\right]_{ii}}.

- Alternatively, to avoid computing the expectation, we could use the approximation

.. math::

   Std(\hat{\theta}_i) \approx \sqrt{\left[\widetilde{\mathcal{I}}(\hat{\boldsymbol{\theta}})^{-1}\right]_{ii}}.

MLE Standard Errors
==============================================================================

- In reality, we never have an infinite sample size.

- For finite samples, these values are approximations of the standard errors of the components of :math:`\smash{\hat{\boldsymbol{\theta}}}`.

MLE Variance Example
==============================================================================

Let's return to the example where :math:`\smash{Y_1, \ldots, Y_n \stackrel{i.i.d.}{\sim} \mathcal{N}(\mu, \sigma^2)}`, with known :math:`\smash{\sigma}`.

- The log likelihood is

.. math::

   \ell(\mu|{\bf y}_n) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu)^2.

- The resulting derivatives are

.. math::

   \frac{\partial \ell(\mu|{\bf y}_n)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \mu), \qquad
   \frac{\partial^2 \ell(\mu|{\bf y}_n)}{\partial \mu^2} = -\frac{n}{\sigma^2}.

MLE Variance Example
==============================================================================

In this case the Fisher Information is identical to the observed Fisher Information:

.. math::

   \mathcal{I}(\mu) = -E\left[-\frac{n}{\sigma^2}\right] = \frac{n}{\sigma^2} = \widetilde{\mathcal{I}}(\mu).

- Since :math:`\smash{\mathcal{I}(\mu)}` doesn't depend on :math:`\smash{\mu}`, we don't need to resort to an approximation with :math:`\smash{\hat{\mu} = \bar{y}}`.

- The result is

.. math::

   Std(\hat{\mu}) = \frac{1}{\sqrt{\mathcal{I}(\mu)}} = \frac{\sigma}{\sqrt{n}}.
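A small Monte Carlo sketch (with arbitrary parameter values) illustrates that the sampling variability of :math:`\smash{\hat{\mu} = \bar{y}}` is indeed approximately :math:`\smash{\sigma/\sqrt{n}}`:

.. code::

   # Sketch: Monte Carlo check that Std(muHat) is approximately sigma/sqrt(n)
   set.seed(314)
   mu = 10; sig = 15; n = 100

   muHats = replicate(5000, mean(rnorm(n, mu, sig)))   # MLE computed over many samples

   sd(muHats)       # simulated standard error
   sig/sqrt(n)      # theoretical standard error: 1.5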