Numerical Optimization

Numerical Maximum Likelihood

Given \(\smash{\boldsymbol{\theta}}\) and \(\smash{\boldsymbol{y}}\), suppose we can compute the value of a likelihood or log likelihood.
  • Likelihood optimization is often very challenging.
  • We may not be able to obtain analytical expressions for the MLEs, \(\smash{\hat{\boldsymbol{\theta}}}\).
  • Numerical optimization techniques will often help us find an approximate (not exact) MLE.
  • We will need to set a tolerance level for the quality of our approximation.

Grid Search

To implement grid search, we simply evaluate the likelihood at each value in \(\smash{\Theta}\).

  • Each value in \(\smash{\Theta}\) is a candidate vector of parameter values.
  • The approximate MLE is the grid point that achieves the highest likelihood or log likelihood value.

Grid search is ineffective for large \(\smash{k}\) because the number of grid points in \(\smash{\Theta}\) grows exponentially.

  • Doubling the number of points in each dimension (for more accuracy) multiplies the total number of grid points by \(\smash{2^k}\).
  • This is called the curse of dimensionality.
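
To make the idea concrete, here is a minimal grid-search sketch in Python for a normal model with unknown mean and standard deviation; the simulated data, grid bounds, and spacing are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

# Simulated data, for illustration only
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=200)

# Grid of candidate (mu, sigma) pairs -- bounds and spacing are arbitrary here
mus = np.linspace(-1.0, 3.0, 81)
sigmas = np.linspace(0.5, 4.0, 71)

best_ll, best_theta = -np.inf, None
for mu in mus:
    for sigma in sigmas:
        ll = norm.logpdf(y, loc=mu, scale=sigma).sum()  # log likelihood at this grid point
        if ll > best_ll:
            best_ll, best_theta = ll, (mu, sigma)

print("approximate MLE:", best_theta)
```

With \(\smash{k}\) parameters and \(\smash{m}\) points per dimension, this loop evaluates \(\smash{m^k}\) grid points, which is why the approach breaks down quickly as \(\smash{k}\) grows.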

Newton’s Method

Newton’s method is an iterative root-finding algorithm that uses derivative/gradient information:

\[\smash{x^{(i+1)} = x^{(i)} - f(x^{(i)})/f'(x^{(i)}).}\]

For large \(\smash{n}\), the value \(\smash{x^{(n)}}\) approximates a root of the function: a point \(\smash{x}\) such that \(\smash{f(x) = 0}\).
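
For instance, a minimal Python sketch of the scalar update (the example function, starting point, and tolerance are illustrative choices):

```python
def newton_root(f, fprime, x0, tol=1e-10, max_iter=100):
    """Approximate a root of f using Newton's method."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / fprime(x)   # Newton update
        if abs(x_new - x) < tol:       # stop when iterates stabilize
            return x_new
        x = x_new
    return x

# Example: root of f(x) = x^2 - 2 (i.e. sqrt(2)), starting from x0 = 1
root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)  # approximately 1.41421356
```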

Newton-Raphson

Newton’s method can also be used for optimization (not just root finding).

  • Optimizing \(\smash{f}\) is the same as finding a root of its derivative, i.e. solving \(\smash{f'(x) = 0}\) (a short sketch follows below).
  • The Newton-Raphson algorithm is
\[\smash{x^{(i+1)} = x^{(i)} - f'(x^{(i)})/f''(x^{(i)}).}\]
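
A one-dimensional sketch, applying the same root-finding update to \(\smash{f'}\) (the example function is an arbitrary concave quadratic):

```python
def newton_raphson_1d(fprime, fsecond, x0, tol=1e-10, max_iter=100):
    """Find a stationary point of f by applying Newton's method to f'."""
    x = x0
    for _ in range(max_iter):
        x_new = x - fprime(x) / fsecond(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: maximize f(x) = -(x - 3)^2, so f'(x) = -2(x - 3) and f''(x) = -2
xhat = newton_raphson_1d(lambda x: -2 * (x - 3), lambda x: -2.0, x0=0.0)
print(xhat)  # approximately 3.0
```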

Newton-Raphson

Define

\[\begin{split}\begin{align*} \boldsymbol{g}(\boldsymbol{\theta}) & = \nabla \ell(\boldsymbol{\theta}) = \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\ \mathcal{H}(\boldsymbol{\theta}) & = \nabla^2 \ell(\boldsymbol{\theta}) = \nabla \boldsymbol{g}(\boldsymbol{\theta}) = \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T}, \end{align*}\end{split}\]

where \(\smash{\mathcal{H}(\boldsymbol{\theta})}\) is negative definite (the log likelihood is concave):

\[\smash{\boldsymbol{x}^T \mathcal{H}(\boldsymbol{\theta}) \boldsymbol{x} < 0 \,\,\,\, \forall \boldsymbol{x} \in \mathbb{R}^k, \, \boldsymbol{x} \neq \boldsymbol{0}.}\]

Newton-Raphson

We approximate \(\smash{\ell(\boldsymbol{\theta})}\) with a second-order Taylor expansion around \(\smash{\boldsymbol{\theta}^{(0)}}\):

\[\smash{\tilde{\ell}(\boldsymbol{\theta}) = \ell(\boldsymbol{\theta}^{(0)}) + \boldsymbol{g}(\boldsymbol{\theta}^{(0)})^T (\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}) + \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)})^T \mathcal{H}(\boldsymbol{\theta}^{(0)}) (\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}).}\]

The Newton-Raphson method chooses \(\smash{\boldsymbol{\theta}^{(1)}}\) to maximize \(\smash{\tilde{\ell}(\boldsymbol{\theta})}\):

\[\begin{split}\begin{gather*} \nabla \tilde{\ell}(\boldsymbol{\theta})\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(1)}} = \boldsymbol{g}(\boldsymbol{\theta}^{(0)}) + \mathcal{H}(\boldsymbol{\theta}^{(0)}) (\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}) = 0. \\ \implies \boldsymbol{\theta}^{(1)} = \boldsymbol{\theta}^{(0)} - \mathcal{H}(\boldsymbol{\theta}^{(0)})^{-1} \boldsymbol{g}(\boldsymbol{\theta}^{(0)}). \end{gather*}\end{split}\]

Newton-Raphson

The Newton-Raphson method begins with \(\smash{\boldsymbol{\theta}^{(0)}}\) and iteratively computes

\[\smash{\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} - \mathcal{H}(\boldsymbol{\theta}^{(i)})^{-1} \boldsymbol{g}(\boldsymbol{\theta}^{(i)})}\]

until \(\smash{||\boldsymbol{\theta}^{(i+1)} - \boldsymbol{\theta}^{(i)}|| < \tau}\), where \(\smash{\tau}\) is some tolerance level.
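
A minimal multivariate sketch, assuming the user supplies \(\smash{\boldsymbol{g}}\) and \(\smash{\mathcal{H}}\) as functions (the concave quadratic example below is purely illustrative):

```python
import numpy as np

def newton_raphson(grad, hess, theta0, tol=1e-8, max_iter=100):
    """Iterate theta <- theta - H(theta)^{-1} g(theta) until the step is smaller than tol."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(theta), grad(theta))  # H^{-1} g without explicit inversion
        theta_new = theta - step
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Illustrative concave quadratic: l(theta) = -(theta - m)^T A (theta - m) / 2, with A positive definite
A = np.array([[2.0, 0.5], [0.5, 1.0]])
m = np.array([1.0, -2.0])
grad = lambda th: -A @ (th - m)   # g(theta)
hess = lambda th: -A              # H(theta), negative definite
print(newton_raphson(grad, hess, theta0=[0.0, 0.0]))  # converges to m in one step
```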

Newton-Raphson

  • Newton-Raphson converges quickly if the log likelihood is concave and the initial guess is sufficiently close to the optimum.
  • A modified version of Newton-Raphson computes:
\[\smash{\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} - s \mathcal{H}(\boldsymbol{\theta}^{(i)})^{-1} \boldsymbol{g}(\boldsymbol{\theta}^{(i)})}\]

for various values of \(\smash{s}\) at each iteration and chooses the \(\smash{\boldsymbol{\theta}^{(i+1)}}\) that yields the largest likelihood value (a short sketch follows below).
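
A sketch of this step-search variant, trying a small fixed set of candidate values of \(\smash{s}\) at each iteration and keeping the one with the highest log likelihood (the candidate step sizes are arbitrary):

```python
import numpy as np

def newton_raphson_step_search(loglik, grad, hess, theta0,
                               step_sizes=(1.0, 0.5, 0.25, 0.1),
                               tol=1e-8, max_iter=100):
    """Newton-Raphson where each iteration tries several values of s."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        direction = np.linalg.solve(hess(theta), grad(theta))
        # Evaluate the log likelihood at each candidate step and keep the best
        candidates = [theta - s * direction for s in step_sizes]
        theta_new = max(candidates, key=loglik)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```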

Quasi Newton-Raphson

Various modified Newton-Raphson methods have been proposed which substitute other negative definite matrices for \(\smash{\mathcal{H}(\boldsymbol{\theta}^{(i)})^{-1}}\).

  • These are useful when \(\smash{\mathcal{H}(\boldsymbol{\theta}^{(i)})}\) is difficult to compute or to invert.
  • Typically these are slower but more robust.
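
In practice one typically relies on a library implementation of a quasi-Newton method such as BFGS. A sketch using scipy, which minimizes the negative log likelihood of a normal model (the simulated data and starting values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(loc=1.0, scale=2.0, size=200)

def neg_loglik(theta):
    # Parameterize sigma on the log scale to keep it positive
    mu, log_sigma = theta
    return -norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)
```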

Numerical Differentiation

If analytical derivatives are not possible, numerical derivatives are an option.

  • The \(\smash{i}\)th element of \(\smash{\boldsymbol{g}(\boldsymbol{\theta})}\) can be approximated with:
\[\smash{g_i(\boldsymbol{\theta}) \approx \frac{1}{\Delta} \left(\ell(\theta_1,\ldots,\theta_i+\Delta,\ldots,\theta_k) - \ell(\theta_1,\ldots,\theta_i,\ldots,\theta_k)\right)},\]

for some small \(\smash{\Delta}\).

  • The Hessian can be computed numerically from \(\smash{\boldsymbol{g}(\boldsymbol{\theta})}\) in a similar manner.
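
A forward-difference sketch of the gradient approximation above (the function name `numerical_gradient` and the default value of \(\smash{\Delta}\) are illustrative choices):

```python
import numpy as np

def numerical_gradient(loglik, theta, delta=1e-6):
    """Forward-difference approximation to g(theta), the gradient of the log likelihood."""
    theta = np.asarray(theta, dtype=float)
    base = loglik(theta)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        bumped = theta.copy()
        bumped[i] += delta                          # perturb only the i-th element
        grad[i] = (loglik(bumped) - base) / delta   # finite-difference quotient
    return grad

# The Hessian can be approximated the same way, by differencing numerical_gradient itself.
```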