Numerical Optimization

Numerical Maximum Likelihood

Given \(\smash{\boldsymbol{\theta}}\) and \(\smash{\boldsymbol{y}}\), suppose we can compute the value of a likelihood or log likelihood.
  • Likelihood optimization is often very challenging.
  • We may not be able to obtain analytical expressions for the MLEs, \(\smash{\hat{\boldsymbol{\theta}}}\).
  • Numerical optimization techniques will often help us find an approximate (not exact) MLE.
  • We will need to set a tolerance level for the quality of our approximation.

Grid Search

To implement grid search, we simply evaluate the likelihood at each value in \(\smash{\Theta}\).

  • Each value in \(\smash{\Theta}\) is a candidate vector of parameter values.
  • The approximate MLE is the grid point that achieves the highest likelihood or log likelihood value.

Grid search is ineffective for large \(\smash{k}\) because the number of grid points in \(\smash{\Theta}\) grows exponentially.

  • Doubling the number of points in each dimension (for more accuracy) multiplies the total number of grid points by \(\smash{2^k}\).
  • This is called the curse of dimensionality.
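
To make the idea concrete, here is a minimal grid-search sketch in Python for a normal model with unknown mean and standard deviation; the simulated data, grid bounds, and spacing are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

# Simulated data, for illustration only
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=200)

# Grid of candidate (mu, sigma) pairs -- bounds and spacing are arbitrary here
mus = np.linspace(-1.0, 3.0, 81)
sigmas = np.linspace(0.5, 4.0, 71)

best_ll, best_theta = -np.inf, None
for mu in mus:
    for sigma in sigmas:
        ll = norm.logpdf(y, loc=mu, scale=sigma).sum()  # log likelihood at this grid point
        if ll > best_ll:
            best_ll, best_theta = ll, (mu, sigma)

print("approximate MLE:", best_theta)
```

With \(\smash{k}\) parameters and \(\smash{m}\) points per dimension, this loop evaluates \(\smash{m^k}\) grid points, which is why the approach breaks down quickly as \(\smash{k}\) grows.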

Newton’s Method

Newton’s method is an iterative root-finding algorithm that uses derivative/gradient information:

\[\smash{x^{(i+1)} = x^{(i)} - f(x^{(i)})/f'(x^{(i)}).}\]

For large \(\smash{n}\), the value \(\smash{x^{(n)}}\) approximates a root of the function: a point \(\smash{x}\) such that \(\smash{f(x) = 0}\).
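
For instance, a minimal Python sketch of the scalar update (the example function, starting point, and tolerance are illustrative choices):

```python
def newton_root(f, fprime, x0, tol=1e-10, max_iter=100):
    """Approximate a root of f using Newton's method."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / fprime(x)   # Newton update
        if abs(x_new - x) < tol:       # stop when iterates stabilize
            return x_new
        x = x_new
    return x

# Example: root of f(x) = x^2 - 2 (i.e. sqrt(2)), starting from x0 = 1
root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)  # approximately 1.41421356
```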

Newton-Raphson

Newton’s method can also be used for optimization (not just root finding).

  • Optimizing \(\smash{f}\) is the same as finding a root of its derivative, i.e. solving \(\smash{f'(x) = 0}\) (a short sketch follows below).
  • The Newton-Raphson algorithm is
\[\smash{x^{(i+1)} = x^{(i)} - f'(x^{(i)})/f''(x^{(i)}).}\]
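
A one-dimensional sketch, applying the same root-finding update to \(\smash{f'}\) (the example function is an arbitrary concave quadratic):

```python
def newton_raphson_1d(fprime, fsecond, x0, tol=1e-10, max_iter=100):
    """Find a stationary point of f by applying Newton's method to f'."""
    x = x0
    for _ in range(max_iter):
        x_new = x - fprime(x) / fsecond(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: maximize f(x) = -(x - 3)^2, so f'(x) = -2(x - 3) and f''(x) = -2
xhat = newton_raphson_1d(lambda x: -2 * (x - 3), lambda x: -2.0, x0=0.0)
print(xhat)  # approximately 3.0
```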

Newton-Raphson

Define

\[\begin{split}\begin{align*} \boldsymbol{g}(\boldsymbol{\theta}) & = \nabla \ell(\boldsymbol{\theta}) = \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\ \mathcal{H}(\boldsymbol{\theta}) & = \nabla^2 \ell(\boldsymbol{\theta}) = \nabla \boldsymbol{g}(\boldsymbol{\theta}) = \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^T}, \end{align*}\end{split}\]

where \(\smash{\mathcal{H}(\boldsymbol{\theta})}\) is negative definite (the log likelihood is concave):

\[\smash{\boldsymbol{x}^T \mathcal{H}(\boldsymbol{\theta}) \boldsymbol{x} < 0 \,\,\,\, \forall \boldsymbol{x} \in \mathbb{R}^k, \, \boldsymbol{x} \neq \boldsymbol{0}.}\]

Newton-Raphson

We approximate \(\smash{\ell(\boldsymbol{\theta})}\) with a second-order Taylor expansion around \(\smash{\boldsymbol{\theta}^{(0)}}\):

\[\smash{\tilde{\ell}(\boldsymbol{\theta}) = \ell(\boldsymbol{\theta}^{(0)}) + \boldsymbol{g}(\boldsymbol{\theta}^{(0)})^T (\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}) + \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)})^T \mathcal{H}(\boldsymbol{\theta}^{(0)}) (\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}).}\]

The Newton-Raphson method chooses \(\smash{\boldsymbol{\theta}^{(1)}}\) to maximize \(\smash{\tilde{\ell}(\boldsymbol{\theta})}\):

\[\begin{split}\begin{gather*} \nabla \tilde{\ell}(\boldsymbol{\theta})\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(1)}} = \boldsymbol{g}(\boldsymbol{\theta}^{(0)}) + \mathcal{H}(\boldsymbol{\theta}^{(0)}) (\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}) = 0. \\ \implies \boldsymbol{\theta}^{(1)} = \boldsymbol{\theta}^{(0)} - \mathcal{H}(\boldsymbol{\theta}^{(0)})^{-1} \boldsymbol{g}(\boldsymbol{\theta}^{(0)}). \end{gather*}\end{split}\]

Newton-Raphson

The Newton-Raphson method begins with \(\smash{\boldsymbol{\theta}^{(0)}}\) and iteratively computes

\[\smash{\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} - \mathcal{H}(\boldsymbol{\theta}^{(i)})^{-1} \boldsymbol{g}(\boldsymbol{\theta}^{(i)})}\]

until \(\smash{||\boldsymbol{\theta}^{(i+1)} - \boldsymbol{\theta}^{(i)}|| < \tau}\), where \(\smash{\tau}\) is some tolerance level.
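
A minimal multivariate sketch, assuming the user supplies \(\smash{\boldsymbol{g}}\) and \(\smash{\mathcal{H}}\) as functions (the concave quadratic example below is purely illustrative):

```python
import numpy as np

def newton_raphson(grad, hess, theta0, tol=1e-8, max_iter=100):
    """Iterate theta <- theta - H(theta)^{-1} g(theta) until the step is smaller than tol."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(theta), grad(theta))  # H^{-1} g without explicit inversion
        theta_new = theta - step
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Illustrative concave quadratic: l(theta) = -(theta - m)^T A (theta - m) / 2, with A positive definite
A = np.array([[2.0, 0.5], [0.5, 1.0]])
m = np.array([1.0, -2.0])
grad = lambda th: -A @ (th - m)   # g(theta)
hess = lambda th: -A              # H(theta), negative definite
print(newton_raphson(grad, hess, theta0=[0.0, 0.0]))  # converges to m in one step
```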

Newton-Raphson

  • Newton-Raphson converges quickly if the log likelihood is concave and the initial guess is sufficiently close to the optimum.
  • A modified version of Newton-Raphson computes:
\[\smash{\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} - s \mathcal{H}(\boldsymbol{\theta}^{(i)})^{-1} \boldsymbol{g}(\boldsymbol{\theta}^{(i)})}\]

for various values of \(\smash{s}\) at each iteration and chooses the \(\smash{\boldsymbol{\theta}^{(i+1)}}\) that yields the largest likelihood value (a short sketch follows below).
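
A sketch of this step-search variant, trying a small fixed set of candidate values of \(\smash{s}\) at each iteration and keeping the one with the highest log likelihood (the candidate step sizes are arbitrary):

```python
import numpy as np

def newton_raphson_step_search(loglik, grad, hess, theta0,
                               step_sizes=(1.0, 0.5, 0.25, 0.1),
                               tol=1e-8, max_iter=100):
    """Newton-Raphson where each iteration tries several values of s."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        direction = np.linalg.solve(hess(theta), grad(theta))
        # Evaluate the log likelihood at each candidate step and keep the best
        candidates = [theta - s * direction for s in step_sizes]
        theta_new = max(candidates, key=loglik)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```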

Quasi Newton-Raphson

Various modified Newton-Raphson methods have been proposed which substitute other negative definite matrices for \(\smash{\mathcal{H}(\boldsymbol{\theta}^{(i)})^{-1}}\).

  • These are useful when \(\smash{\mathcal{H}(\boldsymbol{\theta}^{(i)})}\) is difficult to compute or to invert.
  • Typically these are slower but more robust.
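
In practice one typically relies on a library implementation of a quasi-Newton method such as BFGS. A sketch using scipy, which minimizes the negative log likelihood of a normal model (the simulated data and starting values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(loc=1.0, scale=2.0, size=200)

def neg_loglik(theta):
    # Parameterize sigma on the log scale to keep it positive
    mu, log_sigma = theta
    return -norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)
```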

Numerical Differentiation

If analytical derivatives are not possible, numerical derivatives are an option.

  • The \(\smash{i}\)th element of \(\smash{\boldsymbol{g}(\boldsymbol{\theta})}\) can be approximated with:
\[\smash{g_i(\boldsymbol{\theta}) \approx \frac{1}{\Delta} \left(\ell(\theta_1,\ldots,\theta_i+\Delta,\ldots,\theta_k) - \ell(\theta_1,\ldots,\theta_i,\ldots,\theta_k)\right)},\]

for some small \(\smash{\Delta}\).

  • The Hessian can be computed numerically from \(\smash{\boldsymbol{g}(\boldsymbol{\theta})}\) in a similar manner.
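
A forward-difference sketch of the gradient approximation above (the function name `numerical_gradient` and the default value of \(\smash{\Delta}\) are illustrative choices):

```python
import numpy as np

def numerical_gradient(loglik, theta, delta=1e-6):
    """Forward-difference approximation to g(theta), the gradient of the log likelihood."""
    theta = np.asarray(theta, dtype=float)
    base = loglik(theta)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        bumped = theta.copy()
        bumped[i] += delta                          # perturb only the i-th element
        grad[i] = (loglik(bumped) - base) / delta   # finite-difference quotient
    return grad

# The Hessian can be approximated the same way, by differencing numerical_gradient itself.
```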