Numerical Optimization¶
Numerical Maximum Likelihood¶
Given \(\smash{\boldsymbol{\theta}}\) and \(\smash{\boldsymbol{y}}\), suppose we can compute the value of a likelihood or log likelihood.
- Likelihood optimization is often very challenging.
- We may not be able to obtain analytical expressions for the MLEs, \(\smash{\hat{\boldsymbol{\theta}}}\).
- Numerical optimization techniques will often help us find an approximate (not exact) MLE.
- We will need to set a tolerance level for the quality of our approximation.
Grid Search¶
Let \(\smash{\boldsymbol{\theta} \in \mathbb{R}^k}\).
- We can define a univariate grid of \(\smash{m_i}\) points \(\smash{\theta_i \in \Theta^{(i)} = \{\theta_{i,1}, \ldots, \theta_{i,m_i}\}}\) for \(\smash{i=1,\ldots,k}\).
- Define \(\smash{\Theta = \Theta^{(1)} \otimes \Theta^{(2)} \otimes \cdots \otimes \Theta^{(k)}}\), which is the Cartesian product of the \(\smash{k}\) univariate grids (see the sketch after this list).
- Often such grids are equally spaced, but this is certainly not required.
- Choosing the locations of the grid points well is an extremely important way to improve numerical efficiency.
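As an illustration, here is a minimal sketch (using NumPy and the standard library) of constructing \(\smash{\Theta}\) as the Cartesian product of univariate grids; the parameter names, grid ranges, and grid sizes are arbitrary choices for a hypothetical \(\smash{k=2}\) problem.

```python
import itertools

import numpy as np

# Univariate grids for a hypothetical k = 2 dimensional parameter.
theta_1_grid = np.linspace(-0.9, 0.9, 19)  # 19 equally spaced points
theta_2_grid = np.linspace(0.1, 2.0, 20)   # 20 equally spaced points

# Theta is the Cartesian product of the univariate grids.
Theta = list(itertools.product(theta_1_grid, theta_2_grid))

print(len(Theta))  # 19 * 20 = 380 candidate parameter vectors
```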
Grid Search¶
To implement grid search, we simply evaluate the likelihood at each value in \(\smash{\Theta}\).
- Each point in \(\smash{\Theta}\) is a candidate vector of parameter values.
- The approximated MLE is the point that achieves the highest likelihood or log likelihood value.
Grid search is ineffective for large \(\smash{k}\) because the number of grid points in \(\smash{\Theta}\) grows exponentially.
- Doubling the number of points in each dimension (for more accuracy) multiplies the total number of grid points by \(\smash{2^k}\).
- This is called the curse of dimensionality.
\(\smash{AR(1)}\) Grid Search¶
Suppose \(\smash{c=0}\) and \(\smash{\sigma^2 = 1}\).
- In this case, \(\smash{\boldsymbol{\theta} = \phi}\) and \(\smash{k=1}\).
- Under stationarity, we know \(\smash{|\phi| < 1}\), so we might define an equally-spaced grid of values \(\smash{\{-0.99,-0.98,\ldots, 0.98,0.99\}}\).
- Given data \(\smash{\boldsymbol{y}}\), we can compute the exact or conditional likelihood for each \(\smash{\phi_i}\) in the grid.
- The \(\smash{\phi_i}\) that results in the highest likelihood value is the approximate MLE, which we denote \(\smash{\phi^*}\).
- We can iteratively refine the grid around \(\smash{\phi^*}\) until our tolerance is reached, as in the sketch below.
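Below is a minimal sketch of this procedure, assuming simulated data and the conditional (on the first observation) Gaussian log likelihood; the grid size, refinement rule, and tolerance are illustrative choices.

```python
import numpy as np

def ar1_conditional_loglik(phi, y):
    """Conditional log likelihood of an AR(1) with c = 0 and sigma^2 = 1."""
    resid = y[1:] - phi * y[:-1]
    return -0.5 * (len(y) - 1) * np.log(2 * np.pi) - 0.5 * np.sum(resid**2)

def grid_search_ar1(y, lo=-0.99, hi=0.99, m=199, tol=1e-6):
    """Grid search for phi, iteratively refining the grid around the best point."""
    while True:
        grid = np.linspace(lo, hi, m)
        loglik = np.array([ar1_conditional_loglik(phi, y) for phi in grid])
        phi_star = grid[np.argmax(loglik)]
        spacing = grid[1] - grid[0]
        if spacing < tol:
            return phi_star
        lo, hi = phi_star - spacing, phi_star + spacing  # refine around phi_star

# Example usage with data simulated from an AR(1) with phi = 0.5.
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.5 * y[t - 1] + rng.standard_normal()
print(grid_search_ar1(y))
```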
Binary Search¶
For univariate problems, binary search is an optimization method that is far more efficient than grid search.
- It can only be used if the criterion function is concave.
- The algorithm (sketched in code after this list) is
- Pick two adjacent points \(\smash{\theta_j}\) and \(\smash{\theta_{j+1}}\) in the middle of the grid and evaluate the likelihood.
- If \(\smash{\mathcal{L}(\theta_{j+1}) < \mathcal{L}(\theta_j)}\), set the upper bound of the grid to be \(\smash{\theta_{j+1}}\) and otherwise set the lower bound to be \(\smash{\theta_j}\).
- If the lower and upper bounds are separated by more than one grid point, return to step 1. Otherwise, stop.
- Golden search is similar to binary search. See Heer and Maussner (2009) for details.
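A minimal sketch of binary search over a univariate grid, assuming the criterion function supplied is concave; the quadratic criterion in the usage example is purely illustrative.

```python
import numpy as np

def binary_search(loglik, grid):
    """Locate the maximizer of a concave criterion over a sorted grid."""
    lo, hi = 0, len(grid) - 1
    # Stop once the bounds are separated by at most one grid point.
    while hi - lo > 2:
        j = (lo + hi) // 2          # two adjacent points in the middle of the bracket
        if loglik(grid[j + 1]) < loglik(grid[j]):
            hi = j + 1              # the maximizer cannot lie above grid[j + 1]
        else:
            lo = j                  # the maximizer cannot lie below grid[j]
    # At most three candidates remain; return the best of them.
    candidates = grid[lo:hi + 1]
    return candidates[np.argmax([loglik(theta) for theta in candidates])]

# Example: a concave criterion maximized at theta = 0.3.
grid = np.linspace(-0.99, 0.99, 199)
print(binary_search(lambda theta: -(theta - 0.3) ** 2, grid))
```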
Newton’s Method¶
Newton’s method is an iterative root-finding algorithm that uses derivative/gradient information:
\[x^{(n+1)} = x^{(n)} - \frac{f(x^{(n)})}{f'(x^{(n)})}.\]
The value \(\smash{x^{(n)}}\) for large \(\smash{n}\) is an approximation of the root of the function, i.e. the value \(\smash{x}\) such that \(\smash{f(x) = 0}\).
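A minimal sketch of the iteration, assuming the derivative \(\smash{f'}\) is available in closed form; the starting value, tolerance, and example function are illustrative.

```python
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Find a root of f by Newton's method, starting from x0."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton's method did not converge")

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2) ~ 1.4142.
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))
```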
Newton-Raphson¶
Newton’s method can also be used for optimization (not just root finding).
- Optimization is the same as root finding for the derivative function.
- The Newton-Raphson algorithm is described below.
Newton-Raphson¶
Define
\[\boldsymbol{g}(\boldsymbol{\theta}) = \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \qquad \text{and} \qquad \mathcal{H}(\boldsymbol{\theta}) = -\frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \, \partial \boldsymbol{\theta}'},\]
where \(\smash{\mathcal{H}(\boldsymbol{\theta})}\) is positive definite.
Newton-Raphson¶
We approximate \(\smash{\ell(\boldsymbol{\theta})}\) with a second-order Taylor expansion around \(\smash{\boldsymbol{\theta}^{(0)}}\):
\[\ell(\boldsymbol{\theta}) \approx \tilde{\ell}(\boldsymbol{\theta}) = \ell(\boldsymbol{\theta}^{(0)}) + \boldsymbol{g}(\boldsymbol{\theta}^{(0)})'(\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}) - \frac{1}{2}(\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)})'\mathcal{H}(\boldsymbol{\theta}^{(0)})(\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}).\]
The Newton-Raphson method chooses \(\smash{\boldsymbol{\theta}^{(1)}}\) to maximize \(\smash{\tilde{\ell}(\boldsymbol{\theta})}\); setting the gradient of \(\smash{\tilde{\ell}(\boldsymbol{\theta})}\) to zero gives
\[\boldsymbol{\theta}^{(1)} = \boldsymbol{\theta}^{(0)} + \mathcal{H}(\boldsymbol{\theta}^{(0)})^{-1}\boldsymbol{g}(\boldsymbol{\theta}^{(0)}).\]
Newton-Raphson¶
The Newton-Raphson method begins with \(\smash{\boldsymbol{\theta}^{(0)}}\) and iteratively computes
\[\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} + \mathcal{H}(\boldsymbol{\theta}^{(i)})^{-1}\boldsymbol{g}(\boldsymbol{\theta}^{(i)})\]
until \(\smash{||\boldsymbol{\theta}^{(i+1)} - \boldsymbol{\theta}^{(i)}|| < \tau}\), where \(\smash{\tau}\) is some tolerance level.
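A minimal sketch of the iteration, assuming functions returning \(\smash{\boldsymbol{g}(\boldsymbol{\theta})}\) and \(\smash{\mathcal{H}(\boldsymbol{\theta})}\) are supplied; the quadratic example log likelihood, starting value, and tolerance are illustrative.

```python
import numpy as np

def newton_raphson(grad, neg_hess, theta0, tol=1e-8, max_iter=100):
    """Maximize a log likelihood given its gradient g and the matrix H."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(neg_hess(theta), grad(theta))  # H^{-1} g
        theta_new = theta + step
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    raise RuntimeError("Newton-Raphson did not converge")

# Example: ell(theta) = -(theta_1 - 1)^2 - 2 (theta_2 + 0.5)^2, maximized at (1, -0.5).
grad = lambda th: np.array([-2 * (th[0] - 1), -4 * (th[1] + 0.5)])
neg_hess = lambda th: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton_raphson(grad, neg_hess, theta0=[0.0, 0.0]))
```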
Newton-Raphson¶
- Newton-Raphson converges fast if the likelihood is concave and the initial guess is good enough.
- A modified version of Newton-Raphson computes
\[\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} + s\,\mathcal{H}(\boldsymbol{\theta}^{(i)})^{-1}\boldsymbol{g}(\boldsymbol{\theta}^{(i)})\]
for various values of \(\smash{s}\) at each iteration and chooses the \(\smash{\boldsymbol{\theta}^{(i+1)}}\) that yields the largest likelihood value (see the sketch below).
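A sketch of one modified update, trying several illustrative values of \(\smash{s}\) and keeping the candidate with the largest log likelihood; the example functions are the same hypothetical quadratic ones used above.

```python
import numpy as np

def modified_newton_step(loglik, grad, neg_hess, theta, s_values=(0.25, 0.5, 1.0, 2.0)):
    """One modified Newton-Raphson update with a crude search over step sizes s."""
    direction = np.linalg.solve(neg_hess(theta), grad(theta))  # H^{-1} g
    candidates = [theta + s * direction for s in s_values]
    return max(candidates, key=loglik)  # keep the candidate with the highest log likelihood

# Example with the quadratic log likelihood from the previous sketch.
loglik = lambda th: -(th[0] - 1) ** 2 - 2 * (th[1] + 0.5) ** 2
grad = lambda th: np.array([-2 * (th[0] - 1), -4 * (th[1] + 0.5)])
neg_hess = lambda th: np.array([[2.0, 0.0], [0.0, 4.0]])
print(modified_newton_step(loglik, grad, neg_hess, np.array([0.0, 0.0])))
```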
Quasi Newton-Raphson¶
Various modified Newton-Raphson methods have been proposed which substitute other positive definite matrices for \(\smash{\mathcal{H}(\boldsymbol{\theta}^{(i)})^{-1}}\).
- These are useful if \(\smash{\mathcal{H}(\boldsymbol{\theta}^{(i)})}\) is difficult to compute or to invert.
- Typically these methods are slower but more robust (one example is sketched below).
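One widely used quasi-Newton method is BFGS, which builds up an approximation to the inverse Hessian from successive gradient evaluations. A minimal sketch using SciPy's implementation (assuming SciPy is available), applied to the negative of the illustrative quadratic log likelihood used above:

```python
import numpy as np
from scipy.optimize import minimize

# Minimizing the negative log likelihood is equivalent to maximizing the log likelihood.
neg_loglik = lambda th: (th[0] - 1) ** 2 + 2 * (th[1] + 0.5) ** 2

result = minimize(neg_loglik, x0=np.zeros(2), method="BFGS")
print(result.x)  # approximate MLE, close to (1, -0.5)
```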
Numerical Differentiation¶
If analytical derivatives are not available, numerical derivatives are an option.
- The \(\smash{i}\)th element of \(\smash{\boldsymbol{g}(\boldsymbol{\theta})}\) can be approximated with
\[g_i(\boldsymbol{\theta}) \approx \frac{\ell(\boldsymbol{\theta} + \Delta\,\boldsymbol{e}_i) - \ell(\boldsymbol{\theta})}{\Delta},\]
where \(\smash{\boldsymbol{e}_i}\) is a vector with a one in the \(\smash{i}\)th position and zeros elsewhere, for some small \(\smash{\Delta}\) (see the sketch below).
- The Hessian can be computed numerically from \(\smash{\boldsymbol{g}(\boldsymbol{\theta})}\) in a similar manner.
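A minimal sketch of the forward-difference approximation to \(\smash{\boldsymbol{g}(\boldsymbol{\theta})}\); the step size and the example log likelihood are illustrative choices (a central difference is often more accurate).

```python
import numpy as np

def numerical_gradient(loglik, theta, delta=1e-6):
    """Approximate g(theta) by forward differences, one element at a time."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    base = loglik(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = delta               # perturb only the i-th element
        grad[i] = (loglik(theta + e_i) - base) / delta
    return grad

# Example with the quadratic log likelihood used above; gradient at (0, 0) is (2, -2).
loglik = lambda th: -(th[0] - 1) ** 2 - 2 * (th[1] + 0.5) ** 2
print(numerical_gradient(loglik, [0.0, 0.0]))
```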