==============================================================================
Numerical Optimization
==============================================================================

Numerical Maximum Likelihood
==============================================================================

Given :math:`\smash{\boldsymbol{\theta}}` and :math:`\smash{\boldsymbol{y}}`, suppose we can compute the value of a likelihood or log likelihood.

- Likelihood optimization is often very challenging.

- We may not be able to obtain analytical expressions for the MLEs, :math:`\smash{\hat{\boldsymbol{\theta}}}`.

- Numerical optimization techniques will often help us find an approximate (not exact) MLE.

- We will need to set a tolerance level for the quality of our approximation.

Grid Search
==============================================================================

Let :math:`\smash{\boldsymbol{\theta} \in \mathbb{R}^k}`.

- We can define a univariate grid of :math:`\smash{m_i}` points :math:`\smash{\theta_i \in \Theta^{(i)} = \{\theta_{i,1}, \ldots, \theta_{i,m_i}\}}` for :math:`\smash{i=1,\ldots,k}`.

- Define :math:`\smash{\Theta = \Theta^{(1)} \otimes \Theta^{(2)} \otimes \cdots \otimes \Theta^{(k)}}`, the Cartesian product of the :math:`\smash{k}` univariate grids.

- Often such grids are equally spaced, but this is not required.

- Choosing the locations of the grid points well is an important way to improve numerical efficiency.

Grid Search
==============================================================================

To implement grid search, we simply evaluate the likelihood at each value in :math:`\smash{\Theta}`.

- Each value in :math:`\smash{\Theta}` defines a set of candidate parameter values.

- The approximate MLE is the point that achieves the highest likelihood or log likelihood value.

Grid search is ineffective for large :math:`\smash{k}` because the number of grid points in :math:`\smash{\Theta}` grows exponentially in :math:`\smash{k}`.

- Doubling the number of points in each dimension (for more accuracy) multiplies the total number of grid points by :math:`\smash{2^k}`.

- This is called the curse of dimensionality.

:math:`\smash{AR(1)}` Grid Search
==============================================================================

Suppose :math:`\smash{c=0}` and :math:`\smash{\sigma^2 = 1}`.

- In this case, :math:`\smash{\boldsymbol{\theta} = \phi}` and :math:`\smash{k=1}`.

- Under stationarity, we know :math:`\smash{|\phi| < 1}`, so we might define an equally-spaced grid of values :math:`\smash{\{-0.99, -0.98, \ldots, 0.98, 0.99\}}`.

- Given data :math:`\smash{\boldsymbol{y}}`, we can compute the exact or conditional likelihood for each :math:`\smash{\phi_i}` in the grid.

- The :math:`\smash{\phi_i}` that results in the highest likelihood value is the approximate MLE, which we denote :math:`\smash{\phi^*}`.

- We can iteratively refine the grid around :math:`\smash{\phi^*}` until our tolerance is reached, as sketched below.
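The following is a minimal sketch of this procedure in Python/NumPy, assuming an :math:`\smash{AR(1)}` with :math:`\smash{c=0}` and :math:`\smash{\sigma^2=1}` and the conditional (on the first observation) log likelihood. The names ``conditional_loglik`` and ``grid_search_ar1``, the refinement rule, and the simulated data are illustrative choices, not part of the notes.

.. code-block:: python

   import numpy as np

   def conditional_loglik(phi, y):
       """Conditional log likelihood of an AR(1) with c = 0 and sigma^2 = 1,
       treating the first observation as fixed (illustrative helper)."""
       resid = y[1:] - phi * y[:-1]           # one-step-ahead forecast errors
       n = resid.size
       return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum(resid ** 2)

   def grid_search_ar1(y, lo=-0.99, hi=0.99, m=199, tol=1e-6):
       """Grid search for phi, iteratively refining the grid around the best point."""
       while True:
           grid = np.linspace(lo, hi, m)
           loglik = np.array([conditional_loglik(phi, y) for phi in grid])
           best = grid[np.argmax(loglik)]      # phi* on the current grid
           spacing = grid[1] - grid[0]
           if spacing < tol:                   # grid is finer than the tolerance: stop
               return best
           # Refine: new grid spans one old spacing on either side of phi*,
           # clamped to the stationary region (-1, 1).
           lo = max(best - spacing, -0.999)
           hi = min(best + spacing, 0.999)

   # Example usage with simulated data (true phi = 0.5):
   rng = np.random.default_rng(0)
   y = np.zeros(500)
   for t in range(1, 500):
       y[t] = 0.5 * y[t - 1] + rng.standard_normal()
   print(grid_search_ar1(y))

Since the constant term of the log likelihood does not depend on :math:`\smash{\phi}`, dropping it would not change the maximizer.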
Binary Search
==============================================================================

Binary search is an optimization method that is far more efficient than grid search for *univariate* problems.

- It can only be used if the criterion function is concave.

- The algorithm is:

  1. Pick two adjacent points :math:`\smash{\theta_j}` and :math:`\smash{\theta_{j+1}}` in the middle of the grid and evaluate the likelihood at each.

  2. If :math:`\smash{\mathcal{L}(\theta_{j+1}) < \mathcal{L}(\theta_j)}`, set the upper bound of the grid to :math:`\smash{\theta_{j+1}}`; otherwise, set the lower bound to :math:`\smash{\theta_j}`.

  3. If the lower and upper bounds are separated by more than one grid point, return to step 1. Otherwise, stop.

- Golden section search is similar to binary search. See Heer and Maussner (2009) for details.

Newton's Method
==============================================================================

Newton's method is an iterative root-finding algorithm that uses derivative/gradient information:

.. math:: \smash{x^{(i+1)} = x^{(i)} - f(x^{(i)})/f'(x^{(i)}).}

For large :math:`\smash{n}`, the value :math:`\smash{x^{(n)}}` is an approximation of a root of the function, i.e. a point :math:`\smash{x}` such that :math:`\smash{f(x) = 0}`.

Newton-Raphson
==============================================================================

Newton's method can also be used for optimization (not just root finding).

- Optimization is the same as root finding applied to the derivative function.

- The Newton-Raphson algorithm is

  .. math:: \smash{x^{(i+1)} = x^{(i)} - f'(x^{(i)})/f''(x^{(i)}).}

Newton-Raphson
==============================================================================

Define

.. math::

   \begin{align*}
   \boldsymbol{g}(\boldsymbol{\theta}) & = \nabla \ell(\boldsymbol{\theta}) = \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \\
   \mathcal{H}(\boldsymbol{\theta}) & = \nabla^2 \ell(\boldsymbol{\theta}) = \nabla \boldsymbol{g}(\boldsymbol{\theta}) = \frac{\partial^2 \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \, \partial \boldsymbol{\theta}^T},
   \end{align*}

where :math:`\smash{\mathcal{H}(\boldsymbol{\theta})}` is negative definite:

.. math:: \smash{\boldsymbol{x}^T \mathcal{H}(\boldsymbol{\theta}) \boldsymbol{x} < 0 \,\,\,\, \forall \boldsymbol{x} \in \mathbb{R}^k, \, \boldsymbol{x} \neq \boldsymbol{0}.}

Newton-Raphson
==============================================================================

We approximate :math:`\smash{\ell(\boldsymbol{\theta})}` with a second-order Taylor expansion around :math:`\smash{\boldsymbol{\theta}^{(0)}}`:

.. math::

   \smash{\tilde{\ell}(\boldsymbol{\theta}) = \ell(\boldsymbol{\theta}^{(0)}) + \boldsymbol{g}(\boldsymbol{\theta}^{(0)})^T (\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}) + \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)})^T \mathcal{H}(\boldsymbol{\theta}^{(0)}) (\boldsymbol{\theta} - \boldsymbol{\theta}^{(0)}).}

The Newton-Raphson method chooses :math:`\smash{\boldsymbol{\theta}^{(1)}}` to maximize :math:`\smash{\tilde{\ell}(\boldsymbol{\theta})}`:

.. math::

   \begin{gather*}
   \nabla \tilde{\ell}(\boldsymbol{\theta})\Big|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(1)}} = \boldsymbol{g}(\boldsymbol{\theta}^{(0)}) + \mathcal{H}(\boldsymbol{\theta}^{(0)}) (\boldsymbol{\theta}^{(1)} - \boldsymbol{\theta}^{(0)}) = 0 \\
   \implies \boldsymbol{\theta}^{(1)} = \boldsymbol{\theta}^{(0)} - \mathcal{H}(\boldsymbol{\theta}^{(0)})^{-1} \boldsymbol{g}(\boldsymbol{\theta}^{(0)}).
   \end{gather*}

Newton-Raphson
==============================================================================

The Newton-Raphson method begins with :math:`\smash{\boldsymbol{\theta}^{(0)}}` and iteratively computes

.. math:: \smash{\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} - \mathcal{H}(\boldsymbol{\theta}^{(i)})^{-1} \boldsymbol{g}(\boldsymbol{\theta}^{(i)})}

until :math:`\smash{||\boldsymbol{\theta}^{(i+1)} - \boldsymbol{\theta}^{(i)}|| < \tau}`, where :math:`\smash{\tau}` is some tolerance level.
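As a concrete illustration, here is a minimal sketch of this iteration in Python/NumPy. The function name ``newton_raphson``, the use of user-supplied gradient and Hessian callables, and the quadratic test function are illustrative assumptions, not part of the notes.

.. code-block:: python

   import numpy as np

   def newton_raphson(theta0, grad, hess, tol=1e-8, max_iter=100):
       """Maximize a log likelihood via Newton-Raphson.

       grad(theta) returns g(theta); hess(theta) returns the Hessian H(theta).
       Iterates theta <- theta - H(theta)^{-1} g(theta) until the step is
       smaller than tol (or max_iter is reached)."""
       theta = np.asarray(theta0, dtype=float)
       for _ in range(max_iter):
           step = np.linalg.solve(hess(theta), grad(theta))   # H^{-1} g via a linear solve
           theta_new = theta - step
           if np.linalg.norm(theta_new - theta) < tol:        # ||theta^(i+1) - theta^(i)|| < tau
               return theta_new
           theta = theta_new
       return theta

   # Example: maximize the concave function l(x, y) = -(x - 1)^2 - 2 * (y + 3)^2.
   grad = lambda t: np.array([-2 * (t[0] - 1), -4 * (t[1] + 3)])
   hess = lambda t: np.array([[-2.0, 0.0], [0.0, -4.0]])
   print(newton_raphson([0.0, 0.0], grad, hess))              # approximately [1, -3]

Solving the linear system :math:`\smash{\mathcal{H} \, \text{step} = \boldsymbol{g}}` avoids forming the inverse Hessian explicitly, which is generally cheaper and more numerically stable.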
Newton-Raphson
==============================================================================

- Newton-Raphson converges quickly if the likelihood is concave and the initial guess is good enough.

- A modified version of Newton-Raphson computes

  .. math:: \smash{\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} - s \mathcal{H}(\boldsymbol{\theta}^{(i)})^{-1} \boldsymbol{g}(\boldsymbol{\theta}^{(i)})}

  for various values of the step size :math:`\smash{s}` at each iteration and chooses the :math:`\smash{\boldsymbol{\theta}^{(i+1)}}` that yields the largest likelihood value.

Quasi Newton-Raphson
==============================================================================

Various modified Newton-Raphson methods have been proposed which substitute other negative definite matrices for :math:`\smash{\mathcal{H}(\boldsymbol{\theta}^{(i)})^{-1}}` in the update.

- These are useful when :math:`\smash{\mathcal{H}(\boldsymbol{\theta}^{(i)})}` is difficult to compute or to invert.

- Typically these methods are slower but more robust.

Numerical Differentiation
==============================================================================

If analytical derivatives are not available, numerical derivatives are an option.

- The :math:`\smash{i}`\ th element of :math:`\smash{\boldsymbol{g}(\boldsymbol{\theta})}` can be approximated with

  .. math:: \smash{g_i(\boldsymbol{\theta}) \approx \frac{1}{\Delta} \left(\ell(\theta_1,\ldots,\theta_i+\Delta,\ldots,\theta_k) - \ell(\theta_1,\ldots,\theta_i,\ldots,\theta_k)\right)}

  for some small :math:`\smash{\Delta}`.

- The Hessian can be computed numerically from :math:`\smash{\boldsymbol{g}(\boldsymbol{\theta})}` in a similar manner, as sketched below.
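Below is a minimal sketch of these forward-difference approximations in Python/NumPy. The names ``num_grad`` and ``num_hess``, the default values of :math:`\smash{\Delta}`, and the test function are illustrative assumptions, not part of the notes.

.. code-block:: python

   import numpy as np

   def num_grad(loglik, theta, delta=1e-6):
       """Forward-difference approximation of g(theta), the gradient of loglik."""
       theta = np.asarray(theta, dtype=float)
       base = loglik(theta)
       g = np.zeros_like(theta)
       for i in range(theta.size):
           bumped = theta.copy()
           bumped[i] += delta                 # perturb the i-th coordinate by Delta
           g[i] = (loglik(bumped) - base) / delta
       return g

   def num_hess(loglik, theta, delta=1e-5):
       """Hessian approximated by differencing the numerical gradient."""
       theta = np.asarray(theta, dtype=float)
       g0 = num_grad(loglik, theta, delta)
       H = np.zeros((theta.size, theta.size))
       for j in range(theta.size):
           bumped = theta.copy()
           bumped[j] += delta
           H[:, j] = (num_grad(loglik, bumped, delta) - g0) / delta
       return H

   # Example: loglik(t) = -(t[0] - 1)^2 - 2 * (t[1] + 3)^2 has gradient [0, 0] at (1, -3).
   loglik = lambda t: -(t[0] - 1) ** 2 - 2 * (t[1] + 3) ** 2
   print(num_grad(loglik, [1.0, -3.0]))
   print(num_hess(loglik, [0.0, 0.0]))       # approximately [[-2, 0], [0, -4]]

In practice the choice of :math:`\smash{\Delta}` matters: too small a value amplifies round-off error, while too large a value increases truncation error.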