The recommended method of obtaining the software is by cloning the repository with Git. Otherwise, a ZIP archive can be downloaded from the repository page.
The repository contains several different implementations of the problem described in the next section:
The side-by-side implementations allow users to learn the basics of massively parallel programming by comparing code with familiar benchmarks (Matlab or sequential C++).
Aldrich et al. (2011) solves a neoclassical growth model wih value function iteration (VFI) on a GPU. The basic problem is summarized by the optimization problem \begin{equation} V(K,Z) = \max_c \left\{ \frac{C^{1-\gamma}}{1-\gamma} + \beta \mathbb{E}[V(K',Z')|Z]\right\} \label{bellman} \end{equation} subject to \[\begin{align} K' & = Z K^{\alpha} + (1-\delta) K - C \label{rc} \\ \log(Z') & = \rho \log(Z) + \varepsilon, \text{ where } \varepsilon \sim \mathcal{N}(0,\sigma^2). \label{tfp} \end{align}\] The state variables in this problem are capital, \(K\), and total factor of productivity (TFP), \(Z\).
The VFI software solves the problem with the following algorithm.
A basic implementation would involve computing the quantities in Equations \eqref{cons} - \eqref{continuation} in a serial fashion for each value of \(K' \in \mathcal{K}\) in the loop at Step 8. If either \(\mathcal{K}\) or \(\mathcal{Z}\) is a very dense grid, Step 8 may involve many thousands of serial calculations for each of the values in the loops at Steps 6 and 7.
Alternatively, with many parallel processors available, the loops at Steps 6 and 7 could be eliminated and the sequence of instructions nested in Step 8 could be assigned to an individual processor and computed in parallel for each pair \((K,Z)\). The reason that parallelism can be exploited in this problem is that the computations nested within Steps 6 and 7 depend only on the concurrent \((K,Z)\) and not on other values in \(\mathcal{K}\) and \(\mathcal{Z}\).
This section reports timing results for solving the model on a 4U rackmount server with a single quad-core Intel Xeon 2.4 GHz CPU and two NVIDIA Tesla C2075 GPUs. Only one of the GPUs was utilized for computation of the Thrust/GPU and CUDA C implementations and all four cores of the CPU were used for the Thrust/OpenMP implementation. The model was solved using the parameter values below.
\begin{array}{cccccc} \hline \beta & \gamma & \alpha & \delta & \rho & \sigma \\ \hline 0.984 & 2 & 0.35 & 0.01 & 0.95 & 0.005 \\ \hline \end{array}
The following two tables report times for the implementations above. \(N_z = 4\) and \(N_k\) was incremented to assess the relative performance of the GPU for an increasingly dense grid of state space values. All results are computed in double precision and ratios are relative to C++ times. The first table reports results using a binary search algorithm for maximization, while the second table reports results for a naive grid search. The grid search method was able to exploit policy function iteration, only performing the maximization on the right-hand side of Equation \eqref{bellman} every 20 iterations of the value function. This was not possible for the binary search algorithm, since it did not preserve the concavity of the value function, which is crucial for binary search. For more details regarding the algorithms and solutions, see Aldrich et al. (2011).
\begin{array}{l|cccccccccc} \hline N_k & 128 & 256 & 512 & 1,024 & 2,048 & 4,096 & 8,192 & 16,384 & 32,768 & 65,536 \\ \hline \text{C++} & 0.547 & 1.35 & 3.41 & 9.05 & 25.73 & 84.58 & 297.32 & 1,114.95 & 4,653.81 & 19,421.90 \\ \hline \text{Matlab} & 48.97 & 98.36 & 203.18 & 426.57 & 920.10 & 2,077.24 & 5,020.26 & 16,129.32 & 45,070.47 & 140,341.10 \\ \text{Matlab Ratio} & 89.55 & 72.67 & 59.50 & 47.13 & 35.76 & 24.56 & 16.89 & 14.47 & 9.68 & 7.23 \\ \hline \text{Thrust/OpenMP} & 0.118 & 0.241 & 0.519 & 1.10 & 2.37 & 5.04 & 10.81 & 23.10 & 49.53 & 106.66 \\ \text{Thrust/OpenMP Ratio} & 0.217 & 0.178 & 0.152 & 0.121 & 0.0920 & 0.0595 & 0.0364 & 0.0207 & 0.0106 & 0.00549 \\ \hline \text{Thrust/CUDA Start} & 6.75 & 6.70 & 6.74 & 6.64 & 6.72 & 6.62 & 6.72 & 6.75 & 6.66 & 6.75 \\ \text{Thrust/CUDA Solution} & 0.240 & 0.273 & 0.322 & 0.392 & 0.711 & 1.20 & 2.13 & 4.26 & 8.52 & 17.21 \\ \text{Thrust/CUDA Total} & 6.99 & 6.97 & 7.06 & 7.03 & 7.43 & 7.82 & 8.84 & 11.01 & 15.19 & 23.96 \\ \text{Thrust/CUDA Solution Ratio} & 0.439 & 0.202 & 0.0943 & 0.0433 & 0.0276 & 0.0141 & 0.00715 & 0.00382 & 0.00183 & 0.000886 \\ \text{Thrust/CUDA Total Ratio} & 12.78 & 5.15 & 2.07 & 0.777 & 0.289 & 0.0924 & 0.0297 & 0.00987 & 0.00326 & 0.00123 \\ \hline \text{CUDA C Start} & 6.97 & 6.95 & 6.90 & 6.97 & 6.94 & 6.96 & 6.96 & 6.95 & 6.94 & 6.92 \\ \text{CUDA C Solution} & 0.144 & 0.156 & 0.258 & 0.416 & 0.731 & 1.51 & 3.09 & 6.48 & 13.82 & 29.27 \\ \text{CUDA C Total} & 7.12 & 7.11 & 7.16 & 7.38 & 7.67 & 8.47 & 10.05 & 13.42 & 20.76 & 36.19 \\ \text{CUDA C Solution Ratio} & 0.263 & 0.115 & 0.076 & 0.0460 & 0.0284 & 0.0178 & 0.0104 & 0.00581 & 0.00297 & 0.00151 \\ \text{CUDA C Total Ratio} & 13.01 & 5.25 & 2.10 & 0.816 & 0.298 & 0.100 & 0.0338 & 0.0120 & 0.00446 & 0.00186 \\ \hline \end{array}
\begin{array}{l|ccccccccccccc} \hline N_k & 128 & 256 & 512 & 1,024 & 2,048 & 4,096 & 8,192 & 16,384 & 32,768 & 65,536 \\ \hline \text{C++} & 0.137 & 0.332 & 0.900 & 2.73 & 9.17 & 33.40 & 126.72 & 496.24 & 1,977.93 & 7,892.46 \\ \hline \text{Matlab} & 11.18 & 22.43 & 45.23 & 93.67 & 204.50 & 476.54 & 1,212.05 & 3,587.33 & 11,189.66 & 37,720.38 \\ \text{Matlab Ratio} & 81.42 & 67.59 & 50.26 & 34.28 & 22.29 & 14.27 & 9.56 & 7.23 & 5.66 & 4.78 \\ \hline \text{Thrust/OpenMP} & 0.0607 & 0.156 & 0.490 & 1.70 & 6.37 & 24.64 & 96.95 & 383.30 & 1,524.40 & 6,081.77 \\ \text{Thrust/OpenMP Ratio} & 0.442 & 0.471 & 0.544 & 0.621 & 0.694 & 0.738 & 0.765 & 0.772 & 0.771 & 0.771 \\ \hline \text{Thrust/CUDA Start} & 6.73 & 6.73 & 6.69 & 6.66 & 6.72 & 6.78 & 6.77 & 6.75 & 6.74 & 6.78 \\ \text{Thrust/CUDA Solution} & 0.173 & 0.234 & 0.348 & 0.628 & 1.53 & 4.23 & 13.71 & 53.33 & 191.99 & 774.87 \\ \text{Thrust/CUDA Total} & 6.90 & 6.97 & 7.04 & 7.29 & 8.26 & 11.02 & 20.48 & 60.08 & 198.73 & 781.65 \\ \text{Thrust/CUDA Solution Ratio} & 1.26 & 0.704 & 0.387 & 0.230 & 0.167 & 0.127 & 0.108 & 0.107 & 0.0971 & 0.0982 \\ \text{Thrust/CUDA Total Ratio} & 50.26 & 20.99 & 7.83 & 2.67 & 0.900 & 0.330 & 0.162 & 0.121 & 0.100 & 0.0990 \\ \hline \text{CUDA C Start} & 6.94 & 6.93 & 6.95 & 6.96 & 6.81 & 6.90 & 6.92 & 6.85 & 6.97 & 6.99 \\ \text{CUDA C Solution} & 0.103 & 0.134 & 0.257 & 0.664 & 2.12 & 7.81 & 29.76 & 116.80 & 462.31 & 1,844.37 \\ \text{CUDA C Total} & 7.04 & 7.07 & 7.20 & 7.62 & 8.93 & 14.71 & 36.68 & 123.65 & 469.28 & 1,851.36 \\ \text{CUDA C Solution Ratio} & 0.750 & 0.404 & 0.286 & 0.243 & 0.231 & 0.234 & 0.235 & 0.235 & 0.234 & 0.234 \\ \text{CUDA C Total Ratio} & 51.26 & 21.30 & 8.01 & 2.79 & 0.974 & 0.440 & 0.289 & 0.249 & 0.237 & 0.235 \\ \hline \end{array}