Measure and Improve GPU Performance

Getting Started with GPU Benchmarking

You can use various benchmark tests in MATLAB® to measure the performance of your GPU:

  • Use gpuBench from MATLAB Central File Exchange to run a range of tests, including both memory-intensive and compute-intensive tasks in single and double precision, and to compare the performance of a display card with that of a compute card. For more information, see http://www.mathworks.com/matlabcentral/fileexchange/34080-gpubench. A sketch of running gpuBench follows this list.

  • Use the paralleldemo_gpu_bench script in Measuring GPU Performance to obtain information on your PCI bus speed, GPU memory read/write speed, and peak calculation performance for double-precision matrix calculations.
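
For example, after you download gpuBench from the File Exchange and add it to your MATLAB path, you can run it directly from the command line. This is a minimal sketch; the exact report produced depends on the gpuBench version you download.

gpuBench   % run the benchmark suite and generate a report for the selected GPU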

Improve Performance Using Single Precision Calculations

You can improve the performance of your GPU computations by performing your calculations in single precision instead of double precision. In CPU computations, on the other hand, you do not get this improvement when switching from double to single precision. The reason is that most GPU cards are designed for graphics display, which demands high single-precision performance.

Typical examples of calculations that are suitable for single-precision computation on the GPU include image processing and machine learning (see, for example, https://www.mathworks.com/deep-learning-in-the-cloud-with-matlab-white-paper). However, other types of calculations, such as linear algebra problems, typically require double-precision processing.

You can get a performance improvement of up to a factor of 50 for single-precision calculations compared to double-precision calculations, depending on the GPU card and total number of cores. High-end compute cards typically show a smaller improvement. You can determine the performance improvement of your particular GPU by using gpuBench (see http://www.mathworks.com/matlabcentral/fileexchange/34080-gpubench).

For a comprehensive performance overview of NVIDIA® GPU cards, see https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units. You can calculate the performance improvement factor between single precision and double precision as follows:

  • Find the GPU on the wiki page above.

  • Get the stated single-precision and double-precision performance values from the table. If the table does not list a double-precision GFLOPS value, assume that double precision is 24 to 32 times slower than single precision.

  • Divide the stated single-precision GFLOPS value by the double-precision GFLOPS value, as shown in the example after this list.
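
For example, for a hypothetical card with a stated single-precision performance of 9000 GFLOPS and a double-precision performance of 300 GFLOPS, the improvement factor is 9000/300 = 30. The following minimal sketch shows the same calculation in MATLAB; the GFLOPS values are assumptions for illustration only.

singleGFLOPS = 9000;   % stated single-precision performance (hypothetical)
doubleGFLOPS = 300;    % stated double-precision performance (hypothetical)
improvementFactor = singleGFLOPS/doubleGFLOPS   % factor of 30 in this example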

Note

If you have a mobile graphics card in your laptop, you can use this card for GPU computing. However, a laptop GPU is likely to be much less powerful than its desktop equivalent, so expect reduced performance.

Basic Workflow for Improving Performance

The purpose of GPU computing in MATLAB is to speed up your applications. This topic discusses fundamental concepts and practices that can help you achieve better performance on the GPU, such as the configuration of the GPU hardware and best practices within your code. It discusses the trade-off between implementation difficulty and performance, and describes the criteria you might use to choose between using gpuArray functions, arrayfun, MEX-files, or CUDA kernels. Finally, it describes how to accurately measure performance on the GPU.

When converting MATLAB code to run on the GPU, it is best to start with MATLAB code that already performs well. While the GPU and CPU have different performance characteristics, the general guidelines for writing good MATLAB code also help you write good MATLAB code for the GPU. The first step is almost always to profile your CPU code. The lines of code that the profiler shows taking the most time on the CPU will likely be ones that you must concentrate on when you code for the GPU.

It is easiest to start converting your code using MATLAB built-in functions that support gpuArray data. These functions take gpuArray inputs, perform calculations on the GPU, and return gpuArray outputs. A list of the MATLAB functions that support gpuArray data is found in Run Built-In Functions on a GPU. In general, these functions support the same arguments and data types as standard MATLAB functions that are computed on the CPU. Any limitations in these functions for gpuArrays are described in their command-line help (e.g., help gpuArray/qr).

If all the functions that you want to use are supported on the GPU, running code on the GPU may be as simple as calling gpuArray to transfer input data to the GPU, and calling gather to retrieve the output data from the GPU when finished. In many cases, you might need to vectorize your code, replacing looped scalar operations with MATLAB matrix and vector operations. While vectorizing is generally a good practice on the CPU, it is usually critical for achieving high performance on the GPU. For more information, see Vectorize for Improved GPU Performance.
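
For example, the following minimal sketch shows this basic workflow; the data and the choice of fft are arbitrary and only illustrate the pattern.

X = rand(1000,'single');   % input data in the MATLAB workspace
G = gpuArray(X);           % transfer the data to the GPU
F = fft(G);                % fft runs on the GPU because its input is a gpuArray
Y = gather(F);             % transfer the result back to the MATLAB workspace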

Advanced Tools for Improving Performance

It is possible that even after converting inputs to gpuArrays and vectorizing your code, there are operations in your algorithm that are either not built-in functions, or that are not fast enough to meet your application’s requirements. In such situations you have three main options: use arrayfun to precompile element-wise parts of your application, make use of GPU library functions, or write a custom CUDA kernel.

If you have a purely element-wise function, you can improve its performance by calling it with arrayfun. The arrayfun function on the GPU turns an element-wise MATLAB function into a custom CUDA kernel, thus reducing the overhead of performing the operation. Often, even if your entire application cannot use arrayfun, a subset of it can. The example Improve Performance of Element-wise MATLAB Functions on the GPU using ARRAYFUN shows the basic concepts of this approach, and the example Using ARRAYFUN for Monte-Carlo Simulations shows how to apply it in simulations for a finance application.
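
As a minimal sketch, the following code applies a purely element-wise anonymous function to gpuArray data using arrayfun; the function and data here are illustrative assumptions, not taken from the examples above.

x = gpuArray(rand(1e6,1));
y = gpuArray(rand(1e6,1));
% The element-wise expression is compiled into a single CUDA kernel
z = arrayfun(@(a,b) a*exp(-b) + 1,x,y);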

MATLAB provides an extensive library of GPU-enabled functions in Parallel Computing Toolbox™, Image Processing Toolbox™, Signal Processing Toolbox™, and other products. However, there are many libraries of additional functions that do not have direct built-in analogs in MATLAB’s GPU support. Examples include the NVIDIA Performance Primitives library and the CURAND library, which are included in the CUDA toolkit that ships with MATLAB. If you need to call a function in one of these libraries, you can do so using the GPU MEX interface. This interface allows you to extract the pointers to the device data from MATLAB gpuArrays so that you can pass these pointers to GPU functions. You can convert the returned values into gpuArrays for return to MATLAB. For more information see Run MEX-Functions Containing CUDA Code.
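
For example, if you have a CUDA MEX source file, you might compile and call it as follows. This is a minimal sketch; myGPUFunction.cu is a hypothetical file that uses the mxGPUArray API to operate on device data.

mexcuda myGPUFunction.cu        % compile the CUDA MEX source file
g = gpuArray(rand(1e6,1));
out = myGPUFunction(g);         % the MEX-function receives and returns gpuArray data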

Finally, you have the option of writing a custom CUDA kernel for the operation that you need. Such kernels can be directly integrated into MATLAB using the CUDAKernel object.
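
A minimal sketch of this approach, assuming a hypothetical kernel compiled from myKernel.cu into myKernel.ptx, might look like this; the thread block size and data size are illustrative choices.

k = parallel.gpu.CUDAKernel('myKernel.ptx','myKernel.cu');  % create the kernel object
k.ThreadBlockSize = [256 1 1];            % threads per block (illustrative choice)
k.GridSize = [ceil(1e6/256) 1 1];         % enough blocks to cover the data
in = gpuArray(rand(1e6,1));
out = zeros(1e6,1,'gpuArray');            % preallocate the output on the GPU
out = feval(k,out,in);                    % launch the kernel on the GPU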

The example Illustrating Three Approaches to GPU Computing: The Mandelbrot Set shows how to implement a simple calculation using three of the approaches mentioned in this section. This example begins with MATLAB code that is easily converted to run on the GPU, rewrites the code to use arrayfun for element-wise operations, and finally shows how to integrate a custom CUDA kernel for the same operation.

Alternatively, you can write a CUDA kernel as part of a MEX-file and call it using the CUDA Runtime API inside the MEX-file. Either of these approaches might let you work with low-level features of the GPU, such as shared memory and texture memory, that are not directly available in MATLAB code. For more details, see the example Accessing Advanced CUDA Features Using MEX.

Best Practices for Improving Performance

Hardware Configuration

In general, you achieve the best performance when your GPU is dedicated to computing. It is usually not practical to use the same GPU device for both computing and graphics, because problems of reasonable size consume a large amount of GPU memory and the system uses the device constantly for graphics. If possible, obtain a separate device for graphics. Details of configuring your device for compute or graphics depend on the operating system and driver version.

On Windows® systems, a GPU device can be in one of two modes: Windows Display Driver Model (WDDM) or Tesla Compute Cluster (TCC) mode. For best performance, any devices used for computing should be in TCC mode. Consult NVIDIA documentation for more details.

NVIDIA’s highest-performance compute devices, the Tesla line, support error correcting codes (ECC) when reading and writing GPU memory. The purpose of ECC is to correct for occasional bit-errors that occur normally when reading or writing dynamic memory. One technique to improve performance is to turn off ECC to increase the achievable memory bandwidth. While the hardware can be configured this way, MathWorks does not recommend this practice. The potential loss of accuracy due to silent errors can be more harmful than the performance benefit.

MATLAB Coding Practices

This topic describes general techniques that help you achieve better performance on the GPU. Some of these tips apply when writing MATLAB code for the CPU as well.

Data in MATLAB arrays is stored in column-major order. Therefore, it is beneficial to operate along the first or column dimension of your array. If one dimension of your data is significantly longer than others, you might achieve better performance if you make that the first dimension. Similarly, if you frequently operate along a particular dimension, it is usually best to have it as the first dimension. In some cases, if consecutive operations target different dimensions of an array, it might be beneficial to transpose or permute the array between these operations.
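
For example, the following minimal sketch permutes an array so that its long dimension becomes the first dimension before a reduction; the sizes are arbitrary.

A = rand(8,1e6,'gpuArray');   % the long dimension is the second dimension
B = permute(A,[2 1]);         % move the long dimension to the first (column) dimension
s = sum(B,1);                 % operate along the first dimension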

GPUs achieve high performance by calculating many results in parallel. Thus, matrix and higher-dimensional array operations typically perform much better than operations on vectors or scalars. You can achieve better performance by rewriting your loops to make use of higher-dimensional operations. The process of revising loop-based, scalar-oriented code to use MATLAB matrix and vector operations is called vectorization. For more details, see Using Vectorization (MATLAB).

By default, all operations in MATLAB are performed in double-precision floating-point arithmetic. However, most operations support a variety of data types, including integer and single-precision floating-point. Today’s GPUs and CPUs typically have much higher throughput when performing single-precision operations, and single-precision floating-point data occupies less memory. If your application’s accuracy requirements allow the use of single-precision floating-point, it can greatly improve the performance of your MATLAB code.
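
For example, the following minimal sketch converts data to single precision before further computation; whether this is acceptable depends on the accuracy requirements of your application.

Xd = rand(4096,'gpuArray');   % gpuArray data is double precision by default
Xs = single(Xd);              % convert to a single-precision gpuArray
Y = Xs*Xs';                   % subsequent operations run in single precision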

The GPU sits at the end of a data transfer mechanism known as the PCI bus. While this bus is an efficient, high-bandwidth way to transfer data from the PC host memory to various extension cards, it is still much slower than the overall bandwidth to the global memory of the GPU device or of the CPU (for more details, see the example Measuring GPU Performance). In addition, transfers from the GPU device to MATLAB host memory cause MATLAB to wait for all pending operations on the device to complete before executing any other statements. This can significantly hurt the performance of your application. In general, limit the number of times you transfer data between the MATLAB workspace and the GPU. You usually get the best performance if you transfer data to the GPU once at the start of your application, perform all the calculations you can on the GPU, and transfer the results back into MATLAB only at the end. Similarly, when possible, create arrays directly on the GPU by using either the 'gpuArray' or the 'like' option for functions such as zeros (e.g., Z = zeros(___,'gpuArray') or Z = zeros(N,'like',g) for an existing gpuArray g).
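
For example, the following minimal sketch creates its data directly on the GPU, keeps intermediate results there, and transfers back to the workspace only once at the end; the sizes and number of iterations are arbitrary.

N = 4096;
G = rand(N,'gpuArray');       % create the data directly on the GPU
for k = 1:100
    G = G.*2 + 1;             % intermediate results stay on the GPU
end
result = gather(G);           % transfer back to the workspace only once, at the end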

Measure Performance on the GPU

The best way to measure performance on the GPU is to use gputimeit. This function takes as input a function handle with no input arguments, and returns the measured execution time of that function. It takes care of such benchmarking considerations as repeating the timed operation to get better resolution, executing the function before measurement to avoid initialization overhead, and subtracting out the overhead of the timing function. Also, gputimeit ensures that all operations on the GPU have completed before the final timing.

For example, consider measuring the time taken to compute the lu factorization of a random matrix A of size N-by-N. You can do this by defining a function that does the lu factorization and passing the function handle to gputimeit:

N = 1024;                % example matrix size
A = rand(N,'gpuArray');  % random N-by-N matrix on the GPU
fh = @() lu(A);
t = gputimeit(fh,2);     % the second argument is the number of outputs of fh

You can also measure performance with tic and toc. However, to get accurate timing on the GPU, you must wait for operations to complete before calling toc. There are two ways to do this. You can call gather on the final GPU output before calling toc: this forces all computations to complete before the time measurement is taken. Alternatively, you can use the wait function with a GPUDevice object as its input. For example, if you want to measure the time taken to compute the lu factorization of matrix A using tic, toc, and wait, you can do it as follows:

gd = gpuDevice();
tic();
[l,u] = lu(A);
wait(gd);
tLU = toc();

You can also use the MATLAB profiler to show how computation time is distributed in your GPU code. Note that, to obtain timing measurements, the profiler runs each line of code independently, so it cannot account for the overlapping (asynchronous) execution that might occur during normal operation. For timing whole algorithms, use tic and toc, or gputimeit, as described above. Also, the profiler might not yield correct results for user-defined MEX functions if they run asynchronously.
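
For example, a minimal sketch of profiling GPU code might look like the following; myGPUAlgorithm is a hypothetical function that operates on gpuArray data.

profile on
out = myGPUAlgorithm(gpuArray(rand(4096)));   % hypothetical function to profile
profile viewer                                % inspect the per-line timing report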

Vectorize for Improved GPU Performance

This example shows you how to improve performance by running a function on the GPU instead of the CPU, and by vectorizing the calculations.

Consider a function that performs fast convolution on the columns of a matrix. Fast convolution, which is a common operation in signal processing applications, transforms each column of data from the time domain to the frequency domain, multiplies it by the transform of a filter vector, transforms back to the time domain, and stores the result in an output matrix.

function y = fastConvolution(data,filter)
[m,n] = size(data);
% Zero-pad filter to the column length of data, and transform
filter_f = fft(filter,m);

% Create an array of zeros of the same size and class as data
y = zeros(m,n,'like',data);

% Transform each column of data
for ix = 1:n
    af = fft(data(:,ix));
    y(:,ix) = ifft(af .* filter_f);
end
end

Execute this function on the CPU with data of a particular size, and measure the execution time using the MATLAB timeit function. The timeit function takes care of common benchmarking considerations, such as accounting for startup and overhead.

a = complex(randn(4096,100),randn(4096,100));  % Data input
b = randn(16,1);                               % Filter input
c = fastConvolution(a,b);                      % Calculate output
ctime = timeit(@()fastConvolution(a,b));       % Measure CPU time
disp(['Execution time on CPU = ',num2str(ctime)]);

On a sample machine, this code displays the output:

Execution time on CPU = 0.019335

Now execute this function on the GPU. You can do this easily by changing the input data to be gpuArrays rather than normal MATLAB arrays. The 'like' syntax used when creating the output inside the function ensures that y will be a gpuArray if data is a gpuArray.

ga = gpuArray(a);                              % Move array to GPU
gb = gpuArray(b);                              % Move filter to GPU
gc = fastConvolution(ga,gb);                   % Calculate on GPU
gtime = gputimeit(@()fastConvolution(ga,gb));  % Measure GPU time
gerr = max(max(abs(gather(gc)-c)));            % Calculate error
disp(['Execution time on GPU = ',num2str(gtime)]);
disp(['Maximum absolute error = ',num2str(gerr)]);

On the same machine, this code displays the output:

Execution time on CPU = 0.019335
Execution time on GPU = 0.027235
Maximum absolute error = 1.1374e-14

Unfortunately, the GPU is slower than the CPU for this problem. The reason is that the for-loop executes the FFT, multiplication, and inverse FFT operations on individual columns of length 4096. The best way to increase the performance is to vectorize the code so that a single MATLAB function call performs more calculation. The FFT and IFFT operations are easy to vectorize: fft(A) computes the FFT of each column of a matrix A. You can multiply the filter with every column of a matrix at once by using the MATLAB binary singleton expansion function bsxfun. The vectorized function looks like this:

function y = fastConvolution_v2(data,filter)
m = size(data,1);
% Zero-pad filter to the length of data, and transform
filter_f = fft(filter,m);

% Transform each column of the input
af = fft(data);

% Multiply each column by filter and compute inverse transform
y = ifft(bsxfun(@times,af,filter_f));
end

Perform the same experiment using the vectorized function:

a = complex(randn(4096,100),randn(4096,100));   % Data input
b = randn(16,1);                                % Filter input
c = fastConvolution_v2(a,b);                    % Calculate output
ctime = timeit(@()fastConvolution_v2(a,b));     % Measure CPU time
disp(['Execution time on CPU = ',num2str(ctime)]);

ga = gpuArray(a);                               % Move data to GPU
gb = gpuArray(b);                               % Move filter to GPU
gc = fastConvolution_v2(ga, gb);                % Calculate on GPU
gtime = gputimeit(@()fastConvolution_v2(ga,gb));% Measure GPU time
gerr = max(max(abs(gather(gc)-c)));             % Calculate error
disp(['Execution time on GPU = ',num2str(gtime)]);
disp(['Maximum absolute error = ',num2str(gerr)]);

On the same machine, this code displays the output:

Execution time on CPU = 0.010393
Execution time on GPU = 0.0020537
Maximum absolute error = 1.1374e-14

In conclusion, vectorizing the code helps both the CPU and GPU versions to run faster. However, vectorization helps the GPU version much more than the CPU. The improved CPU version is nearly twice as fast as the original; the improved GPU version is 13 times faster than the original. The GPU code went from being 40% slower than the CPU in the original version, to about five times faster in the revised version.

Troubleshooting GPUs

If you have only one GPU in your machine, then it is likely that your graphics card is also acting as your display card. In this case, your GPU is probably subject to a timeout imposed by the operating system (OS). You can check this for your GPU as follows:

gpuDevice

ans =
  ...
  KernelExecutionTimeout: 1
  ...

If KernelExecutionTimeout = 1, then your GPU is subject to a timeout imposed by the OS, which ensures that the OS is always able to print updates to the screen. If your GPU calculation takes too long, then the operation is killed. In this case, you must restart MATLAB to resume GPU calculations successfully.
