This topic explains how to create an executable kernel from CU or PTX (parallel thread execution) files, and run that kernel on a GPU from MATLAB®. The kernel is represented in MATLAB by a CUDAKernel object, which can operate on MATLAB array or gpuArray variables.

The following steps describe the general CUDAKernel workflow:
1. Use compiled PTX code to create a CUDAKernel object, which contains the GPU executable code.
2. Set properties on the CUDAKernel object to control its execution on the GPU.
3. Call feval on the CUDAKernel with the required inputs, to run the kernel on the GPU.
MATLAB code that follows these steps might look something like this:
% 1. Create CUDAKernel object.
k = parallel.gpu.CUDAKernel('myfun.ptx','myfun.cu','entryPt1');

% 2. Set object properties.
k.GridSize = [8 1];
k.ThreadBlockSize = [16 1];

% 3. Call feval with defined inputs.
g1 = gpuArray(in1); % Input gpuArray.
g2 = gpuArray(in2); % Input gpuArray.

result = feval(k,g1,g2);
The following sections provide details of these commands and workflow steps.
If you have a CU file you want to execute on the GPU, you must first compile it to create a PTX file. One way to do this is with the nvcc compiler in the NVIDIA® CUDA® Toolkit. For example, if your CU file is called myfun.cu, you can create a compiled PTX file with the shell command:
nvcc -ptx myfun.cu
This generates the file named myfun.ptx.
With a .cu file and a .ptx file you can create a CUDAKernel object in MATLAB that you can then use to evaluate the kernel:
k = parallel.gpu.CUDAKernel('myfun.ptx','myfun.cu');
You cannot save or load CUDAKernel objects.
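Because the object cannot be saved or loaded, you must recreate it in every MATLAB session that uses it. A minimal sketch of one way to do this (the helper function name createMyKernel is hypothetical):

function k = createMyKernel()
% Recreate the CUDAKernel object at the start of each session,
% since it cannot be saved to or loaded from a MAT-file.
% Assumes myfun.ptx and myfun.cu are on the MATLAB path.
k = parallel.gpu.CUDAKernel('myfun.ptx','myfun.cu');
end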
If you do not have the CU file corresponding to your PTX file, you can specify the C prototype for your C kernel instead of the CU file. For example:
k = parallel.gpu.CUDAKernel('myfun.ptx','float *, const float *, float');
Another use for C prototype input is when your source code uses an unrecognized renaming of a supported data type. (See the supported types below.) Suppose your kernel comprises the following code.
typedef float ArgType;
__global__ void add3( ArgType * v1, const ArgType * v2 )
{
    int idx = threadIdx.x;
    v1[idx] += v2[idx];
}
ArgType itself is not recognized as a supported data type, so the CU file that includes it cannot be directly used as input when creating the CUDAKernel object in MATLAB. However, the supported input types to the add3 kernel can be specified as C prototype input to the CUDAKernel constructor. For example:
k = parallel.gpu.CUDAKernel('test.ptx','float *, const float *','add3');
The supported C/C++ standard data types are listed in the following table.
Float Types | Integer Types | Boolean and Character Types |
---|---|---|
double, float | short, unsigned short, int, unsigned int, long, unsigned long, long long, unsigned long long | bool, char, unsigned char |
Also, the following integer types are supported when you include the tmwtypes.h header file in your program.
Integer Types |
---|
int8_T, int16_T, int32_T, int64_T, uint8_T, uint16_T, uint32_T, uint64_T |
The header file is shipped as matlabroot/extern/include/tmwtypes.h. You include the file in your program with the line:
#include "tmwtypes.h"
All inputs can be scalars or pointers, and can be labeled const.
The C declaration of a kernel is always of the form:
__global__ void aKernel(inputs ...)
The kernel must return nothing, and operate only on its input arguments (scalars or pointers).
A kernel is unable to allocate any form of memory, so all outputs must be pre-allocated before the kernel is executed. Therefore, the sizes of all outputs must be known before you run the kernel.
In principle, all pointers passed into the kernel that are not const could contain output data, since the many threads of the kernel could modify that data.
When translating the definition of a kernel in C into MATLAB:
All scalar inputs in C (double, float, int, etc.) must be scalars in MATLAB, or scalar (i.e., single-element) gpuArray variables.
All const pointer inputs in C (const double *, etc.) can be scalars or matrices in MATLAB. They are cast to the correct type, copied onto the device, and a pointer to the first element is passed to the kernel. No information about the original size is passed to the kernel. It is as though the kernel has directly received the result of mxGetData on an mxArray.
All nonconstant pointer inputs in C are transferred to the kernel exactly as nonconstant pointers. However, because a nonconstant pointer could be changed by the kernel, it is considered an output from the kernel.
Inputs from MATLAB workspace scalars and arrays are cast into the requested type and then passed to the kernel. However, gpuArray inputs are not automatically cast, so their type and complexity must exactly match those expected.
These rules have some implications. The most notable is that every output from a kernel must necessarily also be an input to the kernel: because the kernel cannot allocate memory, the input is what defines the size of the output, as the sketch below shows.
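For example, here is a minimal sketch of this pattern. The file scale.ptx and entry scale are hypothetical, standing in for a kernel with signature __global__ void scale( float * out, const float * in, float alpha ):

k = parallel.gpu.CUDAKernel('scale.ptx','float *, const float *, float','scale');
k.ThreadBlockSize = [100 1 1];           % one thread per element
in  = gpuArray(rand(100,1,'single'));    % input data
out = zeros(100,1,'single','gpuArray');  % preallocate the output on the GPU
out = feval(k,out,in,2);                 % out is both an input and the output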
When you create a kernel object without a terminating semicolon, or when you type the object variable at the command line, MATLAB displays the kernel object properties. For example:
k = parallel.gpu.CUDAKernel('conv.ptx','conv.cu')
k = 
  parallel.gpu.CUDAKernel handle
  Package: parallel.gpu

  Properties:
         ThreadBlockSize: [1 1 1]
      MaxThreadsPerBlock: 512
                GridSize: [1 1 1]
        SharedMemorySize: 0
              EntryPoint: '_Z8theEntryPf'
      MaxNumLHSArguments: 1
         NumRHSArguments: 2
           ArgumentTypes: {'in single vector'  'inout single vector'}
The properties of a kernel object control some of its execution behavior. Use dot notation to alter those properties that can be changed.
For descriptions of the object properties, see the CUDAKernel object reference page. A typical reason for modifying the settable properties is to specify the number of threads, as described below.
If your PTX file contains multiple entry points, you can identify the particular kernel in myfun.ptx that you want the kernel object k to refer to:
k = parallel.gpu.CUDAKernel('myfun.ptx','myfun.cu','myKernel1');
A single PTX file can contain multiple entry points to different kernels. Each of these entry points has a unique name. These names are generally mangled (as in C++ mangling). However, when generated by nvcc the PTX name always contains the original function name from the CU file. For example, if the CU file defines the kernel function as
__global__ void simplestKernelEver( float * x, float val )
then the PTX code contains an entry that might be called _Z18simplestKernelEverPff.
When you have multiple entry points, specify the entry name for the particular kernel when calling CUDAKernel to generate your kernel.
The CUDAKernel function searches for your entry name in the PTX file, and matches on any substring occurrences. Therefore, you should not name any of your entries as substrings of any others.
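Because matching is by substring, one safe option is to pass the full mangled name, which is very unlikely to collide with another entry. For example, assuming the simplestKernelEver entry shown above is compiled into myfun.ptx:

k = parallel.gpu.CUDAKernel('myfun.ptx','myfun.cu','_Z18simplestKernelEverPff');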
You might not have control over the original entry names, in which case you must be aware of the unique mangled name derived for each. For example, consider the following function template.
template <typename T>
__global__ void add4( T * v1, const T * v2 )
{
    int idx = threadIdx.x;
    v1[idx] += v2[idx];
}
When the template is expanded out for float and double, it results in two entry points, both of which contain the substring add4.
template __global__ void add4<float>(float *, const float *);
template __global__ void add4<double>(double *, const double *);
The PTX has corresponding entries:
_Z4add4IfEvPT_PKS0_
_Z4add4IdEvPT_PKS0_
Use entry point add4If for the float version, and add4Id for the double version.
k = parallel.gpu.CUDAKernel('test.ptx','double *, const double *','add4Id');
You specify the number of computational threads for your CUDAKernel by setting two of its object properties:
GridSize — A vector of three elements, the product of which determines the number of blocks.

ThreadBlockSize — A vector of three elements, the product of which determines the number of threads per block. (Note that the product cannot exceed the value of the property MaxThreadsPerBlock.)
The default value for both of these properties is [1 1 1], but suppose you want to use 500 threads to run element-wise operations on vectors of 500 elements in parallel. A simple way to achieve this is to create your CUDAKernel and set its properties accordingly:
k = parallel.gpu.CUDAKernel('myfun.ptx','myfun.cu');
k.ThreadBlockSize = [500,1,1];
Generally, you set the grid and thread block sizes based on the sizes of your inputs. For information on thread hierarchy, and multiple-dimension grids and blocks, see the NVIDIA CUDA C Programming Guide.
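For example, a common sizing pattern fixes the block size and derives the grid size from the number of elements. This sketch assumes the kernel checks its computed index against the data length, so that any extra threads in the last block do no work:

N = numel(in1);                           % total number of elements to process
blockSize = min(N,k.MaxThreadsPerBlock);  % threads per block
k.ThreadBlockSize = [blockSize 1 1];
k.GridSize = [ceil(N/blockSize) 1 1];     % enough blocks to cover all N elements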
Use the feval function to evaluate a CUDAKernel on the GPU. The following examples show how to execute a kernel using MATLAB workspace variables and gpuArray variables.
Assume that you have already written some kernels in CUDA C and want to use them in MATLAB to execute on the GPU. You have a kernel that does a convolution on two vectors; load and run it with two random input vectors:
k = parallel.gpu.CUDAKernel('conv.ptx','conv.cu');
result = feval(k,rand(100,1),rand(100,1));
Even if the inputs are constants or MATLAB workspace variables, the output is a gpuArray.
It might be more efficient to use gpuArray objects as input when running a kernel:
k = parallel.gpu.CUDAKernel('conv.ptx','conv.cu');
i1 = gpuArray(rand(100,1,'single'));
i2 = gpuArray(rand(100,1,'single'));
result1 = feval(k,i1,i2);
Because the output is a gpuArray, you can now perform other operations using this input or output data without further transfers between the MATLAB workspace and the GPU. When all your GPU computations are complete, gather your final result data into the MATLAB workspace:
result2 = feval(k,i1,i2);
r1 = gather(result1);
r2 = gather(result2);
When calling [out1, out2] = feval(kernel, in1, in2, in3), the inputs in1, in2, and in3 correspond to each of the input arguments to the C function within your CU file. The outputs out1 and out2 store the values of the first and second non-const pointer input arguments to the C function after the C kernel has been executed.
For example, if the C kernel within a CU file has the following signature:
void reallySimple( float * pInOut, float c )
the corresponding kernel object (k) in MATLAB has the following properties:
MaxNumLHSArguments: 1
   NumRHSArguments: 2
     ArgumentTypes: {'inout single vector'  'in single scalar'}
Therefore, to use the kernel object from this code with feval, you need to provide feval two input arguments (in addition to the kernel object), and you can use one output argument:
y = feval(k,x1,x2)
The input values x1 and x2 correspond to pInOut and c in the C function prototype. The output argument y corresponds to the value of pInOut in the C function prototype after the C kernel has executed.
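For example, a minimal call sketch (the file name simple.ptx is hypothetical; note that a gpuArray input must already be of type single, while workspace data is cast for you):

k = parallel.gpu.CUDAKernel('simple.ptx','float *, float','reallySimple');
x1 = gpuArray(rand(10,1,'single'));  % gpuArray input: type must match float exactly
x2 = 3;                              % workspace scalar: cast to float automatically
y = feval(k,x1,x2);                  % y is the final value of pInOut, as a gpuArray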
The following is a slightly more complicated example that shows a combination of const and non-const pointers:
void moreComplicated( const float * pIn, float * pInOut1, float * pInOut2 )
The corresponding kernel object in MATLAB then has the properties:
MaxNumLHSArguments: 2
   NumRHSArguments: 3
     ArgumentTypes: {'in single vector'  'inout single vector'  'inout single vector'}
You can use feval on this code’s kernel (k) with the syntax:
[y1,y2] = feval(k,x1,x2,x3)
The three input arguments x1, x2, and x3 correspond to the three arguments that are passed into the C function. The output arguments y1 and y2 correspond to the values of pInOut1 and pInOut2 after the C kernel has executed.
This example adds two doubles together in the GPU. You should have the NVIDIA CUDA Toolkit installed, and have CUDA-capable drivers for your device.
The CU code to do this is as follows.
__global__ void add1( double * pi, double c )
{
    *pi += c;
}
The directive __global__ indicates that this is an entry point to a kernel. The code uses a pointer to send out the result in pi, which is both an input and an output. Put this code in a file called test.cu in the current directory.
Compile the CU code at the shell command line to generate a PTX file called test.ptx.
nvcc -ptx test.cu
Create the kernel in MATLAB. Currently this PTX file has only one entry point, so you do not need to specify it. If you were to add more kernels, you would specify add1 as the entry.
k = parallel.gpu.CUDAKernel('test.ptx','test.cu');
Run the kernel with two numeric inputs. By default, a kernel runs on one thread.
result = feval(k,2,3)
result =
     5
This example extends the previous one to add two vectors together. For simplicity, assume that there are exactly the same number of threads as elements in the vectors and that there is only one thread block.
The CU code is slightly different from the last example. Both inputs are pointers, and one is constant because you are not changing it. Each thread simply adds the elements at its thread index; the kernel uses the thread index to work out which element this thread should add. (Getting these thread- and block-specific values is a very common pattern in CUDA programming.)
__global__ void add2( double * v1, const double * v2 )
{
    int idx = threadIdx.x;
    v1[idx] += v2[idx];
}
Save this code in the file test.cu. Compile as before using nvcc.
nvcc -ptx test.cu
If this code were put in the same CU file as the first example, you would need to specify the entry point name to distinguish the two kernels.
k = parallel.gpu.CUDAKernel('test.ptx','test.cu','add2');
Before you run the kernel, set the number of threads correctly for the vectors you want to add.
N = 128;
k.ThreadBlockSize = N;
in1 = ones(N,1,'gpuArray');
in2 = ones(N,1,'gpuArray');
result = feval(k,in1,in2);
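Each element of the result should equal 2. You can confirm this by gathering the data back into the MATLAB workspace:

r = gather(result);
all(r == 2)   % returns logical 1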
For an example that shows how to work with CUDA, and provides CU and PTX files for you to experiment with, see Illustrating Three Approaches to GPU Computing: The Mandelbrot Set.