Improve Performance of Small Matrix Problems on the GPU using PAGEFUN

This example shows how to use pagefun to improve the performance of applying a large number of independent rotations and translations to objects in a 3-D environment. This is typical of a range of problems which involve a large batch of calculations on small arrays.

GPUs are most effective when carrying out calculations on very large matrices. In MATLAB® this is usually achieved by vectorizing code to maximize the work done in each instruction. When you have a large data set but your calculations are divided into many small matrix operations, it can be challenging to maximize performance by running simultaneously on the many hundreds of GPU cores.

The arrayfun and bsxfun functions allow scalar operations to be carried out in parallel on the GPU. The pagefun function adds the capability of carrying out matrix operations in batch in a similar way. The pagefun function is available in Parallel Computing Toolbox™ for use with gpuArrays.

In this example, a robot is navigating a known map containing a large number of features that the robot can identify using its sensors. The robot locates itself in the map by measuring the relative position and orientation of those objects and comparing them to the map locations. Assuming the robot is not completely lost, it can use any difference between the two to correct its position, for instance by using a Kalman Filter. We will focus on the first part of the algorithm.

This example is a function so that the helper functions can be nested inside it.

function paralleldemo_gpu_pagefun

Set up the map

Let's create a map of objects with randomized positions and orientations in a large room.

numObjects = 1000;
roomDimensions = [50 50 5]; % Length * breadth * height in meters

We represent positions and orientations using 3-by-1 vectors T and 3-by-3 rotation matrices R. When we have N of these transforms we pack the translations into a 3-by-N matrix, and the rotations into a 3-by-3-by-N array. The following function initializes N transforms with random values, providing a structure as output:

    function Tform = randomTransforms(N)
        Tform.T = zeros(3, N);
        Tform.R = zeros(3, 3, N);
        for i = 1:N
            Tform.T(:,i) = rand(3, 1) .* roomDimensions';
            % To get a random orientation, we can extract an orthonormal
            % basis for a random 3-by-3 matrix.
            Tform.R(:,:,i) = orth(rand(3, 3));
        end
    end

Now use this to set up the map of object transforms, and a start location for the robot.

Map = randomTransforms(numObjects);
Robot = randomTransforms(1);

Define the equations

To correctly identify map features the robot needs to transform the map to put its sensors at the origin. Then it can find map objects by comparing what it sees with what it expects to see.

For a map object we can find its position relative to the robot and orientation by transforming its global map location:

where and are the position and orientation of the robot, and and represent the map data. The equivalent MATLAB code looks like this:

Rrel(:,:,i) = Rbot' * Rmap(:,:,i)
Trel(:,i) = Rbot' * (Tmap(:,i) - Tbot)

Perform many matrix transforms on the CPU using a for loop

We need to transform every map object to its location relative to the robot. We can do this serially by looping over all the transforms in turn. Note the 'like' syntax for zeros which will allow us to use the same code on the GPU in the next section.

    function Rel = loopingTransform(Robot, Map)
        Rel.R = zeros(size(Map.R), 'like', Map.R); % Initialize memory
        Rel.T = zeros(size(Map.T), 'like', Map.T); % Initialize memory
        for i = 1:numObjects
            Rel.R(:,:,i) = Robot.R' * Map.R(:,:,i);
            Rel.T(:,i) = Robot.R' * (Map.T(:,i) - Robot.T);
        end
    end

To time the calculation we use the timeit function, which will call loopingTransform multiple times to get an average timing. Since it requires a function with no arguments, we use the @() syntax to create an anonymous function of the right form.

cpuTime = timeit(@()loopingTransform(Robot, Map));
fprintf('It takes %3.4f seconds on the CPU to execute %d transforms.\n', ...
        cpuTime, numObjects);
It takes 0.0104 seconds on the CPU to execute 1000 transforms.

Try the same code on the GPU

To run this code on the GPU is merely a matter of copying the data into a gpuArray. When MATLAB encounters data stored on the GPU it will run any code using it on the GPU as long as it is supported.

gMap.R = gpuArray(Map.R);
gMap.T = gpuArray(Map.T);
gRobot.R = gpuArray(Robot.R);
gRobot.T = gpuArray(Robot.T);

Now we call gputimeit, which is the equivalent of timeit for code that includes GPU computation. It makes sure all GPU operations have finished before recording the time.

fprintf('Computing...\n');
gpuTime = gputimeit(@()loopingTransform(gRobot, gMap));

fprintf('It takes %3.4f seconds on the GPU to execute %d transforms.\n', ...
        gpuTime, numObjects);
fprintf(['Unvectorized GPU code is %3.2f times slower ',...
    'than the CPU version.\n'], gpuTime/cpuTime);
Computing...
It takes 0.5588 seconds on the GPU to execute 1000 transforms.
Unvectorized GPU code is 53.90 times slower than the CPU version.

Batch process using pagefun

The GPU version above was very slow because, although all calculations were independent, they ran in series. Using pagefun we can run all the computations in parallel. We also employ bsxfun to calculate the translations, since these are element-wise operations.

    function Rel = pagefunTransform(Robot, Map)
        Rel.R = pagefun(@mtimes, Robot.R', Map.R);
        Rel.T = Robot.R' * bsxfun(@minus, Map.T, Robot.T);
    end

gpuPagefunTime = gputimeit(@()pagefunTransform(gRobot, gMap));
fprintf(['It takes %3.4f seconds on the GPU using pagefun ',...
    'to execute %d transforms.\n'], gpuPagefunTime, numObjects);
fprintf(['Vectorized GPU code is %3.2f times faster ',...
    'than the CPU version.\n'], cpuTime/gpuPagefunTime);
fprintf(['Vectorized GPU code is %3.2f times faster ',...
    'than the unvectorized GPU version.\n'], gpuTime/gpuPagefunTime);
It takes 0.0008 seconds on the GPU using pagefun to execute 1000 transforms.
Vectorized GPU code is 13.55 times faster than the CPU version.
Vectorized GPU code is 730.18 times faster than the unvectorized GPU version.

Explanation

The first computation was the calculation of the rotations. This involved a matrix multiply, which translates to the function mtimes (*). We pass this to pagefun along with the two sets of rotations to be multiplied:

Rel.R = pagefun(@mtimes, Robot.R', Map.R);

Robot.R' is a 3-by-3 matrix, and Map.R is a 3-by-3-by-N array. The pagefun function matches each independent matrix from the map to the same robot rotation, and gives us the required 3-by-3-by-N output.

The translation calculation also involves a matrix multiply, but the normal rules of matrix multiplication allow this to come outside the loop without any changes. However, it also involves subtracting Robot.T from Map.T, which are different sizes. Since this is an element-by-element operation, we can use bsxfun to match up dimensions in the same way as pagefun did for the rotations:

Rel.T = Robot.R' * bsxfun(@minus, Map.T, Robot.T);

This time we needed to use the element-wise operator which maps to the function minus (-).

More advanced GPU vectorization - Solving a "lost robot" problem

If our robot was in an unknown part of the map, it might use a global search algorithm to locate itself. The algorithm would test a number of possible locations by carrying out the above computation and looking for good correspondence between the objects seen by the robot's sensors and what it would expect to see at that position.

Now we have got multiple robots as well as multiple objects. N objects and M robots should give us N*M transforms out. To distinguish 'robot space' from 'object space' we use the 4th dimension for rotations and the 3rd for translations. That means our robot rotations will be 3-by-3-by-1-by-M, and the translations will be 3-by-1-by-M.

We initialize our search with random robot locations. A good search algorithm would use topological or other clues to seed the search more intelligently.

numRobots = 10;
Robot = randomTransforms(numRobots);
Robot.R = reshape(Robot.R, 3, 3, 1, []); % Spread along the 4th dimension
Robot.T = reshape(Robot.T, 3, 1, []); % Spread along the 3rd dimension
gRobot.R = gpuArray(Robot.R);
gRobot.T = gpuArray(Robot.T);

Our new looping transform function requires two nested loops, to loop over the robots as well as over the objects.

    function Rel = loopingTransform2(Robot, Map)
        Rel.R = zeros(3, 3, numObjects, numRobots, 'like', Map.R);
        Rel.T = zeros(3, numObjects, numRobots, 'like', Map.T);
        for i = 1:numObjects
            for j = 1:numRobots
                Rel.R(:,:,i,j) = Robot.R(:,:,1,j)' * Map.R(:,:,i);
                Rel.T(:,i,j) = ...
                    Robot.R(:,:,1,j)' * (Map.T(:,i) - Robot.T(:,1,j));
            end
        end
    end

cpuTime = timeit(@()loopingTransform2(Robot, Map));
fprintf('It takes %3.4f seconds on the CPU to execute %d transforms.\n', ...
        cpuTime, numObjects*numRobots);
It takes 0.1493 seconds on the CPU to execute 10000 transforms.

For our GPU timings we use tic and toc this time, because otherwise the calculation would take too long. This will be precise enough for our purposes. To ensure any cost associated with creating the output data is included, we are calling loopingTransform2 with a single output variable, just as timeit and gputimeit do by default.

fprintf('Computing...\n');
tic;
gRel = loopingTransform2(gRobot, gMap); %#ok<NASGU> Suppress unused variable warning
gpuTime = toc;

fprintf('It takes %3.4f seconds on the GPU to execute %d transforms.\n', ...
        gpuTime, numObjects*numRobots);
fprintf(['Unvectorized GPU code is %3.2f times slower ',...
    'than the CPU version.\n'], gpuTime/cpuTime);
Computing...
It takes 7.0564 seconds on the GPU to execute 10000 transforms.
Unvectorized GPU code is 47.26 times slower than the CPU version.

As before, the looping version runs much slower on the GPU because it is not doing calculations in parallel.

The new pagefun version needs to incorporate the transpose operator as well as mtimes into a call to pagefun. We also need to squeeze the transposed robot orientations to put the spread over robots into the 3rd dimension, to match the translations. Despite this, the resulting code is considerably more compact.

    function Rel = pagefunTransform2(Robot, Map)
        Rt = pagefun(@transpose, Robot.R);
        Rel.R = pagefun(@mtimes, Rt, Map.R);
        Rel.T = pagefun(@mtimes, squeeze(Rt), ...
                        bsxfun(@minus, Map.T, Robot.T));
    end

Once again, pagefun and bsxfun expand dimensions appropriately. So where we multiply 3-by-3-by-1-by-M matrix Rt with 3-by-3-by-N-by-1 matrix Map.R, we get a 3-by-3-by-N-by-M matrix out.

gpuPagefunTime = gputimeit(@()pagefunTransform2(gRobot, gMap));
fprintf(['It takes %3.4f seconds on the GPU using pagefun ',...
    'to execute %d transforms.\n'], gpuPagefunTime, numObjects*numRobots);
fprintf(['Vectorized GPU code is %3.2f times faster ',...
    'than the CPU version.\n'], cpuTime/gpuPagefunTime);
fprintf(['Vectorized GPU code is %3.2f times faster ',...
    'than the unvectorized GPU version.\n'], gpuTime/gpuPagefunTime);
It takes 0.0025 seconds on the GPU using pagefun to execute 10000 transforms.
Vectorized GPU code is 59.97 times faster than the CPU version.
Vectorized GPU code is 2834.45 times faster than the unvectorized GPU version.

Conclusion

The pagefun function supports a number of 2-D operations, as well as most of the scalar operations supported by arrayfun and bsxfun. Together, these functions allow you to vectorize a range of computations involving matrix algebra and array manipulation, eliminating the need for loops and making huge performance gains.

Wherever you are doing small calculations on GPU data in a loop, you should consider converting to a batch implementation in this way. This can also be an opportunity to make use of the GPU to improve performance where previously it gave no performance gains.

end
Was this topic helpful?