Improve parfor Performance

You can improve the performance of parfor-loops in various ways. This includes parallel creation of arrays inside the loop; profiling parfor-loops; slicing arrays; and optimizing your code on local workers before running on a cluster.

Where to Create Arrays

When you create a large array in the client before your parfor-loop, and access it within the loop, you might observe slow execution of your code. To improve performance, tell each MATLAB® worker to create its own arrays, or portions of them, in parallel. Having each worker create its own copy of these arrays inside the loop saves the time of transferring the data from the client to the workers. This runs against the usual practice of initializing variables before a for-loop to avoid needless repetition inside the loop, but with parfor-loops you might find that parallel creation of arrays inside the loop improves performance.

Performance improvement depends on several factors, including:

size of the arrays
time needed to create arrays
worker access to all or part of the arrays
number of loop iterations that each worker performs

Consider all factors in this list when deciding whether to convert for-loops to parfor-loops. For more details, see Convert for-Loops Into parfor-Loops.

As an alternative, consider the parallel.pool.Constant function to establish variables on the pool workers before the loop. These variables remain on the workers after the loop finishes, and remain available for multiple parfor-loops. You might improve performance using parallel.pool.Constant, because the data is transferred only once to the workers.

In this example, you first create a big data set D and execute a parfor-loop accessing D. Then you use D to build a parallel.pool.Constant object, which allows you to reuse the data by copying D to each worker. Measure the elapsed time using tic and toc for each case and note the difference.

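```
function constantDemo

% Case 1: D is a broadcast variable, sent to the workers for every parfor-loop
D = rand(1e7, 1);
tic
for i = 1:20
    a = 0;
    parfor j = 1:60
        a = a + sum(D);
    end
end
toc

% Case 2: D is copied to each worker once, then reused through D.Value
tic
D = parallel.pool.Constant(D);
for i = 1:20
    b = 0;
    parfor j = 1:60
        b = b + sum(D.Value);
    end
end
toc
```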

```
>> constantDemo
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
Elapsed time is 63.839702 seconds.
Elapsed time is 10.194815 seconds.
```

In the second case, you send the data to the workers only once, instead of once for each of the 20 parfor-loops. In this run, using the parallel.pool.Constant object reduces the elapsed time from about 64 seconds to about 10 seconds.
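The two ideas in this section can be combined: parallel.pool.Constant also accepts a function handle, which each worker evaluates to build its own value, so the array is never transferred from the client. The following is a minimal sketch under that assumption; the variable names C and c are illustrative. Because each worker calls rand itself, the workers hold different random data, so this variant suits cases where each worker only needs some array of this kind rather than an identical copy.

```matlab
% Sketch: build the data on each worker instead of copying it from the client.
% The function handle is evaluated on each worker to create that worker's value.
C = parallel.pool.Constant(@() rand(1e7, 1));

tic
for i = 1:20
    c = 0;
    parfor j = 1:60
        c = c + sum(C.Value);   % C.Value is the worker-local array
    end
end
toc
```

Compare the elapsed time with the two cases in constantDemo on your own pool; the timings shown earlier were not measured for this variant.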

Profiling parfor-Loops

You can profile a parfor-loop by measuring the time elapsed using tic and toc. You can also measure how much data is transferred to and from the workers in the parallel pool by using ticBytes and tocBytes. Note that this is different from profiling MATLAB code in the usual sense using the MATLAB profiler; for that, see Profile to Improve Performance (MATLAB).

This example calculates the spectral radius of a matrix and converts a for-loop into a parfor-loop. Measure the resulting speedup and the amount of transferred data.

In the MATLAB Editor, enter the following for-loop. Add tic and toc to measure the time elapsed. Save the file as MyForLoop.m.
```
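function a = MyForLoop(A)
% A is the size of the random square matrices, for example 500.

a = zeros(1, 200);   % preallocate the result vector
tic
for i = 1:200
    % Spectral radius: largest absolute eigenvalue of a random A-by-A matrix
    a(i) = max(abs(eig(rand(A))));
end
toc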
```

Run the code, and note the elapsed time.

```
a = MyForLoop(500);

Elapsed time is 31.935373 seconds.
```

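In MyForLoop.m, replace the for-loop with a parfor-loop and rename the function to MyParforLoop. Keep tic and toc, and add ticBytes and tocBytes to measure how much data is transferred to and from the workers in the parallel pool. Save the file as MyParforLoop.m.
```
function a = MyParforLoop(A)

tic
ticBytes(gcp);   % start counting bytes transferred to and from the pool
parfor i = 1:200
    a(i) = max(abs(eig(rand(A))));
end
tocBytes(gcp)    % display the bytes transferred, per worker
toc
```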

Run the new code, and run it again. Note that the first run is slower than the second run, because the parallel pool has to be started and you have to make the code available to the workers. Note the elapsed time for the second run.

By default, MATLAB automatically opens a parallel pool of workers on your local machine.

```
a = MyParforLoop(500);

Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
...
             BytesSentToWorkers    BytesReceivedFromWorkers
             __________________    ________________________

    1        15340                  7024                   
    2        13328                  5712                   
    3        13328                  5704                   
    4        13328                  5728                   
    Total    55324                 24168                   

Elapsed time is 10.760068 seconds.
```

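The elapsed time is 31.9 seconds in serial and 10.8 seconds in parallel, a speedup of roughly 3x on 4 workers. Only about 55 KB is sent to the workers and about 24 KB is received from them, so data transfer is not a bottleneck here. This code benefits from converting to a parfor-loop.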
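If you want to record these measurements rather than read them from the Command Window, you can request outputs from toc and tocBytes. The following is a minimal sketch, assuming that tocBytes called with an output argument returns one row per worker with the bytes sent and received in two columns; the variable names pool, bytes, tStart, and tParallel are illustrative.

```matlab
A = 500;                      % matrix size, as in the example above
pool = gcp;                   % current pool (starts one if none is open)

a = zeros(1, 200);
ticBytes(pool);               % start counting transferred bytes
tStart = tic;
parfor i = 1:200
    a(i) = max(abs(eig(rand(A))));
end
tParallel = toc(tStart);      % elapsed time in seconds
bytes = tocBytes(pool);       % rows: workers; columns: [sent, received]

totalSent = sum(bytes(:, 1));
totalReceived = sum(bytes(:, 2));
fprintf('%.1f s, %d bytes sent, %d bytes received\n', ...
    tParallel, totalSent, totalReceived);
```

As with the example above, run it twice and use the second measurement, so that pool startup is not included.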

Slicing Arrays

If a variable is initialized before a parfor-loop, then used inside the parfor-loop, it has to be passed to each MATLAB worker evaluating the loop iterations. Only those variables used inside the loop are passed from the client workspace. However, if all occurrences of the variable are indexed by the loop variable, each worker receives only the part of the array it needs.

As an example, you first run a parfor-loop using a sliced variable and measure the elapsed time.

```
% Sliced version

M = 100;
N = 1e6;
data = rand(M, N);

tic
parfor idx = 1:M
    out2(idx) = sum(data(idx, :)) ./ N;
end
toc
```

Elapsed time is 2.261504 seconds.

Now suppose that you accidentally use a reference to the variable data instead of N inside the parfor-loop. The problem here is that the call to size(data, 2) converts the sliced variable into a broadcast (non-sliced) variable.

```
% Accidentally non-sliced version

clear

M = 100;
N = 1e6;
data = rand(M, N);

tic
parfor idx = 1:M
    out2(idx) = sum(data(idx, :)) ./ size(data, 2);
end
toc
```

Elapsed time is 8.369071 seconds.

Note that the elapsed time is greater for the accidentally broadcast variable.

In this case, you can easily avoid the non-sliced usage of data, because the result of size(data, 2) is a constant that can be computed outside the loop. In general, you can perform computations that depend only on broadcast data before the loop starts, since the broadcast data cannot be modified inside the loop. Here the computation is trivial and produces a scalar, so moving it out of the loop costs almost nothing and lets data remain a sliced variable.
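The fix described above can be sketched as follows. The names nCols and out3 are illustrative and not part of the original example. The size is computed once on the client, and inside the loop data is indexed only by the loop variable, so it should be treated as a sliced variable again.

```matlab
% Hoisted version (sketch)

M = 100;
N = 1e6;
data = rand(M, N);

nCols = size(data, 2);   % computed once on the client; a small broadcast scalar

tic
parfor idx = 1:M
    % data is indexed only by idx here, so each worker receives only its rows
    out3(idx) = sum(data(idx, :)) ./ nCols;
end
toc
```

You would expect the elapsed time to be close to that of the sliced version, but measure it on your own pool.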

Optimizing on Local vs. Cluster Workers

Running your code on local workers might offer the convenience of testing your application without requiring the use of cluster resources. However, there are certain drawbacks or limitations with using local workers. Because the transfer of data does not occur over the network, transfer behavior on local workers might not be indicative of how it will typically occur over a network.

With local workers, because all the MATLAB worker sessions are running on the same machine, you might not see any improvement in execution time from a parfor-loop. This can depend on many factors, including how many processors and cores your machine has. The key point here is that a cluster might have more cores available than your local machine. If MATLAB can already multithread your code, the cores of your local machine are already in use, so the only way to go faster is to use more cores to work on the problem, using a cluster.
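To judge how much headroom your local machine has, you can compare the number of computational threads available to the client with the number of workers in the pool. This is a rough sketch only; it reports the two counts and does not measure performance. The variable names nThreads and pool are illustrative.

```matlab
nThreads = maxNumCompThreads;   % computational threads available to this session
pool = gcp('nocreate');         % existing pool, or empty if none is open

if isempty(pool)
    fprintf('Client threads: %d, no parallel pool open.\n', nThreads);
else
    fprintf('Client threads: %d, pool workers: %d\n', nThreads, pool.NumWorkers);
end
```

If built-in multithreading already keeps all of these threads busy in the serial version of your code, local workers compete for the same cores and are unlikely to give much additional speedup.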

You might experiment to see if it is faster to create the arrays before the loop (as in the first example below), rather than have each worker create its own arrays inside the loop (as in the second example).

Try the following examples running a parallel pool locally, and notice the difference in execution time for each loop. First open a local parallel pool:

```
parpool('local')
```

Run each of the following examples twice. The first run of each case is slower than the second run, because the parallel pool has to be started and you have to make the code available to the workers. For each case, note the elapsed time of the second run.

```
% Arrays created on the client before the loop, then sent to the workers
tic;
n = 200;
M = magic(n);
R = rand(n);
parfor i = 1:n
   A(i) = sum(M(i,:).*R(n+1-i,:));
end
toc
```

```
% Arrays created by each worker inside the loop, in every iteration
tic;
n = 200;
parfor i = 1:n
   M = magic(n);
   R = rand(n);
   A(i) = sum(M(i,:).*R(n+1-i,:));
end
toc
```

Running on a remote cluster, you might find different behavior, as workers can simultaneously create their arrays, saving transfer time. Therefore, code that is optimized for local workers might not be optimized for cluster workers, and vice versa.
