parfor Performance
You can improve the performance of parfor-loops in various ways. These include creating arrays in parallel inside the loop, profiling parfor-loops, slicing arrays, and optimizing your code on local workers before running on a cluster.
When you create a large array in the client before your parfor-loop, and access it within the loop, you might observe slow execution of your code. To improve performance, tell each MATLAB® worker to create its own arrays, or portions of them, in parallel inside the loop. This saves the time of transferring the data from the client to the workers. This runs counter to the usual practice of initializing variables before a for-loop to avoid needless repetition inside the loop, but with parfor you might find that creating arrays in parallel inside the loop improves performance.
Performance improvement depends on several factors, including:
size of the arrays
time needed to create arrays
worker access to all or part of the arrays
number of loop iterations that each worker performs
Consider all of these factors when deciding whether to convert for-loops to parfor-loops.
For more details, see Convert for-Loops Into parfor-Loops.
As an alternative, consider the parallel.pool.Constant function to establish variables on the pool workers before the loop. These variables remain on the workers after the loop finishes, and remain available for multiple parfor-loops. You might improve performance using parallel.pool.Constant, because the data is transferred only once to the workers.
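parallel.pool.Constant can also be built from a function handle, in which case each worker evaluates the function and keeps its own result, so nothing large is copied from the client. The following is a minimal sketch of that variant and is not part of the original example. The array size and the variable names C and total are illustrative assumptions. Because each worker calls rand itself, the workers hold different random data.

```matlab
% Sketch: let every worker build its own data instead of copying it
% from the client. The function handle is evaluated once per worker.
% Assumption: a parallel pool is available or can be started automatically.
C = parallel.pool.Constant(@() rand(1e6, 1));

total = 0;
parfor j = 1:60
    % C.Value is the array that the executing worker created for itself.
    total = total + sum(C.Value);   % total is a reduction variable
end
disp(total)
```

This form suits data that is cheap to generate but expensive to transfer. If every worker must see exactly the same values, build the Constant from a client array instead.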
In this example, you first create a big data set D and execute a parfor-loop accessing D. Then you use D to build a parallel.pool.Constant object, which allows you to reuse the data by copying D to each worker. Measure the elapsed time using tic and toc for each case and note the difference.
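function constantDemo
    D = rand(1e7, 1);

    % Case 1: D is sent to the workers for every parfor-loop
    tic
    for i = 1:20
        a = 0;
        parfor j = 1:60
            a = a + sum(D);
        end
    end
    toc

    % Case 2: D is copied to each worker once and then reused
    tic
    D = parallel.pool.Constant(D);
    for i = 1:20
        b = 0;
        parfor j = 1:60
            b = b + sum(D.Value);
        end
    end
    toc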
>> constantDemo
Starting parallel pool (parpool) using the 'local' profile ...
connected to 4 workers.
Elapsed time is 63.839702 seconds.
Elapsed time is 10.194815 seconds.
The elapsed time drops from about 63.8 seconds to about 10.2 seconds for the parfor-loop by using the parallel.pool.Constant object.
Profiling parfor-loops
You can profile a parfor-loop by measuring the time elapsed using tic and toc. You can also measure how much data is transferred to and from the workers in the parallel pool by using ticBytes and tocBytes. Note that this is different from profiling MATLAB code in the usual sense using the MATLAB profiler. For that, see Profile to Improve Performance (MATLAB).
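As a minimal sketch of combining the two measurements, the snippet below captures the elapsed time and the byte counts in variables instead of only printing them. The matrix A, the iteration count n, and the loop body are hypothetical choices made for illustration. The sketch assumes that gcp returns a running pool.

```matlab
pool = gcp;                           % current pool; assumed to exist or start automatically
A = rand(200);                        % hypothetical input, broadcast to every worker
n = 50;
r = zeros(1, n);                      % preallocate the sliced output

startState = ticBytes(pool);          % start counting transferred bytes
t = tic;                              % start a timer
parfor i = 1:n
    r(i) = max(abs(eig(A + i)));      % some work that reads the broadcast matrix
end
elapsed = toc(t);                     % seconds spent in the loop
bytes = tocBytes(pool, startState);   % one row per worker: [sent, received]

fprintf('Elapsed: %.2f s, bytes sent to workers: %d\n', ...
    elapsed, sum(bytes(:, 1)));
```

Keeping the results in variables makes it easier to compare several versions of a loop in one script.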
This example calculates the spectral radius of a matrix and converts a for-loop into a parfor-loop. Measure the resulting speedup and the amount of transferred data.
In the MATLAB Editor, enter the following for-loop. Add tic and toc to measure the time elapsed. Save the file as MyForLoop.m.
function a = MyForLoop(A)
    tic
    for i = 1:200
        a(i) = max(abs(eig(rand(A))));
    end
    toc
Run the code, and note the elapsed time.
a = MyForLoop(500);
Elapsed time is 31.935373 seconds.
In MyForLoop.m, replace the for-loop with a parfor-loop. Add ticBytes and tocBytes to measure how much data is transferred to and from the workers in the parallel pool. Save the file as MyParforLoop.m.
    ticBytes(gcp);
    parfor i = 1:200
        a(i) = max(abs(eig(rand(A))));
    end
    tocBytes(gcp)
Run the new code, and run it again. Note that the first run is slower than the second run, because the parallel pool has to be started and you have to make the code available to the workers. Note the elapsed time for the second run.
By default, MATLAB automatically opens a parallel pool of workers on your local machine.
a = MyParforLoop(500);
Starting parallel pool (parpool) using the 'local' profile ...
connected to 4 workers.
...
             BytesSentToWorkers    BytesReceivedFromWorkers
             __________________    ________________________

    1        15340                  7024
    2        13328                  5712
    3        13328                  5704
    4        13328                  5728
    Total    55324                 24168

Elapsed time is 10.760068 seconds.
The elapsed time drops from about 31.9 seconds for the for-loop to about 10.8 seconds for the parfor-loop.
If a variable is initialized before a parfor-loop, then used inside the parfor-loop, it has to be passed to each MATLAB worker evaluating the loop iterations. Only those variables used inside the loop are passed from the client workspace. However, if all occurrences of the variable are indexed by the loop variable, each worker receives only the part of the array it needs.
As an example, you first run a parfor-loop using a sliced variable and measure the elapsed time.
% Sliced version
M = 100;
N = 1e6;
data = rand(M, N);
tic
parfor idx = 1:M
    out2(idx) = sum(data(idx, :)) ./ N;
end
toc
Elapsed time is 2.261504 seconds.
Now suppose that you accidentally use a reference to the variable data instead of N inside the parfor-loop. The problem here is that the call to size(data, 2) converts the sliced variable into a broadcast (non-sliced) variable.
% Accidentally non-sliced version
clear
M = 100;
N = 1e6;
data = rand(M, N);
tic
parfor idx = 1:M
    out2(idx) = sum(data(idx, :)) ./ size(data, 2);
end
toc
Elapsed time is 8.369071 seconds.
In this case, you can easily avoid the non-sliced usage of data, because the result is a constant and can be computed outside the loop. In general, you can perform computations that depend only on broadcast data before the loop starts, since the broadcast data cannot be modified inside the loop. In this case, the computation is trivial and produces a scalar, so you benefit from taking the computation out of the loop.
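A minimal sketch of that fix follows. The call to size(data, 2) moves in front of the loop, so inside the loop data is referenced only through the loop variable and stays sliced. The variable name nCols and the preallocation of out2 are additions for illustration, and N is smaller than in the example above only to keep the sketch light.

```matlab
% Sliced version with the size computed once on the client
M = 100;
N = 1e5;                      % smaller than the 1e6 used above, to save memory
data = rand(M, N);

nCols = size(data, 2);        % scalar, computed before the loop
out2 = zeros(1, M);           % preallocate the sliced output

tic
parfor idx = 1:M
    % data appears only as data(idx, :), so each worker receives
    % just the rows it needs; nCols is a small broadcast scalar.
    out2(idx) = sum(data(idx, :)) ./ nCols;
end
toc
```

Broadcasting the scalar nCols costs very little, whereas broadcasting data sends the whole matrix to every worker.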
Running your code on local workers might offer the convenience of testing your application without requiring the use of cluster resources. However, there are certain drawbacks or limitations with using local workers. Because the transfer of data does not occur over the network, transfer behavior on local workers might not be indicative of how it will typically occur over a network.
With local workers, because all the MATLAB worker sessions are running on the same machine, you might not see any improvement in execution time from a parfor-loop. This can depend on many factors, including how many processors and cores your machine has. The key point here is that a cluster might have more cores available than your local machine. If MATLAB can already multithread your code, the computation is already using the cores of your machine, so the only way to go faster is to use more cores to work on the problem, using a cluster.
You might experiment to see if it is faster to create the arrays before the loop (as shown on the left below), rather than have each worker create its own arrays inside the loop (as shown on the right).
Try the following examples running a parallel pool locally, and notice the difference in execution time for each loop. First open a local parallel pool:
parpool('local')
Run the following examples, then run them again. The first run for each case is slower than the second run, because the parallel pool has to be started and the code has to be made available to the workers. Note the elapsed time, for each case, for the second run.
Create arrays before the loop           Create arrays inside the loop

tic;                                    tic;
n = 200;                                n = 200;
M = magic(n);                           parfor i = 1:n
R = rand(n);                                M = magic(n);
parfor i = 1:n                              R = rand(n);
    A(i) = sum(M(i,:).*R(n+1-i,:));         A(i) = sum(M(i,:).*R(n+1-i,:));
end                                     end
toc                                     toc
Running on a remote cluster, you might find different behavior, as workers can simultaneously create their arrays, saving transfer time. Therefore, code that is optimized for local workers might not be optimized for cluster workers, and vice versa.