This example shows how to solve an embarrassingly parallel problem with uneven work distribution using for drange. Because for drange splits the iterations equally among the labs, it can produce suboptimal load balancing, which is visible using the parallel profiler. The procedures described here also apply to other work distribution problems.
Prerequisites:
Interactive Parallel Mode in Parallel Computing Toolbox™ (see pmode in the user's guide).
The plots in this example are produced from a 12-node MATLAB® cluster. If not otherwise specified, everything else is shown running on a 4-node local cluster. In particular, all text output is from a local cluster.
This example uses for drange to illustrate how you use the profiler to observe suboptimal load distribution. Let's look at this embarrassingly parallel for drange loop.
The objective in the pctdemo_aux_proftaskpar example is to calculate the eigenvalues (eig) of a random matrix of increasing size and pick the maximum absolute value from the resulting vector. The crucial issue is that the matrix size increases with the loop counter ii. Here is the basic iteration:
v(ii) = max( abs( eig( rand(ii) ) ) );
The actual for loop can be seen in the example code, pctdemo_aux_proftaskpar.
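For reference, here is a minimal sketch of the for drange version of that loop (not the actual pctdemo_aux_proftaskpar source), as it would look inside pmode or an spmd block:
v = zeros( 1, 300, codistributor() );       % codistributed result vector
for ii = drange(1:300)
    v(ii) = max( abs( eig( rand(ii) ) ) );  % cost grows with ii
end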
A good practice is to reset the parallel profiler on the cluster before turning on mpiprofile in pmode. This makes sure the data is cleared and the profiler is off and in the default -messagedetail setting.
P>> mpiprofile reset;
P>> mpiprofile on;
Inside a for drange loop there cannot be any communication between the labs, so the -messagedetail option can be set to simplified (see help mpiprofile). If you do not specify the -messagedetail option and you run a program with no communication, you get zeros in the communication fields.
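For example, here is one possible way to turn the profiler on with the simplified message detail (a sketch; see help mpiprofile for the complete option list):
P>> mpiprofile reset;
P>> mpiprofile on -messagedetail simplified;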
P>> v = zeros( 1, 300, codistributor() );
P>> tic;pctdemo_aux_proftaskpar('drange');toc;
1 Start of for-drange loop. The computational complexity increases with the loop index. Done Elapsed time is 0.287845 seconds.
2 Start of for-drange loop. The computational complexity increases with the loop index. Done Elapsed time is 0.351070 seconds.
3 Start of for-drange loop. The computational complexity increases with the loop index. Done Elapsed time is 0.335363 seconds.
4 Start of for-drange loop. The computational complexity increases with the loop index. Done Elapsed time is 0.412805 seconds.
5 Start of for-drange loop. The computational complexity increases with the loop index. Done Elapsed time is 0.482021 seconds.
6 Start of for-drange loop. The computational complexity increases with the loop index. Done Elapsed time is 0.683651 seconds.
7 Start of for-drange loop. The computational complexity increases with the loop index. Done Elapsed time is 0.838188 seconds.
8 Start of for-drange loop. The computational complexity increases with the loop index. Done Elapsed time is 1.005636 seconds.
9 Start of for-drange loop. The computational complexity increases with the loop index. Done Elapsed time is 1.128090 seconds.
10 Start of for-drange loop. The computational complexity increases with the loop index. Done Elapsed time is 1.398578 seconds.
11 Start of for-drange loop. The computational complexity increases with the loop index. Done Elapsed time is 1.589610 seconds.
12 Start of for-drange loop. The computational complexity increases with the loop index. Done Elapsed time is 1.825993 seconds.
In this algorithm the elapsed time should always be longest on the last lab. We use tic and toc here so that we can later compare the longest running time with that of a parfor loop, because profiling inside a parfor loop with mpiprofile is currently not supported.
To get the profiler interface, simply type mpiprofile viewer in pmode. You can also view data from a parallel job; see the help or documentation for information on how to do this.
P>> mpiprofile viewer; % The viewer action also turns off the profiler
1 Sending pmode lab2client to the MATLAB client for asynchronous evaluation.
When the profiler interface opens, by default the Function Summary Report is shown for lab 1. Click Compare max vs min TotalTime to see the difference in work distribution between the first and last lab for all the functions called. Look at the pctdemo_aux_proftaskpar function:
Here are a few steps for spotting uneven work distribution on the MATLAB workers. Uneven work distribution almost certainly prevents a parallelized algorithm from achieving optimal speedup.
Select max Time Aggregate from the Manual Comparison Selection listbox (see Using the Parallel Profiler in Pmode). With this selection you can observe the effective total time of a parallel program.
Click Compare max vs. min TotalTime. As you can see, this loop takes much longer on the last MATLAB worker than on the first one. The for drange loop is clearly not distributing the work evenly, at least on these two labs. To confirm this is true for all the labs, you can use the histogram feature of the Plot View page. Before doing so, click the pctdemo_aux_proftaskpar function to get more specific plots.
Click Plot Time Histograms to see how the computation time was distributed on the four local labs. Observe the total execution time histogram.
In the top figure of this page, only the first few labs take approximately the same amount of time; the others take significantly longer. This large difference in the total time distribution is an indicator of suboptimal load balancing.
Optimal performance for this type of parallelism requires manual distribution of the iterations in pmode, or the use of parfor with parpool. To get better work distribution (with pmode) in this type of problem, you need to create a random distribution of the tasks rather than rely on for drange to statically partition the iterations.
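For illustration, here is a minimal sketch of one such manual approach in pmode (hypothetical code, not part of pctdemo_aux_proftaskpar): shuffle the iteration order identically on every lab, let each lab take a strided share of the shuffled list so that cheap and expensive iterations are mixed, and then combine the per-lab results:
P>> rng(0);                              % same seed on every lab, so the permutation is identical
P>> p = randperm(300);                   % shuffled iteration order
P>> mine = p(labindex:numlabs:end);      % each lab takes a strided share of the shuffled work
P>> vloc = zeros(1, 300);                % local (replicated) result vector
P>> for ii = mine, vloc(ii) = max(abs(eig(rand(ii)))); end
P>> v = gplus(vloc);                     % entries not computed on a lab are zero, so summing reassembles v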
Using parfor is generally better suited to this type of task. With the parfor (i=n:N) construct you get dynamic work distribution, which splits the iterations across all the labs at execution time, so the cluster is better utilized. You can see this by running the same function outside of pmode using a parfor construct, which results in a significantly higher speedup compared to for drange.
To try this with parfor, run the following commands on the MATLAB client outside of pmode.
pmode close;
parpool;
tic;pctdemo_aux_proftaskpar('parfor');toc;
You should get output that looks like this:
Done
Elapsed time is 6.376887 seconds.
There is a significant speedup (nearly two times faster on the 4-node cluster) using parfor instead of for drange, with no change to the actual algorithm. Note that parfor operates as a standard for loop inside of pmode, so be sure to try parfor outside of pmode to get the speedup. See the help for parpool and parfor.
To make a serial (iteration-independent) for loop parallel, you need to add the drange option when inside a parallel job, or replace for with parfor. The parfor loop works as intended only with parpool. You can view the different styles of for-loop parallelism in the code shown in this example; see pctdemo_aux_proftaskpar. The parfor version is under the case 'parfor' and the drange version is under the case 'drange'.
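For comparison with the for drange sketch shown earlier, here is a minimal sketch of the parfor counterpart (again, not the actual pctdemo_aux_proftaskpar source), run on the client with a parallel pool open:
v = zeros(1, 300);
parfor ii = 1:300
    v(ii) = max( abs( eig( rand(ii) ) ) );  % iterations are handed out to the workers dynamically
end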