Run mapreduce on a Parallel Pool

Start Parallel Pool

If you have Parallel Computing Toolbox™ installed, running mapreduce can automatically open a parallel pool on the cluster specified by your default profile and use that pool as the execution environment.

You can set your parallel preferences so that a pool does not automatically open. In this case, you must explicitly start a pool if you want mapreduce to use it for parallelization of its work. See Specify Your Parallel Preferences.

For example, the following conceptual code starts a pool of n workers and, some time later, uses that open pool for the mapreducer configuration. Here n, the datastore tds, and the map and reduce functions are placeholders.

p = parpool('local',n);
mr = mapreducer(p);
outds = mapreduce(tds,@MeanDistMapFun,@MeanDistReduceFun,mr)
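If automatic pool creation is turned off in your preferences, it can be convenient to reuse a pool that is already open and to start one only when none exists. The following is a minimal sketch of that pattern, assuming the 'local' profile used throughout this topic; it is one possible approach, not the only way to manage the pool.

```matlab
% Sketch: reuse an open pool if there is one; otherwise start one explicitly.
p = gcp('nocreate');        % returns empty when no pool is open; never creates one
if isempty(p)
    p = parpool('local');   % default number of workers for the 'local' profile
end
mr = mapreducer(p);         % MapReducer object that targets this pool
```

Pass mr as the last input argument to mapreduce, as in the conceptual code above.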

Note

mapreduce can run on any cluster that supports parallel pools. The examples in this topic use a local cluster, which works for all Parallel Computing Toolbox installations.

Compare Parallel `mapreduce`

The following example calculates the mean arrival delay from a datastore of airline data. It first runs mapreduce in the MATLAB client session, and then runs the same calculation in parallel on a local cluster. In both cases, the mapreducer function explicitly controls the execution environment.

Begin by starting a parallel pool on a local cluster.

p = parpool('local',4);

Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.

Create two MapReducer objects for specifying the different execution environments for mapreduce.

inMatlab = mapreducer(0);
inPool = mapreducer(p);

Create and preview the datastore. The data set used in this example is available in matlabroot/toolbox/matlab/demos.

ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
     'SelectedVariableNames','ArrDelay','ReadSize',1000);
preview(ds)

Next, run the mapreduce calculation in the MATLAB® client session. The map and reduce functions are available in matlabroot/toolbox/matlab/demos.

meanDelay = mapreduce(ds,@meanArrivalDelayMapper,@meanArrivalDelayReducer,inMatlab);

********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  10% Reduce   0%
Map  20% Reduce   0%
Map  30% Reduce   0%
Map  40% Reduce   0%
Map  50% Reduce   0%
Map  60% Reduce   0%
Map  70% Reduce   0%
Map  80% Reduce   0%
Map  90% Reduce   0%
Map 100% Reduce 100%

readall(meanDelay)

           Key             Value  
    __________________    ________

    'MeanArrivalDelay'    [7.1201]
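To make the calculation above easier to follow, here is a sketch of how a map and reduce function pair for a mean could be written. The function names sketchMeanDelayMapper and sketchMeanDelayReducer and the intermediate key 'PartialCountSumDelay' are hypothetical, chosen for illustration; they are not presented as the shipped meanArrivalDelayMapper and meanArrivalDelayReducer, which you can inspect in the demos folder. Each function would be saved in its own file on the MATLAB path.

```matlab
% Hypothetical file sketchMeanDelayMapper.m
function sketchMeanDelayMapper(data, ~, intermKVStore)
    % data is one block of the datastore, read as a table with the variable ArrDelay.
    delays = data.ArrDelay(~isnan(data.ArrDelay));   % drop missing values
    % Emit the partial count and partial sum for this block under a single key.
    add(intermKVStore, 'PartialCountSumDelay', [numel(delays), sum(delays)]);
end

% Hypothetical file sketchMeanDelayReducer.m
function sketchMeanDelayReducer(~, intermValIter, outKVStore)
    totalCount = 0;
    totalSum = 0;
    % Accumulate the partial counts and sums emitted by all map calls.
    while hasnext(intermValIter)
        countSum = getnext(intermValIter);
        totalCount = totalCount + countSum(1);
        totalSum = totalSum + countSum(2);
    end
    % Emit the overall mean as the single output key-value pair.
    add(outKVStore, 'MeanArrivalDelay', totalSum / totalCount);
end
```

Because each map call emits only a count and a sum, the reducer can combine the partial results in any order. This is why the same functions work unchanged in the client session and on a parallel pool.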

Then, run the calculation on the current parallel pool. Note that the output text indicates a parallel mapreduce.

meanDelay = mapreduce(ds,@meanArrivalDelayMapper,@meanArrivalDelayReducer,inPool);

Parallel mapreduce execution on the parallel pool:
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map 100% Reduce  50%
Map 100% Reduce 100%

readall(meanDelay)

           Key             Value  
    __________________    ________

    'MeanArrivalDelay'    [7.1201]

With this relatively small data set, the parallel pool is unlikely to improve performance. The purpose of this example is to show the mechanism for running mapreduce on a parallel pool. As the data set grows, or as the map and reduce functions become more computationally intensive, you can expect the parallel pool to perform better than running mapreduce in the MATLAB client session.
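One way to check how the two execution environments compare on your own data is to time each run with tic and toc. The following sketch repeats the setup from this example so that it stands on its own; the variable names tClient and tPool are illustrative, and the measured times depend on your hardware and on pool startup overhead.

```matlab
% Sketch: time the same mapreduce call in the client session and on a pool.
ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
     'SelectedVariableNames','ArrDelay','ReadSize',1000);

p = gcp('nocreate');              % reuse an open pool if one exists
if isempty(p)
    p = parpool('local',4);
end
inMatlab = mapreducer(0);         % run in the MATLAB client session
inPool = mapreducer(p);           % run on the parallel pool

tic
mapreduce(ds,@meanArrivalDelayMapper,@meanArrivalDelayReducer,inMatlab);
tClient = toc;

tic
mapreduce(ds,@meanArrivalDelayMapper,@meanArrivalDelayReducer,inPool);
tPool = toc;

fprintf('Client session: %.2f s, parallel pool: %.2f s\n',tClient,tPool);
```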

Note

When you run parallel mapreduce on a cluster, the order of the key-value pairs in the output can differ from the order produced by running mapreduce in the MATLAB client session. If your application depends on the arrangement of data in the output, you must sort the data according to your own requirements.
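As an illustration of sorting the output, the following sketch orders a table of key-value pairs by key with sortrows. The table here is a stand-in with hypothetical keys and values, built to mimic the Key and Value variables shown in the readall output in this example. In practice you would start from the table returned by readall on your own output datastore.

```matlab
% Stand-in for the table returned by readall; keys and values are hypothetical.
result = table({'b';'a';'c'}, {2;1;3}, 'VariableNames', {'Key','Value'});

% Sort the rows alphabetically by key so the order no longer depends on
% where mapreduce ran.
sorted = sortrows(result,'Key');
```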
