mapreduce on a Parallel Pool
If you have Parallel Computing Toolbox™ installed, execution of mapreduce can open a parallel pool on the cluster specified by your default profile, for use as the execution environment.
You can set your parallel preferences so that a pool does not automatically open. In this case, you must explicitly start a pool if you want mapreduce to use it for parallelization of its work. See Specify Your Parallel Preferences.
For example, the following conceptual code starts a pool, and some time later uses that open pool for the mapreducer configuration.
p = parpool('local',n);
mr = mapreducer(p);
outds = mapreduce(tds,@MeanDistMapFun,@MeanDistReduceFun,mr)
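If a pool might already be open, a variation on the code above can reuse it rather than start a second one. The following is a minimal sketch, assuming Parallel Computing Toolbox is installed and that the default cluster profile is the one you want. It uses gcp('nocreate') to query for a pool without opening one.

% Query for an existing pool without creating one implicitly.
p = gcp('nocreate');
if isempty(p)
    % No pool is open, so start one on the default cluster profile.
    p = parpool;
end
% Direct subsequent mapreduce calls that use mr to this pool.
mr = mapreducer(p);

This pattern is useful when automatic pool creation is turned off in your parallel preferences, because the script then behaves the same whether or not a pool was started earlier in the session.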
mapreduce can run on any cluster that supports parallel pools. The examples in this topic use a local cluster, which works for all Parallel Computing Toolbox installations.
mapreduce
The following example calculates the mean arrival delay from a datastore of airline data. First it runs mapreduce in the MATLAB client session, then it runs in parallel on a local cluster. The mapreducer function explicitly controls the execution environment.
Begin by starting a parallel pool on a local cluster.
p = parpool('local',4);
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
Create two MapReducer objects for specifying the different execution environments for mapreduce.
inMatlab = mapreducer(0);
inPool = mapreducer(p);
Create and preview the datastore. The data set used in this example is available in matlabroot/toolbox/matlab/demos.
ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
    'SelectedVariableNames','ArrDelay','ReadSize',1000);
preview(ds)
    ArrDelay
    ________
     8
     8
    21
    13
     4
    59
     3
    11
Next, run the mapreduce calculation in the MATLAB® client session. The map and reduce functions are available in matlabroot/toolbox/matlab/demos.
meanDelay = mapreduce(ds,@meanArrivalDelayMapper,@meanArrivalDelayReducer,inMatlab);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  10% Reduce   0%
Map  20% Reduce   0%
Map  30% Reduce   0%
Map  40% Reduce   0%
Map  50% Reduce   0%
Map  60% Reduce   0%
Map  70% Reduce   0%
Map  80% Reduce   0%
Map  90% Reduce   0%
Map 100% Reduce 100%
readall(meanDelay)
           Key             Value
    __________________    ________
    'MeanArrivalDelay'    [7.1201]
Then, run the calculation on the current parallel pool. Note that the output text indicates a parallel mapreduce.
meanDelay = mapreduce(ds,@meanArrivalDelayMapper,@meanArrivalDelayReducer,inPool);
Parallel mapreduce execution on the parallel pool:
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map 100% Reduce  50%
Map 100% Reduce 100%
readall(meanDelay)
           Key             Value
    __________________    ________
    'MeanArrivalDelay'    [7.1201]
With this relatively small data set, a performance improvement with the parallel pool is not likely. The purpose of this example is to show the mechanism for running mapreduce on a parallel pool. As the data set grows, or as the map and reduce functions themselves become more computationally intensive, you might see improved performance with the parallel pool compared to running mapreduce in the MATLAB client session.
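To check whether the pool helps for your own data, you can time both runs. The following is a rough sketch, assuming that ds, inMatlab, and inPool still exist from the steps above. The variable names tClient and tPool are introduced here for illustration, and the 'Display','off' option only suppresses the progress output.

% Time the run in the MATLAB client session.
tic
mapreduce(ds,@meanArrivalDelayMapper,@meanArrivalDelayReducer,inMatlab,'Display','off');
tClient = toc;

% Time the same calculation on the parallel pool.
tic
mapreduce(ds,@meanArrivalDelayMapper,@meanArrivalDelayReducer,inPool,'Display','off');
tPool = toc;

fprintf('Client: %.2f s, pool: %.2f s\n',tClient,tPool);

A single tic/toc measurement includes one-time overhead such as transferring work to the workers, so treat the numbers as indicative rather than as a benchmark.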
When running parallel mapreduce on a cluster, the order of the key-value pairs in the output can differ from the order produced by running mapreduce in MATLAB. If your application depends on the arrangement of data in the output, you must sort the data according to your own requirements.
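One way to make the arrangement independent of the execution environment is to sort the table returned by readall. The following is a minimal sketch that uses a stand-in table with made-up keys and values instead of real mapreduce output. It assumes the keys are character vectors, so that the Key variable is a cell array of character vectors.

% Stand-in for the table returned by readall(outds).
% The keys and values are hypothetical, for illustration only.
result = table({'UA';'AA';'DL'},{12.3;7.1;9.8}, ...
    'VariableNames',{'Key','Value'});

% Sort the rows by key so that the order is the same for every run.
sorted = sortrows(result,'Key');
disp(sorted)

With real output, replace the stand-in table with the result of readall on the datastore that mapreduce returns.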