mapreduce on a Hadoop Cluster
Before you can run mapreduce on a Hadoop® cluster, make sure that the cluster and client machine are properly configured. Consult your system administrator, or see Configure a Hadoop Cluster.
When running mapreduce on a Hadoop cluster with binary output (the default), the resulting KeyValueDatastore points to Hadoop Sequence files, instead of binary MAT-files as generated by mapreduce in other environments. For more information, see the 'OutputType' argument description on the mapreduce reference page.
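As an illustration of the 'OutputType' argument, the following sketch requests tab-delimited text output instead of the default binary output. It is a fragment rather than a complete program: it assumes that ds, mr, and outputFolder are defined as in the mean airline delay example on this page, and that text output suits your keys and values.

```matlab
% Assumes ds, mr, and outputFolder already exist, as in the example on this page.
% 'Binary' is the default; 'TabularText' requests tab-delimited text output.
textResult = mapreduce(ds,@meanArrivalDelayMapper,@meanArrivalDelayReducer,mr,...
    'OutputFolder',outputFolder,'OutputType','TabularText');
```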
When running mapreduce on a Hadoop cluster, the order of the key-value pairs in the output can differ from the order produced when running mapreduce in other environments. If your application depends on the arrangement of data in the output, you must sort the data according to your own requirements.
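One way to make the output order predictable is to read the results into a table and sort by key. The sketch below uses a small stand-in table with made-up keys and values in place of the table that readall returns for a KeyValueDatastore; the variable names result and sorted are illustrative only.

```matlab
% Stand-in for the table returned by readall on a KeyValueDatastore.
% The keys and values are invented for illustration.
result = table({'UA';'AA';'DL'}, {12.5; 7.1; 9.8}, ...
    'VariableNames', {'Key','Value'});

% Sort the rows by key so the order no longer depends on the environment.
sorted = sortrows(result,'Key');
disp(sorted)
```

With real output, replace the stand-in table with the result of readall on the datastore returned by mapreduce, and sort on whichever variable your application depends on.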
This example shows how to modify the MATLAB® example for calculating mean airline delays to run on a Hadoop cluster.
First, you must set environment variables and cluster properties as appropriate for your specific Hadoop configuration. See your system administrator for the values for these and other properties necessary for submitting jobs to your cluster.
setenv('HADOOP_HOME','/share/hadoop/a2.2.0');
cluster = parallel.cluster.Hadoop;
cluster.HadoopProperties('mapred.job.tracker') = 'hadoophost1:50031';
cluster.HadoopProperties('fs.default.name') = 'hdfs://hadoophost2:8020';
outputFolder = '/home/user/logs/hadooplog';
Note
The specified |
Create a MapReducer object to specify that mapreduce should use your Hadoop cluster.
mr = mapreducer(cluster);
Create and preview the datastore. The data set is available in matlabroot/toolbox/matlab/demos.
ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
    'SelectedVariableNames','ArrDelay','ReadSize',1000);
preview(ds)

    ArrDelay
    ________

     8
     8
    21
    13
     4
    59
     3
    11
Call mapreduce to execute on the Hadoop cluster specified by mr. The map and reduce functions are available in matlabroot/toolbox/matlab/demos.
meanDelay = mapreduce(ds,@meanArrivalDelayMapper,@meanArrivalDelayReducer,mr,...
    'OutputFolder',outputFolder)

Parallel mapreduce execution on the Hadoop cluster:
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  66% Reduce   0%
Map 100% Reduce  66%
Map 100% Reduce 100%

meanDelay =

  KeyValueDatastore with properties:

       Files: {
              ' .../tmp/alafleur/tpc00621b1_4eef_4abc_8078_646aa916e7d9/part0.seq'
              }
    ReadSize: 1 key-value pairs
    FileType: 'seq'
Read the result.
readall(meanDelay)
           Key             Value
    __________________    ________

    'MeanArrivalDelay'    [7.1201]
Although this example uses a local data set for demonstration purposes, a data set used with Hadoop is likely to be stored in an HDFS™ file system. Likewise, you might be required to store the mapreduce output in HDFS. For details about accessing HDFS in MATLAB, see Read from HDFS.
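As a rough sketch of what reading the same data from HDFS might look like, the datastore can be pointed at an hdfs:// location instead of a local file. The host and port reuse the fs.default.name value from the example above; the folder /user/data is hypothetical and stands in for wherever your administrator has placed the file. This fragment needs a reachable Hadoop installation to run.

```matlab
% HADOOP_HOME must point at your Hadoop installation (value taken from the example above).
setenv('HADOOP_HOME','/share/hadoop/a2.2.0');

% Hypothetical HDFS location; substitute the host, port, and path for your cluster.
ds = datastore('hdfs://hadoophost2:8020/user/data/airlinesmall.csv',...
    'TreatAsMissing','NA','SelectedVariableNames','ArrDelay');
preview(ds)
```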