This example shows how to create a deployable archive that calculates the maximum arrival delay in an airline dataset. The archive runs against Hadoop® and is built with the Hadoop Compiler app, which is accessible from deploytool.
The archive that you create contains all the MATLAB® content associated with the component. The Hadoop Compiler app generates mcc commands that you can customize to your specification.
This example uses the MaxMapReduceExample.m example file and the airline dataset, airlinesmall.csv, both available in the toolbox/matlab/demos folder.
Move your example code to a new working folder for deployment. Keeping the new working folder on the path ensures that the files are accessible by MATLAB Compiler™.
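The staging step can be sketched in MATLAB as follows. This is a minimal sketch, not a required procedure: the folder name hadoop_deploy_work is hypothetical, and the code assumes the example files sit in toolbox/matlab/demos under matlabroot, as stated above. It also copies the map and reduce functions that later steps of this example use.

```matlab
% Sketch: stage the example files in a fresh working folder.
% 'hadoop_deploy_work' is a hypothetical name; choose your own.
workDir = fullfile(pwd, 'hadoop_deploy_work');
if ~exist(workDir, 'dir')
    mkdir(workDir);
end

% Assumed location of the shipped example files.
demoDir = fullfile(matlabroot, 'toolbox', 'matlab', 'demos');
files = {'MaxMapReduceExample.m', ...
         'maxArrivalDelayMapper.m', ...
         'maxArrivalDelayReducer.m'};

for k = 1:numel(files)
    src = fullfile(demoDir, files{k});
    if exist(src, 'file')
        copyfile(src, workDir);
    else
        % The file layout can differ between releases.
        warning('Example file not found: %s', src);
    end
end

addpath(workDir);   % keep the working folder on the MATLAB path
cd(workDir);
```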
Note: Deployable archives that run against Hadoop and are built with the Hadoop Compiler app are supported only on Linux®.
Set environment variables and cluster properties for your Hadoop configuration. These properties are necessary for submitting jobs to your Hadoop cluster.
Set the environment variable HADOOP_HOME to point at your Hadoop install folder. Modify the system path to include $HADOOP_HOME/bin.
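If you prefer to set these variables from within MATLAB rather than in your shell profile, a sketch along the following lines may help. The install path /usr/lib/hadoop is a placeholder assumption, not a value from this example. Variables set with setenv apply only to the current MATLAB session and the processes it launches.

```matlab
% Sketch: set HADOOP_HOME and extend PATH for this MATLAB session only.
% '/usr/lib/hadoop' is a hypothetical install folder; use your own.
hadoopHome = '/usr/lib/hadoop';
setenv('HADOOP_HOME', hadoopHome);

% Prepend $HADOOP_HOME/bin so that the hadoop command can be found.
hadoopBin = fullfile(hadoopHome, 'bin');
setenv('PATH', [hadoopBin pathsep getenv('PATH')]);

disp(getenv('HADOOP_HOME'))
```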
Install the MATLAB Runtime in a folder that is accessible by every worker node in the Hadoop cluster.
This example uses /hd-shared/MCR/v84 as the MATLAB Runtime folder.
For information on installing the MATLAB Runtime, see Install the MATLAB Runtime.
Copy airlinesmall.csv into the Hadoop Distributed File System (HDFS™) folder /datasets/airlinemod.
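One way to perform this copy from the MATLAB prompt is to call the Hadoop file system shell through system. This is a sketch under two assumptions: the hadoop command is on the system path, and the Hadoop version supports the -p option of -mkdir (Hadoop 2.x or later).

```matlab
% Sketch: copy the dataset into HDFS by calling the hadoop shell.
% Assumes 'hadoop' is on the system path.
localFile = which('airlinesmall.csv');   % full path of the copy on the MATLAB path
hdfsDir   = '/datasets/airlinemod';

mkdirCmd = sprintf('hadoop fs -mkdir -p %s', hdfsDir);
putCmd   = sprintf('hadoop fs -put "%s" %s', localFile, hdfsDir);

[status, out] = system(mkdirCmd);
if status == 0
    [status, out] = system(putCmd);
end
if status ~= 0
    warning('HDFS copy did not succeed: %s', out);
end
```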
Copy the map function maxArrivalDelayMapper.m from the toolbox/matlab/demos folder to the working folder.
function maxArrivalDelayMapper(data, info, intermKVStore)
  % Find the largest arrival delay in this chunk of data.
  partMax = max(data.ArrDelay);
  % Store it under a single intermediate key.
  add(intermKVStore,'PartialMaxArrivalDelay',partMax);
For more information, see Write a Map Function.
Copy the reduce function maxArrivalDelayReducer.m from the toolbox/matlab/demos folder to the working folder.
function maxArrivalDelayReducer(intermKey, intermValIter, outKVStore)
  % Combine the partial maxima from all chunks into one overall maximum.
  maxVal = -inf;
  while hasnext(intermValIter)
    maxVal = max(getnext(intermValIter), maxVal);
  end
  add(outKVStore,'MaxArrivalDelay',maxVal);
For more information, see Write a Reduce Function.
Create a datastore object using the datastore call from MaxMapReduceExample.m, and save the datastore to a .mat file.
ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
     'SelectedVariableNames','ArrDelay','ReadSize',1000);
save('airlinesmall.mat','ds')
For more information, see Getting Started with Datastore.
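Before packaging, it can be useful to confirm that the saved datastore and the map and reduce functions work together locally. The following is a sketch that assumes airlinesmall.mat, maxArrivalDelayMapper.m, and maxArrivalDelayReducer.m are all in the current working folder. It runs mapreduce in the local MATLAB session, not on Hadoop.

```matlab
% Sketch: check the saved datastore and the map/reduce pair locally.
s = load('airlinesmall.mat', 'ds');   % s.ds is the saved datastore

% Show the first few rows of the selected variable.
preview(s.ds)

% Run the same mapper and reducer in the local MATLAB session.
result = mapreduce(s.ds, @maxArrivalDelayMapper, @maxArrivalDelayReducer);

% The output is a key-value datastore; read it back as a table.
t = readall(result)
```

The local value can later be compared with the result that the deployed archive writes to the Hadoop results folder.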
Launch the Hadoop Compiler app from the MATLAB command line or from the apps gallery. At the MATLAB command line, type the following command:
hadoopCompiler
In the Map Function section of the toolstrip, click the plus button to add the map file, which contains the map function. Browse to and select the map function file maxArrivalDelayMapper.m.
In the Reduce Function section of the toolstrip, click the plus button to add the reduce file, which contains the reduce function. Browse to and select the reduce function file maxArrivalDelayReducer.m.
In the Input Types section, select tabulartext as the input type. By default, the input type is tabulartext.
In the Output Types section, select tabulartext as the output type. By default, the output type is binary.
Rename the application to maxArrivalDelay.
In the Data store file field, click Browse and select the airlinesmall.mat file, which contains the saved datastore object.
Click Package to build a deployable archive.
The Hadoop Compiler app creates a log file, PackagingLog.txt, and two folders, for_redistribution and for_testing.
The for_redistribution folder contains a readme file, the shell script run_maxarrivaldelay.sh, and the deployable archive maxarrivaldelay.ctf. The for_testing folder contains the same three files and a log file, mccExcludedfiles.log.
At the MATLAB command prompt, run the deployable archive against Hadoop using the generated shell script. The arguments in the command are the MATLAB Runtime root (MCRRoot), Hadoop properties defined using the -D flag, the data file, and the new results folder. The command to execute the script must be entered as a single line.
cd maxArrivalDelay/for_testing
!./run_maxarrivaldelay.sh /hd-shared/MCR/v84 -D mw.mcrroot=/hd-shared/MCR/v84 /datasets/airlinemod/airlinesmall.csv myresults
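Because the whole command must be on one line, it can be easier to assemble it from its parts with sprintf and hand it to system. The following sketch uses the paths from this example and assumes the -D property is written in key=value form without spaces around the equals sign. The call that actually launches the job is left commented out so that the command can be inspected first.

```matlab
% Sketch: build the run command as one line from its parts.
mcrRoot   = '/hd-shared/MCR/v84';                      % MATLAB Runtime root
inputFile = '/datasets/airlinemod/airlinesmall.csv';   % data file in HDFS
outputDir = 'myresults';                               % new results folder

cmd = sprintf('./run_maxarrivaldelay.sh %s -D mw.mcrroot=%s %s %s', ...
              mcrRoot, mcrRoot, inputFile, outputDir);
disp(cmd)

% Uncomment to run from the for_testing folder on a configured cluster:
% [status, out] = system(cmd);
```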
Examine the results using the Hadoop command.
!./hadoop fs -cat myresults/*
'MaxArrivalDelay' [1014]
Other examples of map and reduce functions are available in the toolbox/matlab/demos folder. You can use these examples to prototype similar deployable archives that run against Hadoop.
For more information, see Build Effective Algorithms with MapReduce.
datastore | deploytool | KeyValueDatastore | TabularTextDatastore