This example shows how to use the mcc command to create a deployable
archive that calculates the maximum arrival delay in an airline dataset. The
archive that you create contains all the MATLAB® content associated
with the component. The mcc command also creates a shell
script to run the deployable archive against Hadoop®. You can use the shell script to
customize the execution of the deployable archive within your particular Hadoop environment.
This example uses the MaxMapReduceExample.m example
file and the airline dataset, airlinesmall.csv,
both available in the toolbox/matlab/demos folder.
Move your example code to a new working folder for deployment. Putting
the new working folder on the path ensures that the files are accessible
to MATLAB Compiler™.
Note: Deployable archives that run against Hadoop using the Hadoop Compiler app are supported only on Linux®.
Set environment variables and cluster properties for your Hadoop configuration. These properties are necessary for submitting jobs to your Hadoop cluster.
Set the environment variable HADOOP_HOME to
point to your Hadoop installation
folder. Modify the system path to include $HADOOP_HOME/bin.
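One possible way to do this from within MATLAB is sketched below. The install location /usr/local/hadoop is a hypothetical placeholder, not a value from this example, and the settings apply only to the current MATLAB session and the processes it starts. To make them permanent, set the variables in your shell startup file instead.

```matlab
% Hypothetical Hadoop install folder; replace with your own location.
hadoopHome = '/usr/local/hadoop';
setenv('HADOOP_HOME', hadoopHome);

% Prepend $HADOOP_HOME/bin to the system path seen by this MATLAB session,
% unless it is already there.
hadoopBin = fullfile(hadoopHome, 'bin');
currentPath = getenv('PATH');
if isempty(strfind(currentPath, hadoopBin))
    setenv('PATH', [hadoopBin pathsep currentPath]);
end

% Display the values to confirm the change.
disp(getenv('HADOOP_HOME'))
disp(getenv('PATH'))
```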
Install the MATLAB Runtime in a folder that is accessible
by every worker node in the Hadoop cluster. The following example
uses /hd-shared/MCR/v84.
Download the MATLAB Runtime from the website at http://www.mathworks.com/products/compiler/mcr.
Copy airlinesmall.csv into the Hadoop Distributed
File System (HDFS™) folder /datasets/airlinemod.
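As a rough sketch, the copy could be done from the MATLAB prompt with the standard hadoop fs client commands. This assumes the hadoop client is on the system path and that you have write permission on the HDFS folder. If the client is not found, the snippet only prints the commands so that you can run them in a shell.

```matlab
% Local copy of the dataset shipped with MATLAB.
localFile = fullfile(matlabroot, 'toolbox', 'matlab', 'demos', 'airlinesmall.csv');
hdfsFolder = '/datasets/airlinemod';

% Build the HDFS commands: create the folder, then copy the file into it.
mkdirCmd = sprintf('hadoop fs -mkdir -p %s', hdfsFolder);
copyCmd  = sprintf('hadoop fs -copyFromLocal "%s" %s', localFile, hdfsFolder);

% Run the commands only if the hadoop client can be found.
[notFound, ~] = system('which hadoop');
if notFound == 0
    status = system(mkdirCmd);
    if status == 0
        status = system(copyCmd);
    end
    if status ~= 0
        warning('Copy to HDFS did not complete; check the messages above.');
    end
else
    disp('hadoop client not found. Run these commands in a shell:');
    disp(mkdirCmd);
    disp(copyCmd);
end
```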
Copy the map function maxArrivalDelayMapper.m from the toolbox/matlab/demos folder
to the working folder.
function maxArrivalDelayMapper (data, info, intermKVStore)
partMax = max(data.ArrDelay);
add(intermKVStore,'PartialMaxArrivalDelay',partMax);
For more information, see Write a Map Function.
Copy the reduce function maxArrivalDelayReducer.m from the toolbox/matlab/demos folder
to the working folder.
function maxArrivalDelayReducer(intermKey, intermValIter, outKVStore)
maxVal = -inf;
while hasnext(intermValIter)
    maxVal = max(getnext(intermValIter), maxVal);
end
add(outKVStore,'MaxArrivalDelay',maxVal);
For more information, see Write a Reduce Function.
Create a datastore object, as in MaxMapReduceExample.m, and
save the datastore to a .mat file.
ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
    'SelectedVariableNames','ArrDelay','ReadSize',1000);
save('airlinesmall.mat','ds')
For more information, see Getting Started with Datastore.
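Before compiling, it can be useful to run the same map and reduce functions locally to confirm that they behave as expected. The following is a minimal sketch of such a check, not part of the deployment workflow itself. It assumes that maxArrivalDelayMapper.m and maxArrivalDelayReducer.m are in the working folder and that airlinesmall.csv is on the MATLAB path.

```matlab
% Recreate the datastore used for deployment.
ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
    'SelectedVariableNames','ArrDelay','ReadSize',1000);

% Look at the first few rows to confirm that ArrDelay is read as expected.
preview(ds)

% Run the map and reduce functions in the local MATLAB session.
% mapreduce returns a KeyValueDatastore that holds the reducer output.
localResult = mapreduce(ds, @maxArrivalDelayMapper, @maxArrivalDelayReducer);

% Read the key-value pairs into a table and display them.
localTable = readall(localResult)
```

If the local run returns a single MaxArrivalDelay key, the same functions are ready to be packaged for Hadoop.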
A Hadoop settings
file specifies the input type tabulartext, the output type binary,
the map function, the reduce function, and the previously created datastore.
This example names the file MWHadoopSetting.txt.
mw.ds.in.type = tabulartext
mw.ds.in.format = airlinesmall.mat
mw.ds.out.type = binary
mw.mapper = maxArrivalDelayMapper
mw.reducer = maxArrivalDelayReducer
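You can create the settings file in any text editor. As an alternative sketch, the following writes the same five lines from MATLAB. The file name MWHadoopSetting.txt is taken from the mcc command in this example, and writing the file programmatically is only a convenience, not a requirement.

```matlab
% Settings lines, copied from the settings file shown above.
settings = {
    'mw.ds.in.type = tabulartext'
    'mw.ds.in.format = airlinesmall.mat'
    'mw.ds.out.type = binary'
    'mw.mapper = maxArrivalDelayMapper'
    'mw.reducer = maxArrivalDelayReducer'
    };

% Write one setting per line to the file in the working folder.
fid = fopen('MWHadoopSetting.txt', 'w');
assert(fid ~= -1, 'Could not open MWHadoopSetting.txt for writing.');
fprintf(fid, '%s\n', settings{:});
fclose(fid);

% Display the file to confirm its contents.
type('MWHadoopSetting.txt')
```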
Use the mcc command with the -H and -W flags
to create a deployable archive that runs against Hadoop. In the command
that follows, the -W argument names the archive airlinesmall and identifies
the settings file, and the -a flag adds the saved datastore to the archive.
The mcc command cannot package the results in an
installer. The command must be entered as a single line.
mcc -H -W 'hadoop:airlinesmall,CONFIG:MWHadoopSetting.txt' maxArrivalDelayMapper.m maxArrivalDelayReducer.m -a airlinesmall.mat
For more information, see mcc.
MATLAB Compiler creates a shell script run_airlinesmall.sh,
a deployable archive airlinesmall.ctf, and a log
file mccExcludedFiles.log.
Deploy the archive as a Hadoop
job by pointing the job to the CSV files in the airline dataset. The
arguments in the command are the MATLAB Runtime root (MCRRoot), Hadoop properties
defined using the -D flag, the data files, and the new
results folder. The command must be entered as a single line.
!./run_airlinesmall.sh /hd-shared/MCR/v84 -D mw.mcrroot=/hd-shared/MCR/v84 "/datasets/airlinemod/*.csv" myresults
Visualize and plot the results.
ds = datastore('hdfs://hadoop01/user/username/myresults/part*',...
    'Type', 'keyvalue')
airlinesmallResult = readall(ds)

           Key             Value
    _________________    ______
    'MaxArrivalDelay'    [1014]
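The result is a table with one row per key, and each value is stored in a cell. To use the number in later calculations, you can extract it by key. The sketch below builds a stand-in table with the values shown above so that it runs without a cluster. In practice, airlinesmallResult comes from readall(ds).

```matlab
% Stand-in for the table returned by readall(ds), using the values shown above.
airlinesmallResult = table({'MaxArrivalDelay'}, {1014}, ...
    'VariableNames', {'Key', 'Value'});

% Find the row for the key written by the reducer and unpack its value.
row = strcmp(airlinesmallResult.Key, 'MaxArrivalDelay');
maxDelay = airlinesmallResult.Value{row};

% ArrDelay in the airline dataset is recorded in minutes.
fprintf('Maximum arrival delay: %d minutes\n', maxDelay);
```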
Other examples of map and reduce functions
are available in the toolbox/matlab/demos folder. You
can use these examples to prototype similar deployable archives that
run against Hadoop. For
more information, see Build Effective Algorithms with MapReduce.
datastore | deploytool | KeyValueDatastore | mcc | TabularTextDatastore