mcc Command Workflow

Supported Platform: Linux® only.
This example shows you how to use the mcc command to create a deployable archive consisting of MATLAB® map and reduce functions, and then pass the deployable archive as a payload argument to a job submitted to a Hadoop® cluster.
Goal: Calculate the maximum arrival delay of an airline from the given dataset.
Dataset: airlinesmall.csv
Description: Airline departure and arrival information from 1987-2008.
Location: /usr/local/MATLAB/R2017b/toolbox/matlab/demos
When compared to the Hadoop Compiler app workflow, this workflow requires the explicit creation of a Hadoop settings file. Follow the example for details.
Prerequisites
Start this example by creating a new work folder that is visible to the MATLAB search path.
Before starting MATLAB, at a terminal, set the environment variable HADOOP_PREFIX
to point to the Hadoop installation folder. For example:
| Shell | Command |
|---|---|
| csh / tcsh | % setenv HADOOP_PREFIX /usr/lib/hadoop |
| bash | $ export HADOOP_PREFIX=/usr/lib/hadoop |
This example uses /usr/lib/hadoop as the directory where Hadoop is installed. Your Hadoop installation directory may be different.
If you forget to set the HADOOP_PREFIX environment variable before starting MATLAB, set it using the MATLAB function setenv at the MATLAB command prompt as soon as you start MATLAB. For example:
setenv('HADOOP_PREFIX','/usr/lib/hadoop')
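Before launching MATLAB, a quick shell check can confirm the variable is set and points at a real directory. This is only a sketch; /usr/lib/hadoop is the example value from the table above, so substitute your own installation path:

```shell
# Set HADOOP_PREFIX to your Hadoop installation folder (example value).
export HADOOP_PREFIX=/usr/lib/hadoop

# Warn if the variable is empty or does not name an existing directory.
if [ -z "$HADOOP_PREFIX" ]; then
    echo "HADOOP_PREFIX is not set" >&2
elif [ ! -d "$HADOOP_PREFIX" ]; then
    echo "Warning: $HADOOP_PREFIX is not a directory" >&2
fi
echo "HADOOP_PREFIX=$HADOOP_PREFIX"
```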
Install the MATLAB Runtime in a folder that is accessible by every worker node in the Hadoop cluster. This example uses /usr/local/MATLAB/MATLAB_Runtime/v## as the location of the MATLAB Runtime folder.
If you don't have the MATLAB Runtime, you can download it from the website at: http://www.mathworks.com/products/compiler/mcr.
Replace all references to the MATLAB Runtime version v## in this example with the MATLAB Runtime version number corresponding to your MATLAB release. For example, MATLAB R2017b has MATLAB Runtime version number v93. For information about MATLAB Runtime version numbers corresponding to MATLAB releases, see this list.
Copy the map function maxArrivalDelayMapper.m from the /usr/local/MATLAB/R2017b/toolbox/matlab/demos folder to the work folder.
For more information, see Write a Map Function (MATLAB).
Copy the reduce function maxArrivalDelayReducer.m from the matlabroot/toolbox/matlab/demos folder to the work folder.
For more information, see Write a Reduce Function (MATLAB).
Create the directory /user/<username>/datasets on HDFS™ and copy the file airlinesmall.csv to that directory. Here <username> refers to your user name in HDFS.
$ ./hadoop fs -copyFromLocal airlinesmall.csv hdfs://host:54310/user/<username>/datasets
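The staging step above can be sketched as a small script. The user name and namenode address are placeholders to adjust for your cluster, and the hadoop client must be on your PATH for the commands to actually run; otherwise the script just prints what it would do:

```shell
HDFS_USER="<username>"            # replace with your HDFS user name
NAMENODE="hdfs://host:54310"      # example namenode address from above
DEST="$NAMENODE/user/$HDFS_USER/datasets"

if command -v hadoop >/dev/null 2>&1; then
    # Create the target directory (no-op if it exists) and copy the file.
    hadoop fs -mkdir -p "$DEST"
    hadoop fs -copyFromLocal airlinesmall.csv "$DEST"
else
    echo "hadoop client not found; commands to run on a cluster node:"
    echo "  hadoop fs -mkdir -p $DEST"
    echo "  hadoop fs -copyFromLocal airlinesmall.csv $DEST"
fi
```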
Procedure
Start MATLAB and verify that the HADOOP_PREFIX environment variable has been set. At the command prompt, type:
>> getenv('HADOOP_PREFIX')
If ans is empty, review the Prerequisites section above to see how you can set the HADOOP_PREFIX environment variable.
Create a datastore to the file airlinesmall.csv and save it to a .mat file. This datastore object is meant to capture the structure of your actual dataset on HDFS.
ds = datastore('airlinesmall.csv','TreatAsMissing','NA',...
    'SelectedVariableNames','ArrDelay','ReadSize',1000);
save('infoAboutDataset.mat','ds')
In most cases, you will start off by working on a small sample dataset residing on a local machine that is representative of the actual dataset on the cluster. This sample dataset has the same structure and variables as the actual dataset on the cluster. By creating a datastore object to the dataset residing on your local machine, you are taking a snapshot of that structure. By having access to this datastore object, a Hadoop job executing on the cluster will know how to access and process the actual dataset residing on HDFS.
In this example, the sample dataset (local) and the actual dataset on HDFS are the same.
Create a configuration file (config.txt) that specifies the input type of the data, the format of the data specified by the datastore created in the previous step, the output type of the data, the name of the map function, and the name of the reduce function.
mw.ds.in.type = tabulartext
mw.ds.in.format = infoAboutDataset.mat
mw.ds.out.type = keyvalue
mw.mapper = maxArrivalDelayMapper
mw.reducer = maxArrivalDelayReducer
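If you prefer to script the setup, the same configuration file can be written from the shell. This is only a convenience sketch; the key/value pairs are exactly the ones listed above:

```shell
# Write config.txt with the datastore, input/output, and function settings.
cat > config.txt <<'EOF'
mw.ds.in.type = tabulartext
mw.ds.in.format = infoAboutDataset.mat
mw.ds.out.type = keyvalue
mw.mapper = maxArrivalDelayMapper
mw.reducer = maxArrivalDelayReducer
EOF

# Each of the five settings is a "key = value" line.
grep -c ' = ' config.txt    # → 5
```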
Use the mcc command with the -H and -W hadoop: flags to create a deployable archive. However, the mcc command cannot package the results in an installer. The command must be entered as a single line.
mcc -H -W 'hadoop:maxArrivalDelay,CONFIG:config.txt' maxArrivalDelayMapper.m maxArrivalDelayReducer.m -a infoAboutDataset.mat
For more information, see mcc.
MATLAB Compiler™ creates a shell script run_maxArrivalDelay.sh, a deployable archive maxArrivalDelay.ctf, and a log file mccExcludedfiles.log.
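Before submitting the job, it can help to confirm the build produced the expected outputs. The following sketch only reports which of the three files (named after the -W 'hadoop:maxArrivalDelay' archive name) are present in the current folder:

```shell
# Report which of the expected mcc outputs exist in the current folder.
for f in run_maxArrivalDelay.sh maxArrivalDelay.ctf mccExcludedfiles.log; do
    if [ -f "$f" ]; then
        echo "found:   $f"
    else
        echo "missing: $f"
    fi
done
```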
Incorporate the deployable archive containing MATLAB map and reduce functions into a Hadoop mapreduce job from a Linux shell using the following command:
$ hadoop \
jar /usr/local/MATLAB/MATLAB_Runtime/v##/toolbox/mlhadoop/jar/a2.2.0/mwmapreduce.jar \
com.mathworks.hadoop.MWMapReduceDriver \
-D mw.mcrroot=/usr/local/MATLAB/MATLAB_Runtime/v## \
maxArrivalDelay.ctf \
hdfs://host:54310/user/<username>/datasets/airlinesmall.csv \
hdfs://host:54310/user/<username>/results
Alternatively, you can incorporate the deployable archive containing MATLAB map and reduce functions into a Hadoop mapreduce job using the shell script generated by the mcc command. At the Linux shell, type the following command:
$ ./run_maxArrivalDelay.sh \
/usr/local/MATLAB/MATLAB_Runtime/v## \
-D mw.mcrroot=/usr/local/MATLAB/MATLAB_Runtime/v## \
hdfs://host:54310/user/<username>/datasets/airlinesmall.csv \
hdfs://host:54310/user/<username>/results
To examine the results, switch to the MATLAB desktop and create a datastore to the results on HDFS. You can then view the results using the read method.
d = datastore('hdfs:///user/<username>/results/part*');
read(d)
ans =

          Key            Value
    _________________    ______

    'MaxArrivalDelay'    [1014]
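The job returns a single key/value pair because the whole computation reduces to one aggregate: the maximum of the ArrDelay column, with NA values skipped. The same aggregation can be sketched in plain shell with awk over a small made-up sample (the data below is hypothetical, not from airlinesmall.csv):

```shell
# A tiny made-up sample with the same shape as the ArrDelay data.
cat > sample.csv <<'EOF'
Year,ArrDelay
1987,12
1987,NA
1988,37
EOF

# Skip the header row and NA values; track the running maximum.
max=$(awk -F, 'NR > 1 && $2 != "NA" { if (!seen || $2+0 > m) { m = $2+0; seen = 1 } } END { print m }' sample.csv)
echo "MaxArrivalDelay = $max"    # prints "MaxArrivalDelay = 37"
```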
Other examples of map and reduce functions are available in the toolbox/matlab/demos folder. You can use these examples to prototype similar deployable archives that run against Hadoop. For more information, see Build Effective Algorithms with MapReduce (MATLAB).
KeyValueDatastore | TabularTextDatastore | datastore | deploytool | mcc