Supported Platform: Linux® only.
This example shows you how to deploy a standalone application to Spark™ using the MATLAB® API for Spark. Your application can be deployed against Spark using one of two supported cluster managers: local and Hadoop® YARN. This example shows you how to deploy your application using both cluster managers. For a discussion on cluster managers, see Cluster Managers Supported by Spark.
Goal: Count the number of unique airlines in the given dataset.
Dataset: airlinesmall.csv
Description: Airline departure and arrival information from 1987-2008.
Location: /usr/local/MATLAB/R2017b/toolbox/matlab/demos
Helper Function
Create a MATLAB file named carrierToCount.m with the following code:
function results = carrierToCount(input)
% Count occurrences of each unique carrier in one chunk of the datastore.
% input is a cell array whose first element is a table containing a
% UniqueCarrier variable; results is a cell array of {key, count} pairs.
tbl = input{1};
intermKeys = tbl.UniqueCarrier;
[intermKeys, ~, idx] = unique(intermKeys);
intermValues = num2cell(accumarray(idx, ones(size(idx))));
results = cellfun( @(x,y) {x,y} , ...
    intermKeys, intermValues, ...
    'UniformOutput', false);
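Before deploying, you can sanity-check the helper locally on a small in-memory table. This snippet is illustrative only and is not part of the deployed application:
% Hypothetical quick check of carrierToCount
t = table({'AA'; 'US'; 'AA'}, 'VariableNames', {'UniqueCarrier'});
results = carrierToCount({t});
celldisp(results)   % expect the pairs {'AA', 2} and {'US', 1}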
If you are using Spark version 1.6 or higher, you must increase the Java® heap size in MATLAB to at least 512 MB. For information on how to increase the Java heap size in MATLAB, see Java Heap Memory Preferences (MATLAB).
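To confirm the current limit from the MATLAB command window, you can query the Java runtime directly; this check is a convenience and not part of the example:
% Maximum Java heap available to MATLAB, in bytes
java.lang.Runtime.getRuntime.maxMemory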
A local cluster manager represents a pseudo Spark-enabled cluster and works in a non-distributed mode on a single machine. It can be configured to use one worker thread or, on a multi-core machine, multiple worker threads. In applications, it is denoted by the word local.
A local cluster manager is handy for debugging your application prior to full-blown deployment on a Spark-enabled Hadoop cluster.
Start this example by creating a new work folder that is visible to the MATLAB search path.
Create the helper function carrierToCount.m mentioned above.
Specify Spark properties.
Use a containers.Map object to specify Spark properties.
sparkProp = containers.Map( ...
    {'spark.executor.cores', ...
     'spark.matlab.worker.debug'}, ...
    {'1', ...
     'true'});
Spark properties specify the Spark execution environment of the application that is being deployed. Every application must be configured with specific Spark properties before it can be deployed.
For more information on Spark properties, expand the prop value of the 'SparkProperties' name-value pair in the Input Arguments section of the SparkConf class.
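Because the properties are held in an ordinary containers.Map object, you can inspect or extend the map before constructing the configuration; the added property below is illustrative only:
% Illustrative: add another standard Spark property, then list all keys
sparkProp('spark.executor.memory') = '2g';
keys(sparkProp)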
Create a SparkConf object.
Use the class matlab.compiler.mlspark.SparkConf to create a SparkConf object. A SparkConf object stores the configuration parameters of the application being deployed to Spark. The configuration parameters of an application are passed on to a Spark cluster through a SparkContext.
conf = matlab.compiler.mlspark.SparkConf( ...
    'AppName', 'mySparkAppDepLocal', ...
    'Master', 'local[1]', ...
    'SparkProperties', sparkProp);
For more information on SparkConf, see matlab.compiler.mlspark.SparkConf.
Create a SparkContext object.
Use the class matlab.compiler.mlspark.SparkContext with the SparkConf object as an input to create a SparkContext object.
sc = matlab.compiler.mlspark.SparkContext(conf);
A SparkContext object serves as an entry point to Spark by initializing a connection to a Spark cluster. It accepts a SparkConf object as an input argument and uses the parameters specified in that object to set up the internal services necessary to establish a connection to the Spark execution environment.
For more information on SparkContext, see matlab.compiler.mlspark.SparkContext.
Create an RDD object from the data.
Use the MATLAB function datastore to create a datastore object pointing to the file airlinesmall.csv. Then use the SparkContext method datastoreToRDD to convert the datastore object to a Spark RDD object.
% Create a MATLAB datastore (LOCAL)
ds = datastore('airlinesmall.csv', ...
    'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', 'UniqueCarrier');

% Convert MATLAB datastore to Spark RDD
rdd = sc.datastoreToRDD(ds);
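Optionally, you can confirm that the datastore resolves as expected before converting it; preview is a standard datastore function:
% Optional check: display the first few rows of the selected variable
preview(ds)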
In general, input RDDs can be created using the following methods of the SparkContext class: parallelize, datastoreToRDD, and textFile.
Perform operations on the RDD object.
Use a Spark RDD method such as flatMap to apply a function to all elements of the RDD object and flatten the results. The function carrierToCount that was created earlier serves as the function that is applied to the elements of the RDD. A function handle to carrierToCount is passed as an input argument to the flatMap method.
maprdd = rdd.flatMap(@carrierToCount);
redrdd = maprdd.reduceByKey( @(acc,value) acc + value );
countdata = redrdd.collect();

% Count and display carrier occurrences
count = 0;
for i = 1:numel(countdata)
    count = count + countdata{i}{2};
    fprintf('\nCarrier Name: %s, Count: %d', countdata{i}{1}, countdata{i}{2});
end
fprintf('\n Total count : %d\n', count);

% Delete Spark Context
delete(sc)
In general, you provide MATLAB function handles or anonymous functions as input arguments to Spark RDD methods known as transformations and actions. These function handles and anonymous functions are executed on the workers of the deployed application.
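As an illustration (these particular calls are not part of this example and assume a hypothetical RDD of numeric elements named myRDD), anonymous functions are passed the same way as named helpers:
% Illustrative: anonymous functions as arguments to RDD methods
squared  = myRDD.map(@(x) x.^2);           % transformation: apply to each element
positive = myRDD.filter(@(x) x > 0);       % transformation: keep matching elements
total    = positive.reduce(@(a,b) a + b);  % action: combine elements into one value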
For a list of supported RDD transformations and actions, see Transformations and Actions in the Methods section of the RDD class.
For more information on transformations and actions, see Apache Spark Basics.
Create a standalone application.
Use the mcc command with the -m flag to create a standalone application. The -m flag creates a standard executable that can be run from a command line. The -a flag includes the dependent dataset airlinesmall.csv from the folder <matlabroot>/toolbox/matlab/demos.
The mcc command automatically picks up the dependent file carrierToCount.m as long as it is in the same work folder.
>> mcc -m deployToSparkMlApiLocal.m -a <matlabroot>/toolbox/matlab/demos/airlinesmall.csv
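If carrierToCount.m is not in the work folder, one reasonable option is to add it to the deployable archive explicitly with a second -a flag; the helper path shown here is hypothetical:
>> mcc -m deployToSparkMlApiLocal.m -a /path/to/carrierToCount.m -a <matlabroot>/toolbox/matlab/demos/airlinesmall.csv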
The mcc command creates a shell script run_deployToSparkMlApiLocal.sh to run the executable file deployToSparkMlApiLocal.
For more information, see mcc.
Run the standalone application from a Linux shell using the following command:
$ ./run_deployToSparkMlApiLocal.sh /share/MATLAB/MATLAB_Runtime/v91
/share/MATLAB/MATLAB_Runtime/v91 is an argument indicating the location of the MATLAB Runtime.
Prior to executing the above command, make sure the javaclasspath.txt file is in the same folder as the shell script and the executable. Your application will fail to execute if it cannot find the file javaclasspath.txt.
Your application may also fail to execute if the optional line containing the folder location of the Hadoop configuration files is uncommented. To execute your application on the local cluster manager, this line must be commented. This line should only be uncommented if you plan on running your application using yarn-client as your cluster manager on a Spark-enabled Hadoop cluster.
You will see the following output:
Carrier Name: 9E, Count: 521
Carrier Name: AA, Count: 14930
Carrier Name: AQ, Count: 154
Carrier Name: AS, Count: 2910
Carrier Name: B6, Count: 806
Carrier Name: CO, Count: 8138
...
...
...
Carrier Name: US, Count: 13997
Carrier Name: WN, Count: 15931
Carrier Name: XE, Count: 2357
Carrier Name: YV, Count: 849

 Total count : 123523
A yarn-client cluster manager represents a Spark-enabled Hadoop cluster. The YARN cluster manager was introduced in Hadoop 2.0 and is typically installed on the same nodes as HDFS™. Therefore, running Spark on YARN lets Spark access HDFS data easily. In applications, it is denoted using the word yarn-client.
Since the steps for deploying your application using yarn-client as your cluster manager are similar to those for the local cluster manager shown above, the steps are presented with minimal discussion. For a detailed discussion of each step, see the local case above.
You can follow the same instructions to deploy Spark applications created using the MATLAB API for Spark to Cloudera® CDH. To see an example on MATLAB Answers™, click here.
To use Cloudera CDH encryption zones, add the JAR file commons-codec-1.9.jar to the static classpath of the MATLAB Runtime.
Location of the file: $HADOOP_PREFIX/lib/commons-codec-1.9.jar, where $HADOOP_PREFIX is the location where Hadoop is installed.
Start this example by creating a new work folder that is visible to the MATLAB search path.
Install the MATLAB Runtime in a folder that is accessible by every worker node in the Hadoop cluster. This example uses /share/MATLAB/MATLAB_Runtime/v91 as the location of the MATLAB Runtime folder.
If you do not have the MATLAB Runtime, you can download it from the website at: http://www.mathworks.com/products/compiler/mcr.
Copy the file airlinesmall.csv into the Hadoop Distributed File System (HDFS) folder /user/<username>/datasets. Here, <username> refers to your username in HDFS.
$ ./hadoop fs -copyFromLocal airlinesmall.csv hdfs://host:54310/user/<username>/datasets
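To verify the copy, you can list the HDFS folder; this check is optional, and the host and port simply follow the example URL above:
$ ./hadoop fs -ls hdfs://host:54310/user/<username>/datasets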
Set the environment variable HADOOP_PREFIX to point at your Hadoop installation folder. This setting is necessary for submitting jobs to your Hadoop cluster.
setenv('HADOOP_PREFIX','/share/hadoop/hadoop-2.6.0')
The HADOOP_PREFIX environment variable must be set when using the MATLAB datastore function to point to data on HDFS. Setting this environment variable has nothing to do with Spark. See Relationship Between Spark and Hadoop for more information.
Specify Spark properties.
Use a containers.Map object to specify Spark properties.
sparkProperties = containers.Map( ...
    {'spark.executor.cores', ...
     'spark.executor.memory', ...
     'spark.yarn.executor.memoryOverhead', ...
     'spark.dynamicAllocation.enabled', ...
     'spark.shuffle.service.enabled', ...
     'spark.eventLog.enabled', ...
     'spark.eventLog.dir'}, ...
    {'1', ...
     '2g', ...
     '1024', ...
     'true', ...
     'true', ...
     'true', ...
     'hdfs://hadoop01glnxa64:54310/user/<username>/sparkdeploy'});
For more information on Spark properties, expand the prop value of the 'SparkProperties' name-value pair in the Input Arguments section of the SparkConf class.
Create a SparkConf object.
Use the class matlab.compiler.mlspark.SparkConf to create a SparkConf object.
conf = matlab.compiler.mlspark.SparkConf( ...
    'AppName', 'myApp', ...
    'Master', 'yarn-client', ...
    'SparkProperties', sparkProperties);
For more information on SparkConf, see matlab.compiler.mlspark.SparkConf.
Create a SparkContext object.
Use the class matlab.compiler.mlspark.SparkContext with the SparkConf object as an input to create a SparkContext object.
sc = matlab.compiler.mlspark.SparkContext(conf);
For more information on SparkContext, see matlab.compiler.mlspark.SparkContext.
Create an RDD object from the data.
Use the MATLAB function datastore to create a datastore object pointing to the file airlinesmall.csv in HDFS. Then use the SparkContext method datastoreToRDD to convert the datastore object to a Spark RDD object.
% Create a MATLAB datastore (HADOOP)
ds = datastore( ...
    'hdfs:///user/<username>/datasets/airlinesmall.csv', ...
    'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', 'UniqueCarrier');

% Convert MATLAB datastore to Spark RDD
rdd = sc.datastoreToRDD(ds);
In general, input RDDs can be created using the following methods of the SparkContext class: parallelize, datastoreToRDD, and textFile.
Perform operations on the RDD object.
Use a Spark RDD method such as flatMap to apply a function to all elements of the RDD object and flatten the results. The function carrierToCount that was created earlier serves as the function that is applied to the elements of the RDD. A function handle to carrierToCount is passed as an input argument to the flatMap method.
maprdd = rdd.flatMap(@carrierToCount);
redrdd = maprdd.reduceByKey( @(acc,value) acc + value );
countdata = redrdd.collect();

% Count and display carrier occurrences
count = 0;
for i = 1:numel(countdata)
    count = count + countdata{i}{2};
    fprintf('\nCarrier Code: %s, Count: %d', countdata{i}{1}, countdata{i}{2});
end
fprintf('\n Total count : %d\n', count);

% Save results to MAT file
save('countdata.mat', 'countdata');

% Delete Spark Context
delete(sc);
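Because the collected results are saved to countdata.mat, a later MATLAB session can reload and inspect them; this follow-up snippet is a minimal sketch:
% Reload saved results (assumes countdata.mat from the run above)
S = load('countdata.mat');
disp(S.countdata{1})   % first {carrier, count} pair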
For a list of supported RDD transformations and actions, see Transformations and Actions in the Methods section of the RDD class.
For more information on transformations and actions, see Apache Spark Basics.
Create a standalone application.
Use the mcc command with the -m flag to create a standalone application. The -m flag creates a standalone application that can be run from a command line. You do not need to attach the dataset airlinesmall.csv since it resides on HDFS.
The mcc command automatically picks up the dependent file carrierToCount.m as long as it is in the same work folder.
>> mcc -m deployToSparkMlApiHadoop.m
The mcc command creates a shell script run_deployToSparkMlApiHadoop.sh to run the executable file deployToSparkMlApiHadoop.
For more information, see mcc.
Run the standalone application from a Linux shell using the following command:
$ ./run_deployToSparkMlApiHadoop.sh /share/MATLAB/MATLAB_Runtime/v91
/share/MATLAB/MATLAB_Runtime/v91 is an argument indicating the location of the MATLAB Runtime.
Prior to executing the above command, make sure the javaclasspath.txt file is in the same folder as the shell script and the executable. Your application will fail to execute if it cannot find the file javaclasspath.txt.
Your application may also fail to execute if the optional line containing the folder location of the Hadoop configuration files is commented. To execute your application on a yarn-client cluster manager, this line must be uncommented. This line should only be commented if you plan on running your application using a local cluster manager.
You will see the following output:
Carrier Code: 9E, Count: 521
Carrier Code: AA, Count: 14930
Carrier Code: AQ, Count: 154
Carrier Code: AS, Count: 2910
Carrier Code: B6, Count: 806
Carrier Code: CO, Count: 8138
...
...
...
Carrier Code: US, Count: 13997
Carrier Code: WN, Count: 15931
Carrier Code: XE, Count: 2357
Carrier Code: YV, Count: 849

 Total count : 123523
If the application being deployed is a MATLAB function as opposed to a MATLAB script, use the following execution syntax:
$ ./run_<applicationName>.sh \
      <MATLAB_Runtime_Location> \
      [Spark arguments] \
      [Application arguments]
For example:
$ ./run_deployToSparkMlApiHadoop.sh \
      /usr/local/MATLAB/MATLAB_Runtime/v91 \
      yarn-client \
      hdfs://host:54310/user/<username>/datasets/airlinesmall.csv \
      hdfs://host:54310/user/<username>/result