Hadoop cluster for mapreducer, mapreduce, and tall arrays
A parallel.cluster.Hadoop object provides access to a cluster for configuring mapreducer, mapreduce, and tall arrays.
A parallel.cluster.Hadoop object has the following properties.
Property | Description |
---|---|
AdditionalPaths | Paths to be added to MATLAB command search path on workers |
AttachedFiles | Files transferred to the workers during a mapreduce call |
AutoAttachFiles | Specifies whether to automatically attach files |
ClusterMatlabRoot | Specifies path to MATLAB for workers to use |
HadoopConfigurationFile | Application configuration file to be given to Hadoop |
HadoopInstallFolder | Installation location of Hadoop on the local machine |
HadoopProperties | Map of name-value property pairs to be given to Hadoop |
LicenseNumber | License number to use with MathWorks hosted licensing |
RequiresMathWorksHostedLicensing | Specifies whether the cluster uses MathWorks hosted licensing |
SparkInstallFolder | Installation location of Spark on the local machine |
SparkProperties | Map of name-value property pairs to be given to Spark |
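For example, the following sketch creates a Hadoop cluster object, sets some of these properties, and uses it with mapreducer. The folder paths are placeholders for your own environment, not defaults.
% Create the cluster object and point it at the local Hadoop and Spark
% installations. The paths below are example values only; replace them
% with the locations used on your machines.
cluster = parallel.cluster.Hadoop;
cluster.HadoopInstallFolder = '/usr/local/hadoop';
cluster.SparkInstallFolder = '/usr/local/spark';
cluster.ClusterMatlabRoot = '/usr/local/MATLAB';
% Use this cluster for subsequent mapreduce and tall array evaluation.
mapreducer(cluster);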
HadoopProperties allows you to override configuration properties for Hadoop. See the list of properties in the Hadoop® documentation.
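For example, the following sketch overrides one Hadoop property, assuming HadoopProperties uses the same name-value indexing as the SparkProperties examples later on this page; 'mapreduce.map.memory.mb' is shown purely as an illustration of a standard Hadoop property.
% Example only: override a Hadoop configuration property by name.
cluster = parallel.cluster.Hadoop;
cluster.HadoopProperties('mapreduce.map.memory.mb') = '2048';
mapreducer(cluster);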
By default, SparkInstallFolder is set to the value of the SPARK_HOME environment variable. This property is required for tall array evaluation on Hadoop, but not for mapreduce. For a correctly configured cluster, you only need to set the installation folder.
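If SPARK_HOME is not set in your environment, you can set the installation folder on the object directly, as in this sketch; the path is an example, not a default.
% Example only: point the cluster object at the Spark installation.
cluster = parallel.cluster.Hadoop;
cluster.SparkInstallFolder = '/usr/local/spark';
mapreducer(cluster);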
SparkProperties allows you to override configuration properties for Spark. See the list of properties in the Spark® documentation.
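For example, the following sketch overrides a single Spark property; 'spark.executor.cores' is shown purely as an illustration of a standard Spark property. More complete examples of overriding the Spark memory properties appear later on this page.
% Example only: override a Spark configuration property by name.
cluster = parallel.cluster.Hadoop;
cluster.SparkProperties('spark.executor.cores') = '4';
mapreducer(cluster);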
For further help, type:
help parallel.cluster.Hadoop
Spark-enabled Hadoop clusters place limits on how much memory is available. You must adjust these limits to support your workflow.
The amount of data gathered to the client is limited by the Spark properties:
spark.driver.memory
spark.executor.memory
The amount of data gathered from a single Spark task must fit within these memory limits. A single Spark task processes one block of data from HDFS, which is 128 MB by default. If you gather a tall array containing most of the original data, you must ensure that these properties are set large enough to hold it.
If these properties are set too small, you see an error like the following.
Error using tall/gather (line 50)
Out of memory; unable to gather a partition of size 300m from Spark.
Adjust the values of the Spark properties spark.driver.memory and spark.executor.memory to fit this partition.
Adjust the properties either in the default settings of the cluster or directly in MATLAB. To adjust the properties in MATLAB, add name-value pairs to the SparkProperties property of the cluster. For example:
cluster = parallel.cluster.Hadoop;
cluster.SparkProperties('spark.driver.memory') = '2048m';
cluster.SparkProperties('spark.executor.memory') = '2048m';
mapreducer(cluster);
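As a usage sketch, once the cluster is the current mapreducer you can build and gather a tall array from data stored in HDFS. The HDFS location below is hypothetical; gathered results must fit within the memory set by spark.driver.memory and spark.executor.memory.
% Hypothetical example: the HDFS location is a placeholder.
ds = datastore('hdfs:///data/example/*.csv');
tt = tall(ds);
% gather executes the deferred operations on the cluster and returns
% the result to the client.
result = gather(head(tt));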
The amount of working memory for a MATLAB Worker is limited by the Spark property:
spark.yarn.executor.memoryOverhead
By default, this is set to 2.5 GB. You typically need to increase this if you use arrayfun, cellfun, or custom datastores to generate large amounts of data in one go. It is advisable to increase this if you come across lost or crashed Spark Executor processes.
You can adjust these properties either in the default settings of the cluster or directly in MATLAB. To adjust the properties in MATLAB, add name-value pairs to the SparkProperties property of the cluster. For example:
cluster = parallel.cluster.Hadoop;
cluster.SparkProperties('spark.yarn.executor.memoryOverhead') = '4096m';
mapreducer(cluster);