Tall Arrays and Mapreduce

Analyze big data sets in parallel using MATLAB® tall arrays and datastores, or mapreduce, on parallel pools and on Spark® and Hadoop® clusters.

You can use Parallel Computing Toolbox™ to evaluate tall-array expressions in parallel using a parallel pool on your desktop. Tall arrays let you work with data sets that are too large to fit in memory on your machine. You can also use Parallel Computing Toolbox to scale up tall-array processing by connecting to a parallel pool running on a MATLAB Distributed Computing Server™ cluster. Alternatively, you can use a Spark-enabled Hadoop cluster running MATLAB Distributed Computing Server. For more information, see Big Data Workflow Using Tall Arrays and Datastores.
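As a minimal sketch of the desktop workflow described above, the following code evaluates a tall-array expression on a local parallel pool. It assumes that Parallel Computing Toolbox is installed, that the default cluster profile starts workers on the local machine, and that the sample file airlinesmall.csv shipped with MATLAB is on the path. The choice of the ArrDelay variable is for illustration only.

```matlab
% Assumption: Parallel Computing Toolbox is installed and the default
% cluster profile starts workers on the local machine.
pool = gcp('nocreate');          % reuse a pool if one is already open
if isempty(pool)
    pool = parpool;              % otherwise start one with the default profile
end

% airlinesmall.csv is a sample data set shipped with MATLAB.
% Treat 'NA' entries as missing so that ArrDelay is read as numeric.
ds = datastore('airlinesmall.csv', ...
    'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay'});

tt = tall(ds);                   % tall table backed by the datastore

% Operations on tall arrays are deferred until gather is called.
meanDelay = mean(tt.ArrDelay, 'omitnan');
meanDelay = gather(meanDelay);   % triggers evaluation and returns an in-memory value
```

Because evaluation is deferred, it is usually worth building up the full expression first and calling gather once, so that the data is read as few times as possible.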

Functions

tall             Create tall array
datastore        Create datastore for large collections of data
mapreduce        Programming technique for analyzing data sets that do not fit in memory
mapreducer       Define parallel execution environment for mapreduce and tall arrays
partition        Partition a datastore
numpartitions    Number of datastore partitions
parpool          Create parallel pool on cluster
gcp              Get current parallel pool

Classes

parallel.Pool              Access parallel pool
parallel.cluster.Hadoop    Hadoop cluster for mapreducer, mapreduce, and tall arrays

Examples and How To

Big Data Workflow Using Tall Arrays and Datastores

Learn about typical workflows using tall arrays to analyze big data sets.

Use Tall Arrays on a Parallel Pool

Discover tall arrays in Parallel Computing Toolbox and MATLAB Distributed Computing Server.

Use Tall Arrays on a Spark Enabled Hadoop Cluster

Create and use tall tables on Spark clusters without changing your MATLAB code.

Run mapreduce on a Parallel Pool

Try mapreduce for advanced analysis of big data using Parallel Computing Toolbox.

Run mapreduce on a Hadoop Cluster

Learn about mapreduce for advanced big data analysis on a Hadoop cluster.

Partition a Datastore in Parallel

Use partition to split your datastore into smaller parts.

Concepts

Run Code on Parallel Pools

Learn about starting and stopping parallel pools, pool size, and cluster selection.

Specify Your Parallel Preferences

Specify your preferences, and automatically create a parallel pool.

Discover Clusters and Use Cluster Profiles

Find out how to work with cluster profiles and discover cloud clusters running on Amazon EC2.

Related Information

Tall Arrays (MATLAB)

MapReduce (MATLAB)

Datastore (MATLAB)
