If you have Parallel Computing Toolbox™, you can use tall arrays in your local MATLAB® session, or on a local parallel pool. You can also run tall array calculations on a cluster if you have MATLAB Distributed Computing Server™ installed. This example uses the workers in a local cluster on your machine. You can develop code locally, and then scale up, to take advantage of the capabilities offered by Parallel Computing Toolbox and MATLAB Distributed Computing Server without having to rewrite your algorithm. See also Big Data Workflow Using Tall Arrays and Datastores.
Create a datastore and convert it into a tall table.
ds = datastore('airlinesmall.csv');
varnames = {'ArrDelay', 'DepDelay'};
ds.SelectedVariableNames = varnames;
ds.TreatAsMissing = 'NA';
If you have Parallel Computing Toolbox installed, when you use the tall function, MATLAB automatically starts a parallel pool of workers, unless you turn off the default parallel pool preference. The default cluster uses local workers on your machine.
If you want to turn off automatically opening a parallel pool, change your parallel preferences. If you turn off the Automatically create a parallel pool option, then you must explicitly start a pool if you want the tall function to use it for parallel processing. See Specify Your Parallel Preferences.
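As a minimal sketch of starting a pool explicitly, the following checks for an existing pool and opens one only if none is running. It assumes Parallel Computing Toolbox is installed and that the default cluster profile is the one you want; the variable name pool is illustrative.

```matlab
% Look for an existing pool without creating one as a side effect.
pool = gcp('nocreate');
if isempty(pool)
    % No pool is running, so start one using the default cluster profile.
    pool = parpool;
end
fprintf('Pool has %d workers.\n', pool.NumWorkers);
```

Subsequent calls to tall then use this pool rather than creating one.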
If you have Parallel Computing Toolbox, you can run the same code as the MATLAB tall table example (MATLAB) and automatically execute it in parallel on the workers of your local machine.
Create a tall table tt from the datastore.
tt = tall(ds)
Starting parallel pool (parpool) using the 'local' profile ...
connected to 4 workers.

tt =

  M×2 tall table

    ArrDelay    DepDelay
    ________    ________

        8          12
        8           1
       21          20
       13          12
        4          -1
       59          63
        3          -2
       11          -1
        :          :
        :          :
The display indicates that the number of rows, M, is not yet known. M is a placeholder until the calculation completes.
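If you need the actual row count, one way to obtain it is to request the first dimension with size and evaluate it with gather. This sketch repeats the datastore setup so that it runs on its own; the variable name numRows is illustrative, and the evaluation requires a pass through the data.

```matlab
% Recreate the datastore and tall table from the steps above.
ds = datastore('airlinesmall.csv');
ds.SelectedVariableNames = {'ArrDelay', 'DepDelay'};
ds.TreatAsMissing = 'NA';
tt = tall(ds);

% size returns an unevaluated tall scalar; gather evaluates it
% and brings the row count into memory.
numRows = gather(size(tt, 1));
```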
Extract the arrival delay ArrDelay from the tall table. This action creates a new tall array variable to use in subsequent calculations.
a = tt.ArrDelay;
You can specify a series of operations on your tall array, which are not executed until you call gather. Doing so enables you to batch up commands that might take a long time. For example, calculate the mean and standard deviation of the arrival delay. Use these values to construct the upper and lower thresholds for delays that are within 1 standard deviation of the mean.
m = mean(a,'omitnan');
s = std(a,'omitnan');
one_sigma_bounds = [m-s m m+s];
Use gather to calculate one_sigma_bounds, and bring the answer into memory.
sig1 = gather(one_sigma_bounds)
Evaluating tall expression using the Parallel Pool 'local':
Evaluation completed in 0 sec

sig1 =

  -23.4572    7.1201   37.6975
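The thresholds can also be applied to the data before anything is gathered. The following sketch, which is not part of the original example, estimates the fraction of non-missing arrival delays that fall within one standard deviation of the mean. The names in_bounds and frac_within are illustrative, and the setup is repeated so that the block runs on its own.

```matlab
% Recreate the tall array of arrival delays.
ds = datastore('airlinesmall.csv');
ds.SelectedVariableNames = {'ArrDelay'};
ds.TreatAsMissing = 'NA';
tt = tall(ds);
a = tt.ArrDelay;

% These statements only build up deferred operations.
m = mean(a, 'omitnan');
s = std(a, 'omitnan');

% Tall logical array; comparisons against NaN evaluate to false.
in_bounds = (a >= m - s) & (a <= m + s);

% Divide by the count of non-missing delays, then evaluate once.
frac_within = gather(sum(in_bounds) / sum(~isnan(a)));
```

Because nothing is evaluated until the single gather call, MATLAB can plan the required passes through the data together.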
You can specify multiple inputs and outputs to gather if you want to evaluate several things at once. Doing so is faster than calling gather separately on each tall array. As an example, calculate the minimum and maximum arrival delay.
[max_delay, min_delay] = gather(max(a),min(a))
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 1 sec
Evaluation completed in 1 sec

max_delay =

        1014

min_delay =

   -64
If you want to develop in serial and not use local workers or your specified cluster, enter the following command.
mapreducer(0);
If you use mapreducer to change the execution environment after creating a tall array, then the tall array is invalid and you must recreate it. To use local workers or your specified cluster again, enter the following command.
mapreducer(gcp);
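As a sketch of what recreating the tall array looks like in practice, the following switches to serial execution and then rebuilds the tall table and the derived array before evaluating anything. It reuses the datastore settings from this example; the variable names are illustrative.

```matlab
% Switch the execution environment to the local MATLAB session.
mapreducer(0);

% Tall arrays created before the switch are invalid, so rebuild
% them from the datastore.
ds = datastore('airlinesmall.csv');
ds.SelectedVariableNames = {'ArrDelay', 'DepDelay'};
ds.TreatAsMissing = 'NA';
tt = tall(ds);
a = tt.ArrDelay;

% Evaluate in serial, without using pool workers.
max_delay = gather(max(a));
```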
One of the benefits of developing algorithms with tall arrays is that you only need to write the code once. You can develop your code locally, and then use mapreducer to scale up to a cluster, without needing to rewrite your algorithm. For an example, see Use Tall Arrays on a Spark Enabled Hadoop Cluster.
datastore | gather | mapreducer | parpool | table | tall