Depending on how your data fits in memory, choose one of the following methods:

If your data is currently in the memory of your local machine, you can use the distributed function to distribute an existing array from the client workspace to the workers of a parallel pool. This option is useful for testing, or before performing operations that significantly increase the size of your arrays, such as repmat.

If your data does not fit in the memory of your local machine, but does fit in the memory of your cluster, you can use datastore with the distributed function to read data into the memory of the workers of a parallel pool.

If your data does not fit in the memory of your cluster, you can use datastore with tall arrays to partition and process your data in chunks, as sketched below. See also Big Data Workflow Using Tall Arrays and Datastores.
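For the third option, a minimal sketch of the tall workflow might look like the following, reusing the airline data file from the example below (the variable selection is an arbitrary choice for illustration):

ds = tabularTextDatastore('airlinesmall.csv'); % datastore backed by the file
ds.SelectedVariableNames = {'DepDelay'};
ds.TreatAsMissing = 'NA';
tt = tall(ds); % tall table; the data stays in the datastore
m = mean(tt.DepDelay,'omitnan'); % deferred computation on the tall variable
m = gather(m) % evaluates in chunks and returns the result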
If your data does not fit in the memory of your local machine, but does fit in the memory of your cluster, you can use datastore with the distributed function to create distributed arrays and partition the data among your workers.

This example shows how to create and load distributed arrays using datastore. Create a datastore using a tabular file of airline flight data. This data set is too small to show equal partitioning of the data over the workers. To simulate a large data set, artificially increase the size of the datastore using repmat.
files = repmat({'airlinesmall.csv'}, 10, 1);
ds = tabularTextDatastore(files);
Select the example variables.
ds.SelectedVariableNames = {'DepTime','DepDelay'};
ds.TreatAsMissing = 'NA';
Create a distributed table by reading the datastore in parallel. Partition the datastore with one partition per worker. Each worker then reads all data from the corresponding partition. The files must be in a shared location that is accessible by the workers.
dt = distributed(ds);
Starting parallel pool (parpool) using the 'local' profile ... connected to 4 workers.
Display summary information about the distributed table.
summary(dt)
Variables:

    DepTime: 1,235,230×1 double
        Values:
            min          1
            max       2505
            NaNs    23,510

    DepDelay: 1,235,230×1 double
        Values:
            min      -1036
            max       1438
            NaNs    23,510
Determine the size of the distributed table.
size(dt)
ans =

     1235230           2
Return the first few rows of dt.
head(dt)
ans =

    DepTime    DepDelay
    _______    ________

      642         12
     1021          1
     2055         20
     1332         12
      629         -1
     1446         63
      928         -2
      859         -1
     1833          3
     1041          1
Finally, check how much data each worker has loaded.
spmd, dt, end
Lab 1:
  This worker stores dt(1:370569,:).
          LocalPart: [370569×2 table]
      Codistributor: [1×1 codistributor1d]
Lab 2:
  This worker stores dt(370570:617615,:).
          LocalPart: [247046×2 table]
      Codistributor: [1×1 codistributor1d]
Lab 3:
  This worker stores dt(617616:988184,:).
          LocalPart: [370569×2 table]
      Codistributor: [1×1 codistributor1d]
Lab 4:
  This worker stores dt(988185:1235230,:).
          LocalPart: [247046×2 table]
      Codistributor: [1×1 codistributor1d]
Note that the data is partitioned approximately evenly over the workers. Each partition contains whole files, so workers 1 and 3 each store three copies of the file (370,569 rows) while workers 2 and 4 each store two (247,046 rows).
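As a quick check that the distributed table behaves like an ordinary in-memory table, you can operate on it and gather the result back to the client. This is a minimal sketch; it assumes that mean with the 'omitnan' flag is supported for this distributed data:

mdelay = mean(dt.DepDelay,'omitnan'); % computed in parallel on the workers
mdelay = gather(mdelay) % bring the scalar result back to the client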
For more details on datastore, see What Is a Datastore? (MATLAB). For more details about workflows for big data, see Choose a Parallel Computing Solution.
If your data fits in the memory of your local machine, you can use distributed arrays to partition the data among your workers. Use the distributed function to create a distributed array in the MATLAB client, and store its data on the workers of the open parallel pool. A distributed array is distributed in one dimension, and as evenly as possible along that dimension among the workers. You cannot control the details of distribution when creating a distributed array.
You can create a distributed array in several ways:

Use the distributed function to distribute an existing array from the client workspace to the workers of a parallel pool.

Use any of the distributed construction functions to directly construct a distributed array on the workers, as sketched after the example below. This technique does not require that the array already exists in the client, thereby reducing client workspace memory requirements. Functions include eye(___,'distributed') and rand(___,'distributed'). For a full list, see the distributed object reference page.

Create a codistributed array inside an spmd statement, and then access it as a distributed array outside the spmd statement. This technique lets you use distribution schemes other than the default.

The first two techniques do not involve spmd in creating the array, but you can use spmd to manipulate arrays created this way. For example:
Create an array in the client workspace, and then make it a distributed array.
parpool('local',2) % Create pool
W = ones(6,6);
W = distributed(W); % Distribute to the workers
spmd
    T = W*2; % Calculation performed on workers, in parallel.
             % T and W are both codistributed arrays here.
end
T % View results in client.
whos % T and W are both distributed arrays here.
delete(gcp) % Stop pool
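The second technique, constructing the array directly on the workers, can be as simple as the following sketch (the pool and matrix sizes are arbitrary choices):

parpool('local',2) % Create pool
D = rand(1000,'distributed'); % 1000-by-1000 array built directly on the workers
class(D) % 'distributed'
delete(gcp) % Stop pool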
Alternatively, you can use the codistributed function, which allows you to control more options such as dimensions and partitions, but is often more complicated. You can create a codistributed array by executing on the workers themselves, either inside an spmd statement, in pmode, or inside a communicating job. When creating a codistributed array, you can control all aspects of distribution, including dimensions and partitions.
The relationship between distributed and codistributed arrays is one of perspective. Codistributed arrays are partitioned among the workers from which you execute code to create or manipulate them. When you create a distributed array in the client, you can access it as a codistributed array inside an spmd statement. When you create a codistributed array in an spmd statement, you can access it as a distributed array in the client. Only spmd statements let you access the same array data from two different perspectives.
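For instance, the following minimal sketch creates a distributed array in the client and inspects its codistributed form inside spmd (the pool size is an arbitrary choice):

parpool('local',2) % Create pool
D = distributed(magic(8)); % distributed array, created in the client
spmd
    % Inside spmd, D is accessed as a codistributed array.
    L = getLocalPart(D); % the portion of D stored on this worker
    size(L)
end
delete(gcp) % Stop pool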
You can create a codistributed array in several ways:

Use the codistributed function inside an spmd statement, a communicating job, or pmode to codistribute data already existing on the workers running that job, as sketched after the example below.

Use any of the codistributed construction functions to directly construct a codistributed array on the workers. This technique does not require that the array already exists in the workers. Functions include eye(___,'codistributed') and rand(___,'codistributed'). For a full list, see the codistributed object reference page.

Create a distributed array outside an spmd statement, then access it as a codistributed array inside the spmd statement running on the same parallel pool.
Create a codistributed array inside an spmd statement using a nondefault distribution scheme. First, define a 1-D distribution along the third dimension, with 4 parts on worker 1 and 12 parts on worker 2. Then create a 3-by-3-by-16 array of zeros.
parpool('local',2) % Create pool
spmd
    codist = codistributor1d(3,[4,12]);
    Z = zeros(3,3,16,codist);
    Z = Z + labindex;
end
Z % View results in client.
  % Z is a distributed array here.
delete(gcp) % Stop pool
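The first technique in the list above, codistributing data that already exists on the workers, might look like this minimal sketch; by default, the data is partitioned along the last nonsingleton dimension:

parpool('local',2) % Create pool
spmd
    A = magic(8); % identical replicated array on every worker
    C = codistributed(A); % partition A among the workers, default scheme
    getLocalPart(C) % the columns of A stored on this worker
end
delete(gcp) % Stop pool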
For more details on codistributed arrays, see Working with Codistributed Arrays.
codistributed | datastore | distributed | eye | rand | repmat | spmd | tall