The combined size of all attached files for a job is limited to 4 GB.
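As a rough check before submitting, the combined size of the files you plan to attach can be totaled on the client. The following is a minimal sketch, not part of the toolbox: attachedFilesSize is a hypothetical helper name, it handles plain files only (not folders), and it assumes that "4 GB" means 4 × 2^30 bytes, which the statement above does not specify.

```matlab
function [totalBytes, withinLimit] = attachedFilesSize(files)
% attachedFilesSize  Hypothetical helper: total the size of the files
% intended for a job's AttachedFiles property.
%   files is a cell array of file names (plain files only, no folders).
limitBytes = 4 * 2^30;   % ASSUMPTION: "4 GB" interpreted as 4 * 2^30 bytes
totalBytes = 0;
for k = 1:numel(files)
    d = dir(files{k});   % returns an empty struct if the file is missing
    if isempty(d)
        error('attachedFilesSize:notFound', 'Cannot find file %s', files{k});
    end
    totalBytes = totalBytes + sum([d.bytes]);
end
withinLimit = totalBytes <= limitBytes;
end
```

For example, [n, ok] = attachedFilesSize(job.AttachedFiles) returns the total in bytes and whether it falls under the assumed limit.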
By default, a worker on a Windows® operating system is installed as a service running as LocalSystem, so it does not have access to mapped network drives.
Often a network is configured to not allow services running as LocalSystem to access UNC or mapped network shares. In this case, you must run the mdce service under a different user with rights to log on as a service. See the section Set the User (MATLAB Distributed Computing Server) in the MATLAB® Distributed Computing Server™ System Administrator's Guide.
If a worker cannot find the task function, it returns the error message
    Error using ==> feval
    Undefined command/function 'function_name'.
The worker that ran the task did not have access to the function function_name.
One solution is to make sure the location of the function’s file, function_name.m, is included in the job’s AdditionalPaths property.
Another solution is to transfer the function file to the worker by adding function_name.m to the AttachedFiles property of the job.
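The two solutions can be sketched as follows. This is an illustration under stated assumptions: the folder '/shared/mycode' is a hypothetical path, function_name is assumed to take no inputs and return one output, and the job uses the default cluster profile. Normally you would choose only one of the two options; both appear here for comparison.

```matlab
c = parcluster();                 % cluster from the default profile
job = createJob(c);

% Option 1: the function file lives in a folder the workers can reach.
% '/shared/mycode' is a HYPOTHETICAL path; substitute your own.
job.AdditionalPaths = {'/shared/mycode'};

% Option 2: send the file from the client along with the job.
% The file must be findable on the client when the job is submitted.
job.AttachedFiles = {'function_name.m'};

% ASSUMPTION: function_name takes no inputs and returns one output.
createTask(job, @function_name, 1, {});
submit(job);
wait(job);
results = fetchOutputs(job);
```

AdditionalPaths avoids copying files but requires that the workers can see the folder. AttachedFiles works without a shared file system, at the cost of transferring the file with each job.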
If a worker cannot save or load a file, you might see the error messages
    ??? Error using ==> save
    Unable to write file myfile.mat: permission denied.
    ??? Error using ==> load
    Unable to read file myfile.mat: No such file or directory.
In determining the cause of this error, consider the following questions:
What is the worker’s current folder?
Can the worker find the file or folder?
What user is the worker running as?
Does the worker have permission to read or write the file in question?
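One way to answer these questions is to run a small diagnostic task on the worker itself. The function below is a hedged sketch: workerFileInfo is a hypothetical name, the user name is read from the USERNAME or USER environment variable (which might be empty for some service accounts), and write access is tested by creating and deleting a scratch file in the target folder.

```matlab
function info = workerFileInfo(filename)
% workerFileInfo  Hypothetical diagnostic task function: report the
% worker's current folder, the user it runs as, and whether it can find
% and read FILENAME and write to the folder that should contain it.
info.CurrentFolder = pwd;

% The environment variable holding the user name differs by platform.
if ispc
    info.User = getenv('USERNAME');
else
    info.User = getenv('USER');
end

% exist returns 2 for a file and 7 for a folder; 0 means not found.
info.Found = exist(filename, 'file') ~= 0;

% Try to open the file for reading.
fid = fopen(filename, 'r');
info.CanRead = (fid ~= -1);
if info.CanRead
    fclose(fid);
end

% Try to create, then delete, a uniquely named scratch file in the
% folder that contains (or should contain) the file.
folder = fileparts(filename);
if isempty(folder)
    folder = pwd;
end
[~, scratchName] = fileparts(tempname);
probe = fullfile(folder, [scratchName '.tmp']);
fid = fopen(probe, 'w');
info.CanWriteFolder = (fid ~= -1);
if info.CanWriteFolder
    fclose(fid);
    delete(probe);
end
end
```

To use it, attach workerFileInfo.m to a job through the AttachedFiles property, create a task with createTask(job, @workerFileInfo, 1, {'myfile.mat'}), and inspect the structure returned by fetchOutputs after the job finishes.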
A job or task might get stuck in the queued state. To investigate the cause of this problem, look for the scheduler’s logs:
Platform LSF® schedulers might send emails with error messages.
Microsoft® Windows HPC Server (including CCS), LSF®, PBS Pro®, and TORQUE save output messages in a debug log. See the getDebugLog reference page.
If using a generic scheduler, make sure the submit function redirects error messages to a log file.
Possible causes of the problem are:
The MATLAB worker failed to start because of licensing errors, because the MATLAB executable is not on the default path on the worker machine, or because MATLAB is not installed in the location where the scheduler expected it to be.
MATLAB could not read or write the job input and output files in the scheduler’s job storage location. The storage location might not be accessible to all the worker nodes, or the user that MATLAB runs as might not have permission to read or write the job files.
If using a generic scheduler:
The environment variable MDCE_DECODE_FUNCTION was not defined before the MATLAB worker started.
The decode function was not on the worker’s path.
If your job returned no results (i.e., fetchOutputs(job) returns an empty cell array), it is probable that the job failed and some of its tasks have their Error properties set.
You can use the following code to identify tasks with error messages:
    errmsgs = get(yourjob.Tasks, {'ErrorMessage'});
    nonempty = ~cellfun(@isempty, errmsgs);
    celldisp(errmsgs(nonempty));
This code displays the nonempty error messages of the tasks found in the job object yourjob.
If you are using a supported third-party scheduler, you can use the getDebugLog function to read the debug log from the scheduler for a particular job or task.
For example, find the failed job on your LSF scheduler, and read its debug log:
    c = parcluster('my_lsf_profile')
    failedjob = findJob(c, 'State', 'failed');
    message = getDebugLog(c, failedjob(1))
For testing connectivity between the client machine and the machines of your compute cluster, you can use Admin Center. For more information about Admin Center, including how to start it and how to test connectivity, see Start Admin Center (MATLAB Distributed Computing Server) and Test Connectivity (MATLAB Distributed Computing Server).
Detailed instructions for other methods of diagnosing connection problems between the client and MJS can be found in some of the Bug Reports listed on the MathWorks Web site.
The following sections can help you identify the general nature of some connection problems.
If you cannot locate or connect to your MJS with parcluster, the most likely reasons for this failure are:
The MJS is currently not running.
Firewalls do not allow traffic from the client to the MJS.
The client and the MJS are not running the same version of the software.
The client and the MJS cannot resolve each other’s short hostnames.
The MJS is using a nondefault BASE_PORT setting as defined in the mdce_def file, and the Host property in the cluster profile does not specify this port.
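The hostname-resolution cause in the list above can be checked from the client side with a short helper. This is a sketch, not a toolbox function: canResolveHost is a hypothetical name, and it relies on the Java class java.net.InetAddress, which is available only when MATLAB runs with its Java Virtual Machine. It tests only whether the machine running the code can resolve the name; whether the MJS host can resolve the client must be checked separately on that host.

```matlab
function [ok, address] = canResolveHost(hostname)
% canResolveHost  Hypothetical helper: check whether this machine can
% resolve HOSTNAME to an IP address. Requires MATLAB to be running with
% its Java Virtual Machine (that is, not started with -nojvm).
try
    addr = java.net.InetAddress.getByName(hostname);
    address = char(addr.getHostAddress());
    ok = true;
catch
    % Resolution failed (for example, java.net.UnknownHostException).
    address = '';
    ok = false;
end
end
```

For example, [ok, ip] = canResolveHost('mjshost'), where 'mjshost' stands in for the short hostname of your MJS machine, returns ok as false if the client cannot resolve that name.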
If a warning message says that the MJS cannot open a TCP connection to the client computer, the most likely reasons for this are:
Firewalls do not allow traffic from the MJS to the client.
The MJS cannot resolve the short hostname of the client computer. Use pctconfig to change the hostname that the MJS will use for contacting the client.
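A minimal sketch of the pctconfig approach follows. The hostname 'myclient.example.com' and the profile name 'myMJSProfile' are placeholders, not values from this document; substitute a client name that the MJS can resolve and your own cluster profile.

```matlab
% ASSUMPTION: 'myclient.example.com' and 'myMJSProfile' are placeholders
% for a client name the MJS can resolve and for your own cluster profile.
pctconfig('hostname', 'myclient.example.com');

% Set the hostname before contacting the cluster, so that the setting is
% already in effect when the connection is made.
c = parcluster('myMJSProfile');
```

A fully qualified domain name or an IP address is typically a safer choice here than a short hostname, because it does not depend on the MJS host's DNS search settings.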
The example code for generic schedulers with non-shared file systems contacts an sftp server to handle the file transfer to and from the cluster’s file system. This use of sftp is subject to all the normal sftp vulnerabilities. One problem that can occur results in an error message similar to this:
    Caused by:
        Error using ==> RemoteClusterAccess>RemoteClusterAccess.waitForChoreToFinishOrError at 780
        The following errors occurred in the com.mathworks.toolbox.distcomp.clusteraccess.UploadFilesChore:
        Could not send Job3.common.mat for job 3:
        One of your shell's init files contains a command that is writing to stdout, interfering with sftp.
        Access help com.mathworks.toolbox.distcomp.remote.spi.plugin.SftpExtraBytesFromShellException:
        One of your shell's init files contains a command that is writing to stdout, interfering with sftp.
        Find and wrap the command with a conditional test, such as

            if ($?TERM != 0) then
                if ("$TERM" != "dumb") then
                    /your command/
                endif
            endif

        : 4: Received message is too long: 1718579037
The telling symptom is the phrase "Received message is too long:" followed by a very large number.
The sftp server starts a shell, usually bash or tcsh, to set your standard read and write permissions appropriately before transferring files. The server initializes the shell in the standard way, calling files like .bashrc and .cshrc. This problem happens if your shell emits text to standard out when it starts. That text is transferred back to the sftp client running inside MATLAB, and is interpreted as the size of the sftp server's response message.
To work around this error, locate the shell startup file code that is emitting the text, and either remove it or bracket it within if statements to see if the sftp server is starting the shell:
    if ($?TERM != 0) then
        if ("$TERM" != "dumb") then
            /your command/
        endif
    endif
You can test this outside of MATLAB with a standard UNIX or Windows sftp command-line client before trying again in MATLAB. If the problem is not fixed, the error message persists:
    > sftp yourSubmitMachine
    Connecting to yourSubmitMachine...
    Received message too long 1718579042
If the problem is fixed, you should see:
    > sftp yourSubmitMachine
    Connecting to yourSubmitMachine...