Chapter 6 Error Messages and Troubleshooting
This chapter describes the error messaging procedures of the grid engine system and
offers tips on how to resolve various common problems.
How the Software Retrieves Error Reports
The grid engine software reports errors
and warnings by logging messages into certain files or by sending email, or both.
The log files include message files and job STDERR output.
As soon as a job is started, the standard error
(STDERR) output of the job script is redirected to a file. The
default file name and location are used, or you can specify the filename and the location
with certain options of the qsub command. See the grid engine system man
pages for detailed information.
Separate messages
files exist for the sge_qmaster, the sge_schedd,
and the sge_execds. The files have the same file name: messages. The sge_qmaster log file resides in the master
spool directory. The sge_schedd message file resides in the scheduler
spool directory. The execution daemons' log files reside in the spool directories
of the execution daemons. See Spool Directories Under the Root Directory in N1 Grid Engine 6 Installation Guide for more information
about the spool directories.
Each message takes up a single line in the files. Each message is subdivided
into five components separated by the vertical bar sign (|).
The components
of a message are as follows:
- The first component is a time stamp for the message.
- The second component specifies the grid engine system daemon that generates the message.
- The third component is the name of the host where the daemon runs.
- The fourth component is a message type. The message type is one of the following:
  - N for notice – for informational purposes
  - I for info – for informational purposes
  - W for warning
  - E for error – an error condition has been detected
  - C for critical – can lead to a program abort
  Use the loglevel parameter in the cluster configuration to specify on a global basis or a local basis what message types you want to log.
- The fifth component is the message text.
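The five-component layout described above can be split with standard tools. The following is a minimal sketch, not part of the grid engine distribution: the function name parse_sge_message is hypothetical, and the sample line is the execd warning quoted later in this chapter. It assumes that only the message text can contain further vertical bars.

```shell
# Split one messages-file line into its five components.
parse_sge_message() {
    printf '%s\n' "$1" | awk -F'|' '{
        # The message text may itself contain "|", so rejoin fields 5..NF.
        text = $5
        for (i = 6; i <= NF; i++) text = text "|" $i
        printf "time=%s\ndaemon=%s\nhost=%s\ntype=%s\ntext=%s\n", $1, $2, $3, $4, text
    }'
}

parse_sge_message 'Tue Jan 23 21:20:46 2001|execd|meta|W|local configuration meta not defined - using global configuration'
```

Filtering on the fourth field (for example, keeping only W, E, and C lines) is then a matter of testing the type value.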
Note –
If an error log file is not accessible for some reason, the grid engine system tries
to log the error message to the files /tmp/sge_qmaster_messages, /tmp/sge_schedd_messages, or /tmp/sge_execd_messages on
the corresponding host.
In some circumstances, the grid engine system notifies users, administrators,
or both, about error events by email. The email messages sent by the grid engine system do
not contain a message body. The message text is fully contained in the mail subject
field.
Consequences of Different Error or Exit Codes
The following table lists the consequences of different job-related error codes
or exit codes. These codes are valid for every type of job.
Table 6–1 Job-Related Error or Exit Codes
| Script/Method | Exit or Error Code | Consequence                           |
| Job script    | 0                  | Success                               |
|               | 99                 | Requeue                               |
|               | Rest               | Success: exit code in accounting file |
| prolog/epilog | 0                  | Success                               |
|               | 99                 | Requeue                               |
|               | Rest               | Queue error state, job requeued       |
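As an illustration of the job script rows in Table 6–1, a job script can request a requeue by ending with exit status 99. The sketch below is hypothetical: the function name run_or_requeue and the idea of requeueing on a missing input file are assumptions chosen for the example, and the wc call only stands in for the real work of the job.

```shell
# Hypothetical job-script helper: return 99 (requeue) when a required input
# file is not readable, otherwise run the payload and pass its status through.
run_or_requeue() {
    input_file=$1
    if [ ! -r "$input_file" ]; then
        echo "input $input_file not readable, requesting requeue" >&2
        return 99
    fi
    # Placeholder for the actual job payload.
    wc -l < "$input_file"
}

# In a real job script, the last lines would be something like:
#   run_or_requeue /path/to/input
#   exit $?
```

Any other nonzero status is still treated as success by the grid engine system, with the exit code recorded in the accounting file.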
The following table lists the consequences of error codes or exit codes of jobs
related to parallel environment (PE) configuration.
Table 6–2 Parallel-Environment-Related Error or Exit Codes
| Script/Method | Exit or Error Code | Consequence                                |
| pe_start      | 0                  | Success                                    |
|               | Rest               | Queue set to error state, job requeued     |
| pe_stop       | 0                  | Success                                    |
|               | Rest               | Queue set to error state, job not requeued |
The following table lists the consequences of error codes or exit codes of jobs
related to queue configuration. These codes are valid only if the corresponding methods
were overridden.
Table 6–3 Queue-Related Error or Exit Codes
| Script/Method | Exit or Error Code | Consequence                       |
| Job starter   | 0                  | Success                           |
|               | Rest               | Success, no other special meaning |
| Suspend       | 0                  | Success                           |
|               | Rest               | Success, no other special meaning |
| Resume        | 0                  | Success                           |
|               | Rest               | Success, no other special meaning |
| Terminate     | 0                  | Success                           |
|               | Rest               | Success, no other special meaning |
The following table lists the consequences of error or exit codes of jobs related
to checkpointing.
Table 6–4 Checkpointing-Related Error or Exit Codes
| Script/Method | Exit or Error Code | Consequence |
| Checkpoint    | 0                  | Success |
|               | Rest               | Success. For kernel checkpoint, however, this means that the checkpoint was not successful. |
| Migrate       | 0                  | Success |
|               | Rest               | Success. For kernel checkpoint, however, this means that the checkpoint was not successful. Migration will occur. |
| Restart       | 0                  | Success |
|               | Rest               | Success, no other special meaning |
| Clean         | 0                  | Success |
|               | Rest               | Success, no other special meaning |
For jobs that run successfully, the qacct -j command output
shows a value of 0 in the failed field, and
the output shows the exit status of the job in the exit_status field.
However, the shepherd might not be able to run a job successfully. For example, the
epilog script might fail, or the shepherd might not be able to start the job. In such
cases, the failed field displays one of the code values listed
in the following table.
Table 6–5 qacct -j failed Field Codes
| Code | Description                    | acctvalid | Meaning for Job |
| 0    | No failure                     | t         | Job ran, exited normally |
| 1    | Presumably before job          | f         | Job could not be started |
| 3    | Before writing config          | f         | Job could not be started |
| 4    | Before writing PID             | f         | Job could not be started |
| 5    | On reading config file         | f         | Job could not be started |
| 6    | Setting processor set          | f         | Job could not be started |
| 7    | Before prolog                  | f         | Job could not be started |
| 8    | In prolog                      | f         | Job could not be started |
| 9    | Before pestart                 | f         | Job could not be started |
| 10   | In pestart                     | f         | Job could not be started |
| 11   | Before job                     | f         | Job could not be started |
| 12   | Before pestop                  | t         | Job ran, failed before calling PE stop procedure |
| 13   | In pestop                      | t         | Job ran, PE stop procedure failed |
| 14   | Before epilog                  | t         | Job ran, failed before calling epilog script |
| 15   | In epilog                      | t         | Job ran, failed in epilog script |
| 16   | Releasing processor set        | t         | Job ran, processor set could not be released |
| 24   | Migrating (checkpointing jobs) | t         | Job ran, job will be migrated |
| 25   | Rescheduling                   | t         | Job ran, job will be rescheduled |
| 26   | Opening output file            | f         | Job could not be started, stderr/stdout file could not be opened |
| 27   | Searching requested shell      | f         | Job could not be started, shell not found |
| 28   | Changing to working directory  | f         | Job could not be started, error changing to start directory |
| 100  | Assumedly after job            | t         | Job ran, job killed by a signal |
The Code column lists the value of the failed field. The
Description column lists the text that appears in the qacct -j output.
If acctvalid is set to t, the job accounting
values are valid. If acctvalid is set to f,
the resource usage values of the accounting record are not valid. The Meaning for
Job column indicates whether the job ran or not.
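When many accounting records have to be checked, the mapping in Table 6–5 can be encoded in a small helper. This is only a sketch: the function name failed_code_status and the output words are hypothetical, and the code lists are copied from the table rather than taken from the grid engine sources.

```shell
# Classify a value of the qacct "failed" field according to Table 6-5.
# Prints "ran" when the job ran (acctvalid t), "not-started" when the job
# could not be started (acctvalid f), and "unknown" for unlisted codes.
failed_code_status() {
    case "$1" in
        0|12|13|14|15|16|24|25|100) echo "ran" ;;
        1|3|4|5|6|7|8|9|10|11|26|27|28) echo "not-started" ;;
        *) echo "unknown" ;;
    esac
}

failed_code_status 100
failed_code_status 27
```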
Running Grid Engine System Programs in Debug Mode
For some severe error conditions,
the error-logging mechanism might not yield sufficient information to identify the
problems. Therefore, the grid engine system offers the ability to run almost all ancillary
programs and the daemons in debug mode. Different debug levels
vary in the extent and depth of information that is provided. The debug levels range
from zero through 10, with 10 being the level delivering the most detailed information
and zero turning off debugging.
To set a debug level, an extension to your .cshrc or .profile resource files is provided with the distribution of the grid engine system.
For csh or tcsh users, the file sge-root/util/dl.csh is included. For sh or ksh users, the corresponding file is named sge-root/util/dl.sh. The files must be sourced into
your standard resource file. As a csh or tcsh user,
include the following line in your .cshrc file:
source sge-root/util/dl.csh
As an sh or ksh user, include the following
line in your .profile file:
. sge-root/util/dl.sh
As soon as you log out and log in again, you can use the following
command to set a debug level:
% dl level
If level is greater than 0, starting a grid engine system command forces the command
to write trace output to STDOUT. The trace output can contain warning
messages, status messages, and error messages, as well as the names of the program
modules that are called internally. The messages also include line number information,
which is helpful for error reporting, depending on the debug level you specify.
Note –
To watch a debug trace, you should use a window with a large scroll-line
buffer. For example, you might use a scroll-line buffer of 1000 lines.
Note –
If your window is an xterm, you might want to use the xterm logging mechanism to examine the trace output later on.
If you run one of the grid engine system daemons in debug mode, the daemon keeps its
terminal connection to write the trace output. You can abort the terminal connection
by typing the interrupt character of the terminal emulation you use. For example,
you might use Control-C.
To switch off debug mode, set the debug level back to 0.
Setting the dbwriter Debug Level
The sgedbwriter script starts the dbwriter program.
The script is located in sge_root/dbwriter/bin/sgedbwriter. The sgedbwriter script reads the dbwriter configuration file, dbwriter.conf. This configuration
file is located in sge_root/cell/common/dbwriter.conf. This configuration file sets the debug
level of dbwriter. For example:
#
# Debug level
# Valid values: WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL
#
DBWRITER_DEBUG=INFO
You can use the -debug option of the dbwriter command to change the number of messages that dbwriter produces.
In general, you should use the default debug level, which is info.
If you use a more verbose debug level, you substantially increase the amount of data
output by dbwriter.
You can specify the following debug levels:
- warning: Displays only severe errors and warnings.
- info: Adds a number of informational messages. info is the default debug level.
- config: Gives additional information that is related to dbwriter configuration, for example, about the processing of rules.
- fine: Produces more information. If you choose this debug level, all SQL statements run by dbwriter are output.
- finer: For debugging.
- finest: For debugging.
- all: Displays information for all levels. For debugging.
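Before restarting dbwriter, it can be useful to confirm which level the configuration file actually sets. The following sketch assumes that the file contains a plain DBWRITER_DEBUG=VALUE line as in the excerpt above. The function name dbwriter_debug_level is hypothetical, and the accepted values are the ones listed in the comment of the configuration file.

```shell
# Print the DBWRITER_DEBUG value from a dbwriter.conf-style file and fail
# if the value is not one of the documented levels.
dbwriter_debug_level() {
    conf_file=$1
    # Use the last uncommented assignment and strip the key.
    level=$(sed -n 's/^[[:space:]]*DBWRITER_DEBUG=//p' "$conf_file" | tail -n 1)
    case "$level" in
        WARNING|INFO|CONFIG|FINE|FINER|FINEST|ALL)
            echo "$level" ;;
        *)
            echo "invalid or missing DBWRITER_DEBUG: '$level'" >&2
            return 1 ;;
    esac
}

# Example call (the path follows the location given above):
#   dbwriter_debug_level sge_root/cell/common/dbwriter.conf
```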
Diagnosing Problems
The grid engine system offers several reporting methods to help you diagnose problems.
The following sections outline their uses.
Pending Jobs Not Being Dispatched
Sometimes a pending job is obviously capable of being run, but the job does
not get dispatched. To diagnose the reason, the grid engine system offers a pair of utilities
and options, qstat -j job-id and qalter -w v job-id.
-
qstat -j job-id
When enabled, qstat -j job-id provides
a list of reasons why a certain job was not dispatched in the last scheduling run.
This monitoring can be enabled or disabled. You might want to disable monitoring because
it can cause undesired communication overhead between the schedd daemon
and qmaster. See schedd_job_info in the sched_conf(5) man page. The following example shows output for a job with
the ID 242059:
% qstat -j 242059
scheduling info: queue "fangorn.q" dropped because it is temporarily not available
queue "lolek.q" dropped because it is temporarily not available
queue "balrog.q" dropped because it is temporarily not available
queue "saruman.q" dropped because it is full
cannot run in queue "bilbur.q" because it is not contained in its hard queuelist (-q)
cannot run in queue "dwain.q" because it is not contained in its hard queue list (-q)
has no permission for host "ori"
This information is generated directly by the schedd daemon.
Generating this information takes the current usage of the cluster into account.
Sometimes this information does not provide what you are looking for. For example,
if all queue slots are already occupied by jobs of other users, no detailed message
is generated for the job you are interested in.
-
qalter -w v job-id
This command lists the reasons why a job is not dispatchable in principle. For
this purpose, a dry scheduling run is performed. All consumable resources, as well
as all slots, are considered to be fully available for this job. Similarly, all load
values are ignored because these values vary.
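When the scheduling info list is long, grouping the lines by reason makes the dominant cause easier to see. The sketch below works on saved qstat -j output and relies only on the word "because" that appears in the sample messages above. The function name summarize_sched_info is hypothetical, and lines without "because" (such as the permission message) are ignored.

```shell
# Count scheduling-info lines by the reason that follows "because".
summarize_sched_info() {
    sed -n 's/.*because \(.*\)$/\1/p' | sort | uniq -c | sort -rn
}

# Sample input taken from the qstat -j output shown above.
summarize_sched_info <<'EOF'
scheduling info: queue "fangorn.q" dropped because it is temporarily not available
queue "lolek.q" dropped because it is temporarily not available
queue "balrog.q" dropped because it is temporarily not available
queue "saruman.q" dropped because it is full
has no permission for host "ori"
EOF
```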
Job or Queue Reported in Error State E
Job or queue errors are indicated by an uppercase E in the qstat output.
A job enters the error state when the grid engine system tries to run a job but fails
for a reason that is specific to the job.
A queue enters the error state when the grid engine system tries to run a job but fails
for a reason that is specific to the queue.
The grid engine system offers a set of possibilities for users and administrators to
gather diagnosis information in case of job execution errors. Both the queue and the
job error states result from a failed job execution. Therefore the diagnosis possibilities
are applicable to both types of error states.
-
User abort mail. If jobs are submitted
with the qsub -m a command, abort mail is sent to the address specified
with the -M user[@host] option. The abort mail contains diagnosis information about job errors.
Abort mail is the recommended source of information for users.
-
qacct accounting. If
no abort mail is available, the user can run the qacct -j command.
This command gets information about the job error from the grid engine system's job accounting
function.
-
Administrator abort mail. An administrator
can order administrator mails about job execution problems by specifying an appropriate
email address. See under administrator_mail on the sge_conf(5) man page. Administrator mail contains more detailed diagnosis information
than user abort mail. Administrator mail is the recommended method in case of frequent
job execution errors.
-
Messages files. If no administrator
mail is available, you should investigate the qmaster messages file first. You can find entries that are related to a certain
job by searching for the appropriate job ID. In the default installation, the qmaster messages file is sge-root/cell/spool/qmaster/messages.
You can sometimes find additional information in the messages of the execd daemon from which the job was started. Use qacct -j job-id to discover the host from which the job was started, and search
in sge-root/cell/spool/host/messages for the job ID.
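Searching a messages file for a job ID with a plain text search can also match digits in the time stamp. A minimal sketch that restricts the search to the message text is shown here. The function name job_messages is hypothetical, and the sketch assumes the five-component line format described earlier in this chapter.

```shell
# Print the lines of a messages file whose message text mentions a job ID.
# Usage: job_messages JOB_ID MESSAGES_FILE
job_messages() {
    awk -F'|' -v id="$1" '{
        # Rejoin the message text in case it contains "|".
        text = $5
        for (i = 6; i <= NF; i++) text = text "|" $i
        # Match the ID only as a whole number, so 490 does not match 4901.
        if (text ~ ("(^|[^0-9])" id "([^0-9]|$)")) print
    }' "$2"
}

# Example call (the path follows the default location given above):
#   job_messages 490 sge-root/cell/spool/qmaster/messages
```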
Troubleshooting Common Problems
This section provides information to help you diagnose and respond to the cause
of common problems.
-
Problem — The output file
for your job says, Warning: no access to tty; thus no job control in this
shell....
-
Possible cause — One or more
of your login files contain an stty command. These commands are
useful only if a terminal is present.
-
Possible solution — No terminal
is associated with batch jobs. You must remove all stty commands
from your login files, or you must bracket such commands with an if statement.
The if statement should check for a terminal before processing.
The following example shows an if statement:
/bin/csh:
# stty -g checks the terminal status
stty -g
# $status is 0 only if a terminal is present
if ($status == 0) then
<put all stty commands in here>
endif
-
Problem — The job standard
error log file says `tty`: Ambiguous. However,
no reference to tty exists in the user's shell that is called in
the job script.
-
Possible cause — shell_start_mode is, by default, posix_compliant. Therefore
all job scripts run with the shell that is specified in the queue definition. The
scripts do not run with the shell that is specified on the first line of the job script.
-
Possible solution — Use the -S flag to the qsub command, or change shell_start_mode to unix_behavior.
-
Problem — You can run your
job script from the command line, but the job script fails when you run it using the qsub command.
-
Possible cause — Process
limits might be set for your job. To test whether limits are being set, write
a test script that runs the limit and limit -h commands.
Run both commands interactively at the shell prompt and through the qsub command, then compare the results.
-
Possible solution — Remove
any commands in configuration files that set limits in your shell.
-
Problem — Execution
hosts report a load of 99.99.
-
Possible cause — The execd daemon is not running on the host.
Possible solution — As root, start up the execd daemon on the execution host by running the $SGE_ROOT/default/common/rcsge script.
-
Possible cause — A default
domain is incorrectly specified.
Possible solution — As the grid engine system administrator, run the qconf -mconf command and change the default_domain variable to none.
-
Possible cause — The qmaster host sees the name of the execution host as different from the name
that the execution host sees for itself.
Possible solution — If you are using DNS to resolve the host names of your
compute cluster, configure /etc/hosts and NIS to return the fully
qualified domain name (FQDN) as the primary host name. You can still define
and use the short alias name, for example, 168.0.0.1 myhost.dom.com myhost.
If you are not using DNS, make sure that
all of your /etc/hosts files and your NIS table are consistent,
for example, 168.0.0.1 myhost.corp myhost or 168.0.0.1 myhost.
-
Problem — Every 30 seconds
a warning that is similar to the following message is printed to cell/spool/host/messages:
Tue Jan 23 21:20:46 2001|execd|meta|W|local configuration meta not defined - using global configuration
But cell/common/local_conf contains
a file for each host, with FQDN.
-
Possible cause — Host name resolution on the machine meta returns the short name,
but on the master machine, meta resolves to the FQDN.
-
Possible solution — Make
sure that all of your /etc/hosts files and your NIS table are consistent
in this respect. In this example, a line such as the following text could erroneously
be included in the /etc/hosts file of the host meta:
168.0.0.1 meta meta.your.domain
The line should instead be:
168.0.0.1 meta.your.domain meta
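Entries of this kind can be found mechanically. The following sketch scans a file in /etc/hosts format and reports lines where a short name is listed before a fully qualified name. The function name check_fqdn_first is hypothetical, and the sketch treats any name that contains a dot as fully qualified, which is an assumption for the example.

```shell
# Report hosts-file entries whose first name is a short name even though a
# name with a domain part appears later on the same line.
check_fqdn_first() {
    awk '
        /^[ \t]*(#|$)/ { next }                # skip comments and blank lines
        {
            if (index($2, ".") == 0) {         # primary name has no domain part
                for (i = 3; i <= NF; i++) {
                    if (substr($i, 1, 1) == "#") break
                    if (index($i, ".") > 0) {
                        printf "line %d: %s listed before %s\n", NR, $2, $i
                        break
                    }
                }
            }
        }
    ' "$1"
}

# Example call:
#   check_fqdn_first /etc/hosts
```

Run the check on the master host and on each execution host, because the problem arises when the hosts disagree about the primary name.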
-
Problem — Occasionally you
see CHECKSUM ERROR, WRITE ERROR, or READ ERROR messages in the messages files of the daemons.
-
Problem — Jobs finish on
a particular queue and return the following message in qmaster/messages:
Wed Mar 28 10:57:15 2001|qmaster|masterhost|I|job 490.1 finished on host exechost
Then you see the following error messages in the execution host's exechost/messages file:
Wed Mar 28 10:57:15 2001|execd|exechost|E|can't find directory "active_jobs/490.1" for reaping job 490.1
Wed Mar 28 10:57:15 2001|execd|exechost|E|can't remove directory "active_jobs/490.1": opendir(active_jobs/490.1) failed: Input/output error
-
Possible cause — The $SGE_ROOT directory, which is automounted, is being unmounted, causing the sge_execd daemon to lose its current working directory.
-
Possible solution — Use a
local spool directory for your execd host. Set the parameter execd_spool_dir, using qmon or the qconf command.
-
Problem — When submitting
interactive jobs with the qrsh utility, you get the following error
message:
% qrsh -l mem_free=1G
error: error: no suitable queues
However, queues are available for submitting batch jobs with the qsub command. These queues can be queried using qhost -l mem_free=1G and qstat -f -l mem_free=1G.
-
Possible cause — The message error: no suitable queues results from the -w e submit
option, which is active by default for interactive jobs such as qrsh.
Look for -w e on the qrsh(1) man page. This
option causes the submit command to fail if the qmaster does not
know for sure that the job is dispatchable according to the current cluster configuration.
The intention of this mechanism is to decline job requests in advance, in case the
requests can't be granted.
-
Possible solution — In this
case, mem_free is configured to be a consumable resource, but you
have not specified the amount of memory that is to be available at each host. The
memory load values are deliberately not considered for this check because memory load
values vary. Thus they can't be seen as part of the cluster configuration. You can
do one of the following:
-
Omit this check generally by explicitly
overriding the qrsh default option -w e with
the -w n option. You can also put this option into sge-root/cell/common/sge_request.
-
If you intend to manage mem_free as a consumable resource, specify the mem_free capacity
for your hosts in complex_values of host_conf by
using qconf -me hostname.
-
If you don't intend to manage mem_free as a consumable resource, make it a nonconsumable resource again
in the consumable column of complex(5) by using qconf -mc.
-
Problem — qrsh won't
dispatch to the same node it is on. From a qsh shell you get a
message such as the following:
host2 [49]% qrsh -inherit host2 hostname
error: executing task of job 1 failed:
host2 [50]% qrsh -inherit host4 hostname
host4
-
Possible cause — gid_range is not sufficient. gid_range should be defined
as a range, not as a single number. The grid engine system assigns each job on a host a distinct gid.
-
Possible solution — Adjust
the gid_range with the qconf -mconf command
or with QMON. The suggested range is as follows:
-
Problem — qrsh -inherit
-V does not work when used inside a parallel job. You get the following
message:
cannot get connection to "qlogin_starter"
-
Possible cause — This problem
occurs with nested qrsh calls. The problem is caused by the -V option. The first qrsh -inherit call sets the environment
variable TASK_ID. TASK_ID is the ID of the tightly
integrated task within the parallel job. The second qrsh -inherit call
uses this environment variable for registering its task. The command fails as it tries
to start a task with the same ID as the already-running first task.
-
Possible solution — You can
either unset TASK_ID before calling qrsh -inherit,
or choose to use the -v option instead of -V.
This option exports only the environment variables that you really need.
-
Problem — qrsh does
not seem to work at all. Messages like the following are generated:
host2$ qrsh -verbose hostname
local configuration host2 not defined - using global configuration
waiting for interactive job to be scheduled ...
Your interactive job 88 has been successfully scheduled.
Establishing /share/gridware/utilbin/solaris64/rsh session
to host exehost ...
rcmd: socket: Permission denied
/share/gridware/utilbin/solaris64/rsh exited with exit code 1
reading exit code from shepherd ...
error: error waiting on socket for client to connect:
Interrupted system call
error: error reading return code of remote command
cleaning up after abnormal exit of
/share/gridware/utilbin/solaris64/rsh
host2$
-
Possible cause — Permissions
for qrsh are not set properly.
-
Possible solution — Check
the permissions of the following files, which are located in $SGE_ROOT/utilbin/. (Note that rlogin and rsh must be setuid and owned by root.)
-r-s--x--x 1 root root 28856 Sep 18 06:00 rlogin*
-r-s--x--x 1 root root 19808 Sep 18 06:00 rsh*
-rwxr-xr-x 1 sgeadmin adm 128160 Sep 18 06:00 rshd*
Note –
The sge-root directory also needs to be NFS-mounted
with the setuid option. If sge-root is
mounted with nosuid from your submit client, qrsh and
associated commands will not work.
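The permission check can be scripted against a long directory listing. The sketch below reads ls -l lines on standard input and reports rlogin or rsh entries that are not setuid or not owned by root. The function name check_qrsh_helpers is hypothetical, and the sketch assumes the usual ls -l column order (mode, links, owner, group, size, date, name).

```shell
# Read "ls -l" lines on stdin and flag rlogin/rsh entries that lack the
# setuid bit (an "s" in the owner-execute position) or root ownership.
check_qrsh_helpers() {
    awk '
        {
            name = $NF
            sub(/\*$/, "", name)               # drop a trailing "*" marker
            if (name != "rlogin" && name != "rsh") next
            setuid = (substr($1, 4, 1) == "s")
            if (!setuid || $3 != "root")
                printf "%s: expected setuid root, found %s owner %s\n", name, $1, $3
        }
    '
}

# Example call, run on the directory that holds these files:
#   ls -l | check_qrsh_helpers
```

No output means that both files have the expected mode and owner.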
-
Problem – When you try to
start a distributed make, qmake exits with the following error
message:
qrsh_starter: executing child process
qmake failed: No such file or directory
-
Possible cause — The grid engine system starts
an instance of qmake on the execution host. If the grid engine system environment,
especially the PATH variable, is not set up in the user's shell
resource file (.profile or .cshrc), this qmake call fails.
-
Possible solution — Use the -v option to export the PATH environment variable to
the qmake job. A typical qmake call is as follows:
qmake -v PATH -cwd -pe make 2-10 --
-
Problem — When using the qmake utility, you get the following error message:
waiting for interactive job to be scheduled ...timeout (4 s)
expired while waiting on socket fd 5
Your "qrsh" request could not be scheduled, try again later.
-
Possible cause — The ARCH environment variable could be set incorrectly in the shell from which qmake was called.
-
Possible solution – Set the ARCH variable correctly to a supported value that matches an available host
in your cluster, or else specify the correct value at submit time, for example, qmake -v ARCH=solaris64 ...
Typical Accounting and Reporting Console Errors
Problem:
The installation of the Sun Web Console Version 2.0.3 fails with the
following error message:
# ./inst_reporting
...
Register the N1 SGE reporting module in the webconsole
Registering com.sun.grid.arco_6u3.
Starting Sun(TM) Web Console Version 2.0.3...
Ambiguous output redirect.
Solution:
This Sun Web Console version can be installed only by the user noaccess, who must have /bin/sh as the login shell. The user
must be added with the following command:
# useradd -u 60002 -g 60002 -d /tmp -s /bin/sh -c "No Access User" noaccess
Problem:
The table/view dropdown menu of a simple query definition does not
contain any entry, but the tables are defined in the database.
Solution:
The problem normally occurs if Oracle is used as the database. During
the installation of the reporting module the wrong database schema name has been specified.
For Oracle, the database schema name is equal to the name of the database user which
is used by dbwriter (the default name is arco_write).
For Postgres, the database schema name should be public.
Problem:
Connection refused.
Solution:
The smcwebserver might be down. Start or restart
the smcwebserver.
Problem:
The list of queries or the list of results is empty.
Solution:
The cause can be any of the following:
-
The database is down. Start or restart the database.
-
No more database connections are available. Increase the number of
allowable connections to the database.
-
An error exists in the configuration file of the application. Check
the configuration for wrong database users, wrong user passwords, or wrong type of
database, and then restart the application.
-
No queries are available. If the query directory /var/spool/arco/queries is not empty, the following errors might have occurred:
Problem:
The list of available database tables is empty.
Solution:
The cause can be any of the following:
-
The database is down. Start or restart the database.
-
No more database connections are available. Increase the number of
allowable connections to the database.
-
An error exists in the configuration file of the application. Check
the configuration for wrong database users, wrong user passwords, or wrong type of
database, and then restart the application.
Problem:
The list of selectable fields is empty.
Solution:
No table is selected. Select a table from the list.
Problem:
The list of filters is empty.
Solution:
No fields are selected. Define at least one field.
Problem:
The sort list is empty.
Solution:
No fields are selected. Define at least one field.
Problem:
A defined filter is not used.
Solution:
The filter may be inactive. Modify the unused filter and make it active.
Problem:
The late binding in the advanced query is ignored, but the execution
runs into an error.
Solution:
The late binding macro has a syntactical error. The correct syntax for
the late binding macro in the advanced query is as follows:
latebinding{attribute;operator}
latebinding{attribute;operator;defaultvalue}
Problem:
The breadcrumb is used to move back, but the login screen is shown.
Solution:
The session timed out. Log in again, or raise the session time in the app.xml.
Problem:
The view configuration is defined, but the default configuration is
shown.
Solution:
The defined view configuration is not set to be visible. Open the view
configuration and define the view configuration to be used.
Problem:
The view configuration is defined, but the last configuration is shown.
Solution:
The defined view configuration is not set to be visible. Open the view
configuration and define the view configuration to be used.
Problem:
The execution of a query takes a very long time.
Solution:
The results coming from the database are very large. Set a limit for
the results, or extend the filter conditions.