Chapter 9 Fine Tuning, Error Messages, and Troubleshooting
This chapter describes some ways to fine-tune your grid engine system environment.
The chapter also describes the error messaging procedures and offers
tips on how to resolve various common problems.
Fine-Tuning Your Grid Environment
The grid engine system is a full-function, general-purpose distributed
resource management tool. The scheduler component of the system supports
a wide range of different compute farm scenarios. To get the maximum
performance from your compute environment, you should review the features
that are enabled. You should then determine which features you really
need to solve your load management problem. Disabling some of these
features can improve the throughput of your cluster.
Scheduler Monitoring
Scheduler monitoring can help you to find out why certain jobs
are not dispatched. However, providing this information for all jobs
at all times can consume resources. You usually do not need this much
information.
To disable scheduler monitoring, set schedd_job_info to
false in the scheduler configuration. See Changing the Scheduler Configuration With QMON, and the sched_conf(5) man page.
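As a rough illustration, the same change can be made from the command line instead of QMON. This sketch assumes that qconf -msconf opens the scheduler configuration in an editor, as described in the qconf(1) man page. Only the schedd_job_info line comes from the text above.

```text
# Open the scheduler configuration in an editor
% qconf -msconf

# In the editor, set the monitoring parameter as follows:
schedd_job_info                   false
```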
Finished Jobs
In the case of array jobs, the finished job list in qmaster can become quite large. By switching the finished job list
off, you save memory and speed up the qstat process,
because qstat also fetches the finished jobs list.
To turn off the finished job list function, set finished_jobs to zero in the cluster configuration. See Adding and Modifying Global and Host Configurations With QMON, and the sge_conf(5) man page.
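A minimal sketch of this setting, assuming you edit the global cluster configuration with qconf -mconf rather than QMON:

```text
# Open the global cluster configuration in an editor
% qconf -mconf

# In the editor, switch off the finished job list:
finished_jobs                     0
```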
Job Validation
Forced validation at job submission time can be a valuable procedure
to prevent nondispatchable jobs from forever remaining in a pending
state. However, job validation can also be a time-consuming task.
Job validation can be especially time-consuming in heterogeneous environments
with different execution nodes and consumable resources, and in which
all users have their own job profiles. In homogeneous environments
with only a few different jobs, a general job validation usually can
be omitted.
To disable job validation, add the qsub option -w n to the cluster-wide default requests. See Submitting Advanced Jobs With QMON in Sun N1 Grid Engine 6.1 User’s Guide, and the sge_request(5) man page.
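For illustration, the cluster-wide default request file might then contain a line like the one below. The file location sge-root/cell/common/sge_request is the one named later in this chapter. The comment lines are illustrative additions.

```text
# sge-root/cell/common/sge_request
# Cluster-wide default requests: skip job validation at submission time
-w n
```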
Load Thresholds and Suspend Thresholds
Load thresholds are needed if you deliberately oversubscribe
your machines and you need to prevent excessive system load. Suspend
thresholds are also used to prevent overloading the system.
Another case where you want to prevent the overloading of a
node is when the execution node is still open for interactive load.
Interactive load is not under the control of the grid engine system.
A compute farm might be more single-purpose. For example, each
CPU at a compute node might be represented by only one queue slot,
and no interactive load might be expected at these nodes. In such
cases, you can omit load_thresholds.
To disable both thresholds, set load_thresholds to none and suspend_thresholds to none. See Configuring Load and Suspend Thresholds, and the queue_conf(5) man page.
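The following fragment sketches what the two queue parameters might look like after the change. The queue name all.q is a hypothetical placeholder. Use the name of the queue you want to modify.

```text
# Open the queue configuration in an editor (all.q is a placeholder name)
% qconf -mq all.q

# In the editor, disable both thresholds:
load_thresholds                   NONE
suspend_thresholds                NONE
```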
Load Adjustments
Load adjustments are used to increase the measured load after
a job is dispatched. This mechanism prevents oversubscription of machines
that is caused by the delay between job dispatching and the corresponding
load impact. You can switch off load adjustments if you do not need
them. Load adjustments give the scheduler additional work
when it sorts hosts and verifies load thresholds.
To disable load adjustments, set job_load_adjustments to none and load_adjustment_decay_time to
zero in the scheduler configuration. See Changing the Scheduler Configuration With QMON, and the sched_conf(5) man page.
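A sketch of the corresponding scheduler configuration entries follows. The h:m:s notation for the zero decay time is an assumption based on the time format that sched_conf(5) uses for this parameter. Verify the exact notation on your system.

```text
# Open the scheduler configuration in an editor
% qconf -msconf

# In the editor, disable load adjustments:
job_load_adjustments              NONE
load_adjustment_decay_time        0:0:0
```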
Immediate Scheduling
By default, the grid engine system starts scheduling runs at
a fixed schedule interval. An advantage of fixed intervals is that
they limit the CPU time consumption of the qmaster and
the scheduler. A drawback is that fixed intervals throttle the scheduler,
artificially limiting throughput. Many compute farms
have machines specifically dedicated to qmaster and
the scheduler, and such setups provide no reason to throttle the scheduler.
See schedule_interval in sched_conf(5).
You can configure immediate scheduling by using the flush_submit_sec and flush_finish_sec parameters of the
scheduler configuration. See Changing the Scheduler Configuration With QMON, and the sched_conf(5) man page.
If immediate scheduling is activated, the throughput of a compute
farm is limited only by the power of the machine that is hosting sge_qmaster and the scheduler.
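As a hedged example, the two flush parameters might be set as shown below. The value 1 is an illustrative assumption, not a recommendation from this guide. Check sched_conf(5) for the meaning of specific values in your release.

```text
# Open the scheduler configuration in an editor
% qconf -msconf

# In the editor, trigger a scheduling run shortly after each
# job submission and each job completion (values are illustrative):
flush_submit_sec                  1
flush_finish_sec                  1
```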
Urgency Policy and Resource Reservation
The urgency policy enables you to customize job priority schemes
that are resource-dependent. Such job priority schemes include the
following:
Implementing both objectives is especially valuable if
you are using resource reservation.
Using DTrace for Performance Tuning
Troubleshooting in a distributed system that spans potentially
thousands of active components can challenge even the most experienced
system administrator. In practice, Grid Engine administrators have
no explicit mechanism for identifying and reproducing issues that
lead to degraded performance in their production environments. In
the Solaris 10 environment, you can use the DTrace utility to monitor
the on-site performance of the Grid Engine master component. DTrace
is a comprehensive framework for tracing dynamic events in Solaris
10 environments. For general information about DTrace, see http://www.sun.com/bigadmin/content/dtrace/ and
the dtrace man page. For detailed information about using DTrace with N1 Grid Engine 6.1 software,
view the $SGE_ROOT/dtrace/README_dtrace.txt file.
Tuning Performance from the Command Line through DTrace
If you can use Solaris 10 DTrace, you can use the $SGE_ROOT/dtrace/monitor.sh script to monitor a Grid Engine master and look for any
bottlenecks. The monitor.sh script supports
the following options:
-
-interval value
Specifies the statistics interval to use. The default is 15sec. A larger interval results in coarser statistics,
while a smaller value provides more refined results. Most useful values
range from 1sec to 24hours.
-
-cell cell-name
Required if $SGE_CELL is not “default.”
-
-spooling
Displays qmaster spooling probes
in addition to statistics. This option enables you to view more specific
information about a presumed spooling bottleneck.
-
-requests
Shows incoming qmaster request
probes. This option enables you to view more specific information
to evaluate instances in which someone is flooding your qmaster.
Note –
Any critical, error, or warning messages appear in monitor.sh output.
Analyzing Bottlenecks on the Grid Engine Master
To provide effective performance tuning, you must understand
the bottlenecks of distributed systems. The $SGE_ROOT/dtrace/monitor.sh script measures throughput-relevant data of the running
Grid Engine master and compiles this data into a few indices that
are printed in a single-line view per interval. This view shows four
main categories of information:
-
Spooling — Indicates the number of operations
that spooled to the qmaster process and the elapsed time
-
Request handling — Shows the number of messages
sent and received of various types, such as reports, GDI requests,
and ACK messages.
-
Scheduling — Indicates the number of scheduling
requests sent to the schedd process and the elapsed time
-
Qmaster processing — Includes information about
qmaster/schedd communications, qmaster request I/O activities, and
qmaster lock and unlock requests.
For more information, see the example below.
Sample DTrace Output for Bottleneck Analysis
The following monitoring output sample illustrates a case where
a Grid Engine master bottleneck can be detected. The example shows
the following information:
-
For qmaster spooling activities:
-
#wrt — Number of qmaster
write operations via spool_write_object() and spool_delete_object().
Almost every significant write operation goes through these functions.
-
wrt/ms — Total time all threads
spend in spool_write_object() in microseconds.
-
For qmaster message processing:
-
#rep — Number of reports
qmaster processed through sge_c_report(). Most data sent by execd
functions to qmaster are reflected here.
-
#gdi — Number of GDI requests
qmaster processed through do_gdi_request(). Almost anything sent
from client commands arrives as a GDI request, although GDI requests
can also come from execd functions and the scheduler.
-
#ack — Number of ACK messages
qmaster processed through do_c_ack(). High numbers of ACK messages
might indicate job signalling, although ACK messages are also used
for other purposes.
-
For schedd scheduling activities:
-
#dsp — Number of calls to
dispatch_jobs() in schedd. Each call to dispatch_jobs() can be seen as
a scheduling run.
-
dsp/ms — Total time scheduler
spent in all calls to dispatch_jobs().
-
#sad — Number of calls to
select_assign_debit(). Each call to select_assign_debit() can be
seen as an attempt by the scheduler to find an assignment or a reservation
for a job.
-
For qmaster processing:
-
#snd — Number of event packages
qmaster sends to schedd. If that number stays at zero for an extended
period, something is wrong and qmaster and schedd are out of sync.
-
#rcv — Number of event packages
schedd receives from qmaster. If that number stays at zero for an
extended period, something is wrong and qmaster and schedd are out of sync.
-
#in++ — Number of messages
added to qmaster received messages buffer.
-
#in-- — Number of messages
removed from qmaster received messages buffer. If more messages are
added than removed during an interval, the backlog of messages not yet
processed grows.
-
#out++ — Number of messages
added to qmaster send messages buffer.
-
#out-- — Number of messages
removed from qmaster send messages buffer. If more messages are added
than removed during an interval, the backlog of messages not yet delivered
grows.
-
#lck0/#ulck0 — Number of
calls to sge_lock()/sge_unlock() for qmaster “global”
lock. This lock must always be obtained when qmaster-internal lists
(job list, queue list, etc.) are accessed.
-
#lck1/#ulck1 — Number of
calls to sge_lock()/sge_unlock() for qmaster “master_config”
lock. This lock is a secondary lock, but is also important.
Note –
The specific columns displayed on your system might differ
from the example.
In this example, performance degraded between 17:40:32 and 17:41:05.
CPU ID FUNCTION:NAME
0 1 :BEGIN Time | #wrt wrt/ms |#rep #gdi #ack| #dsp dsp/ms #sad| #snd #rcv| #in++ #in-- #out++ #out--| #lck0 #ulck0 #lck1 #ulck1
0 36909 :tick-3sec 2006 Nov 24 17:39:23 | 43 3| 0 8 4| 3 691 121| 4 4| 11 11 15 15| 68 68 289 288
0 36909 :tick-3sec 2006 Nov 24 17:39:26 | 83 16| 0 10 3| 3 699 122| 3 3| 14 13 17 17| 90 90 681 681
0 36909 :tick-3sec 2006 Nov 24 17:39:29 | 117 24| 0 9 4| 4 1092 198| 4 4| 13 13 17 17| 71 71 591 591
0 36909 :tick-3sec 2006 Nov 24 17:39:32 | 19 4| 0 9 3| 3 591 147| 3 3| 12 12 15 15| 44 43 249 249
0 36909 :tick-3sec 2006 Nov 24 17:39:35 | 144 28| 0 9 4| 4 1012 173| 4 4| 13 13 17 17| 61 62 1246 1247
0 36909 :tick-3sec 2006 Nov 24 17:39:38 | 46 5| 0 8 3| 3 705 122| 3 3| 11 11 14 14| 67 67 293 293
0 36909 :tick-3sec 2006 Nov 24 17:39:41 | 154 31| 0 9 3| 4 894 198| 3 3| 13 13 16 16| 73 72 968 969
0 36909 :tick-3sec 2006 Nov 24 17:39:44 | 46 5| 0 10 4| 4 971 162| 4 4| 13 13 17 17| 71 72 304 304
0 36909 :tick-3sec 2006 Nov 24 17:39:47 | 154 29| 0 8 3| 3 739 158| 3 3| 11 11 14 14| 67 67 990 990
0 36909 :tick-3sec 2006 Nov 24 17:39:50 | 46 5| 0 10 4| 4 815 162| 4 4| 14 14 18 18| 76 76 692 693
0 36909 :tick-3sec 2006 Nov 24 17:39:53 | 74 15| 0 8 3| 3 746 136| 3 3| 12 12 15 15| 54 53 571 571
0 36909 :tick-3sec 2006 Nov 24 17:39:56 | 116 20| 0 11 4| 4 992 184| 4 4| 14 14 18 18| 80 81 669 669
0 36909 :tick-3sec 2006 Nov 24 17:39:59 | 87 18| 0 11 4| 4 851 176| 5 4| 15 15 21 21| 77 76 670 670
0 36909 :tick-3sec 2006 Nov 24 17:40:02 | 109 20| 0 12 5| 4 930 184| 4 5| 17 17 20 20| 77 78 624 624
0 36909 :tick-3sec 2006 Nov 24 17:40:05 | 88 15| 0 9 3| 4 995 176| 3 3| 12 12 15 15| 71 71 1026 1026
0 36909 :tick-3sec 2006 Nov 24 17:40:08 | 112 20| 0 12 4| 4 927 184| 5 4| 16 16 22 22| 81 81 652 652
0 36909 :tick-3sec 2006 Nov 24 17:40:11 | 32 6| 0 7 4| 3 618 121| 3 4| 11 11 13 13| 54 53 336 336
0 36909 :tick-3sec 2006 Nov 24 17:40:14 | 145 30| 0 11 4| 4 988 199| 4 4| 15 15 19 19| 64 65 827 827
0 36909 :tick-3sec 2006 Nov 24 17:40:17 | 43 3| 0 7 3| 3 618 121| 3 3| 10 10 13 13| 64 64 286 286
0 36909 :tick-3sec 2006 Nov 24 17:40:20 | 157 31| 0 11 4| 4 977 199| 4 4| 15 15 19 19| 80 80 1406 1408
0 36909 :tick-3sec 2006 Nov 24 17:40:23 | 43 4| 0 7 3| 3 701 121| 3 3| 10 10 13 13| 64 64 285 285
0 36909 :tick-3sec 2006 Nov 24 17:40:26 | 73 18| 0 11 4| 4 948 171| 4 4| 15 15 19 19| 77 77 700 700
0 36909 :tick-3sec 2006 Nov 24 17:40:29 | 127 31| 0 10 4| 4 968 189| 4 4| 14 14 18 18| 74 74 584 584
0 36909 :tick-3sec 2006 Nov 24 17:40:32 | 10 3| 0 6 0| 1 203 41| 0 0| 58 8 62 62| 23 22 106 106
0 36909 :tick-3sec 2006 Nov 24 17:40:35 | 19 5| 0 5 0| 0 0 0| 0 0| 8 5 13 13| 30 30 200 200
0 36909 :tick-3sec 2006 Nov 24 17:40:38 | 16 5| 0 5 1| 0 0 0| 0 0| 5 6 10 10| 27 26 558 559
0 36909 :tick-3sec 2006 Nov 24 17:40:41 | 1 0| 0 4 0| 0 0 0| 0 0| 7 4 11 11| 9 9 34 34
0 36909 :tick-3sec 2006 Nov 24 17:40:44 | 0 0| 0 4 0| 0 0 0| 0 0| 7 4 11 11| 8 8 28 28
0 36909 :tick-3sec 2006 Nov 24 17:40:47 | 0 0| 0 6 0| 1 744 81| 1 1| 10 6 15 15| 14 14 33 33
0 36909 :tick-3sec 2006 Nov 24 17:40:50 | 1 0| 0 5 1| 0 0 0| 0 0| 8 6 14 14| 11 11 49 49
0 36909 :tick-3sec 2006 Nov 24 17:40:53 | 0 0| 0 4 0| 0 0 0| 0 0| 9 4 12 12| 6 7 28 28
0 36909 :tick-3sec 2006 Nov 24 17:40:56 | 0 0| 0 5 0| 0 0 0| 0 0| 8 5 13 13| 12 12 420 420
0 36909 :tick-3sec 2006 Nov 24 17:40:59 | 0 0| 0 4 0| 0 0 0| 0 0| 8 4 12 12| 9 8 30 30
0 36909 :tick-3sec 2006 Nov 24 17:41:02 | 0 0| 0 4 1| 0 0 0| 0 0| 12 5 16 16| 7 8 25 25
0 36909 :tick-3sec 2006 Nov 24 17:41:05 | 165 41| 0 48 60| 0 0 0| 1 1| 23 106 71 71| 96 97 1236 1236
0 36909 :tick-3sec 2006 Nov 24 17:41:08 | 178 28| 0 15 53| 4 965 206| 4 4| 68 68 75 75| 130 130 1336 1336
0 36909 :tick-3sec 2006 Nov 24 17:41:11 | 106 23| 0 27 35| 4 855 166| 4 4| 82 82 91 91| 115 114 1040 1040
0 36909 :tick-3sec 2006 Nov 24 17:41:14 | 198 37| 0 41 70| 4 1189 196| 4 4| 185 185 185 185| 134 135 1327 1327
0 36909 :tick-3sec 2006 Nov 24 17:41:17 | 16 5| 0 9 5| 4 940 161| 3 3| 17 17 20 20| 43 42 234 234
0 36909 :tick-3sec 2006 Nov 24 17:41:20 | 162 35| 0 13 8| 4 958 200| 4 4| 23 23 28 28| 80 81 1018 1018
0 36909 :tick-3sec 2006 Nov 24 17:41:23 | 44 6| 0 6 3| 2 544 81| 3 3| 8 8 11 11| 63 63 747 747
0 36909 :tick-3sec 2006 Nov 24 17:41:26 | 150 34| 0 13 6| 4 921 199| 4 4| 21 21 25 25| 73 72 923 923
0 36909 :tick-3sec 2006 Nov 24 17:41:29 | 43 3| 0 5 2| 2 506 81| 2 2| 7 7 9 9| 57 57 260 260
0 36909 :tick-3sec 2006 Nov 24 17:41:32 | 157 37| 0 9 3| 4 978 199| 3 3| 13 13 16 16| 73 72 970 970
0 36909 :tick-3sec 2006 Nov 24 17:41:35 | 43 3| 0 7 3| 2 512 85| 3 3| 9 9 12 12| 61 62 274 274
0 36909 :tick-3sec 2006 Nov 24 17:41:38 | 127 29| 0 8 3| 4 994 185| 3 3| 11 11 14 14| 68 68 1265 1265
0 36909 :tick-3sec 2006 Nov 24 17:41:41 | 66 11| 0 10 4| 4 973 171| 4 4| 14 14 18 18| 67 67 354 354
0 36909 :tick-3sec 2006 Nov 24 17:41:44 | 48 10| 0 8 3| 3 785 128| 3 3| 11 11 14 14| 52 51 399 399
0 36909 :tick-3sec 2006 Nov 24 17:41:47 | 142 31| 0 12 4| 4 913 192| 5 4| 17 17 23 23| 89 90 830 830
0 36909 :tick-3sec 2006 Nov 24 17:41:50 | 64 13| 0 11 5| 4 853 168| 4 5| 15 15 18 18| 75 75 542 542
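One way to spot a growing receive backlog in output like the sample above is to compare the #in++ and #in-- columns for each interval. The following sketch is not part of the Grid Engine distribution. The function name find_backlog is hypothetical, and the script assumes the column layout shown in the sample, which might differ on your system.

```sh
#!/bin/sh
# Flag monitor.sh intervals in which more messages entered the qmaster
# receive buffer (#in++) than left it (#in--).
# Assumption: columns are laid out as in the sample output above.

find_backlog() {
    awk -F'|' '
        /tick-/ {
            n = split($1, head, " ")   # last token of field 1 is the time
            split($6, buf, " ")        # field 6: #in++ #in-- #out++ #out--
            if (buf[1] + 0 > buf[2] + 0)
                printf "%s backlog +%d\n", head[n], buf[1] - buf[2]
        }
    '
}

# Three lines copied from the sample output above.
sample='0 36909 :tick-3sec 2006 Nov 24 17:40:29 | 127 31| 0 10 4| 4 968 189| 4 4| 14 14 18 18| 74 74 584 584
0 36909 :tick-3sec 2006 Nov 24 17:40:32 | 10 3| 0 6 0| 1 203 41| 0 0| 58 8 62 62| 23 22 106 106
0 36909 :tick-3sec 2006 Nov 24 17:41:05 | 165 41| 0 48 60| 0 0 0| 1 1| 23 106 71 71| 96 97 1236 1236'

backlog=$(printf '%s\n' "$sample" | find_backlog)
printf '%s\n' "$backlog"
```

Of the three sample lines, only the 17:40:32 interval is reported, because 58 messages were added while 8 were removed. To use the function on live data, you could pipe the monitor.sh output into find_backlog.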
How the Grid Engine Software Retrieves Error Reports
The grid engine software reports errors and warnings by logging messages
into certain files or by sending email, or both. The log files include
message files and job STDERR output.
As soon as a job is started, the standard error (STDERR) output of the job script is redirected to a file. The
default file name and location are used, or you can specify the filename
and the location with certain options of the qsub command.
See the grid engine system man pages for detailed information.
Separate messages files exist for the sge_qmaster,
the sge_schedd, and the sge_execds.
The files have the same file name: messages.
The sge_qmaster log file resides in the master
spool directory. The sge_schedd message file resides
in the scheduler spool directory. The execution daemons' log files
reside in the spool directories of the execution daemons. See Spool Directories Under the Root Directory in Sun N1 Grid Engine 6.1 Installation Guide for
more information about the spool directories.
Each message takes up a single line in the files. Each message
is subdivided into five components separated by the vertical bar sign
(|).
The components of a
message are as follows:
-
The first component is a time stamp for the message.
-
The second component specifies the daemon that generates
the message.
-
The third component is the name of the host where
the daemon runs.
-
The fourth component is the message type. The message
type is one of the following:
-
N for notice – for informational
purposes
-
I for info – for informational
purposes
-
W for warning
-
E for error – an error condition
has been detected
-
C for critical – can lead
to a program abort
Use the loglevel parameter in the cluster
configuration to specify on a global basis or a local basis what message
types you want to log.
-
The fifth component is the message text.
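The five-component layout makes messages files easy to process with standard shell tools. The sketch below splits one line into its components. The sample line is the execd warning quoted in the troubleshooting section of this chapter, and the variable names are illustrative.

```sh
#!/bin/sh
# Split one messages-file line into its five components.
line='Tue Jan 23 21:20:46 2001|execd|meta|W|local configuration meta not defined - using global configuration'

# IFS='|' applies only to this read; the last variable takes the rest
# of the line, so a vertical bar inside the message text is preserved.
IFS='|' read -r stamp daemon host type text <<EOF
$line
EOF

printf 'time:   %s\ndaemon: %s\nhost:   %s\ntype:   %s\ntext:   %s\n' \
    "$stamp" "$daemon" "$host" "$type" "$text"
```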
Note –
If an error log file is not accessible for some reason,
the grid engine system tries to log the error message to the files /tmp/sge_qmaster_messages, /tmp/sge_schedd_messages, or /tmp/sge_execd_messages on the corresponding host.
In some circumstances, the grid engine system notifies
users, administrators, or both, about error events by email. The email
messages sent by the grid engine system do not contain a message body. The
message text is fully contained in the mail subject field.
Consequences of Different Error or Exit Codes
The following table lists the consequences of different job-related
error codes or exit codes. These codes are valid for every type of
job.
Table 9–1 Job-Related Error or Exit Codes
| Script/Method | Exit or Error Code | Consequence |
| Job script | 0 | Success |
| | 99 | Requeue |
| | Rest | Success: exit code in accounting file |
| prolog/epilog | 0 | Success |
| | 99 | Requeue |
| | Rest | Queue error state, job requeued |
The following
table lists the consequences of error codes or exit codes of jobs
related to parallel environment (PE) configuration.
Table 9–2 Parallel-Environment-Related Error or Exit Codes
| Script/Method | Exit or Error Code | Consequence |
| pe_start | 0 | Success |
| | Rest | Queue set to error state, job requeued |
| pe_stop | 0 | Success |
| | Rest | Queue set to error state, job not requeued |
The following table lists the consequences
of error codes or exit codes of jobs related to queue configuration.
These codes are valid only if corresponding methods were overwritten.
Table 9–3 Queue-Related Error or Exit Codes
| Script/Method | Exit or Error Code | Consequence |
| Job starter | 0 | Success |
| | Rest | Success, no other special meaning |
| Suspend | 0 | Success |
| | Rest | Success, no other special meaning |
| Resume | 0 | Success |
| | Rest | Success, no other special meaning |
| Terminate | 0 | Success |
| | Rest | Success, no other special meaning |
The following table
lists the consequences of error or exit codes of jobs related to checkpointing.
Table 9–4 Checkpointing-Related Error or Exit Codes
| Script/Method | Exit or Error Code | Consequence |
| Checkpoint | 0 | Success |
| | Rest | Success. For kernel checkpoint, however, this means that the checkpoint was not successful. |
| Migrate | 0 | Success |
| | Rest | Success. For kernel checkpoint, however, this means that the checkpoint was not successful. Migration will occur. |
| Restart | 0 | Success |
| | Rest | Success, no other special meaning |
| Clean | 0 | Success |
| | Rest | Success, no other special meaning |
Running Grid Engine System Programs in Debug Mode
For some severe error conditions, the error-logging mechanism
might not yield sufficient information to identify the problems. Therefore,
the grid engine system offers the ability to run almost all ancillary programs
and the daemons in debug mode. Different debug
levels vary in the extent and depth of information that is provided.
The debug levels range from zero through 10, with 10 being the level
delivering the most detailed information and zero turning off debugging.
To set a debug level, an extension to your .cshrc or .profile resource files is provided with the distribution
of the grid engine system. For csh or tcsh users,
the file sge-root/util/dl.csh is
included. For sh or ksh users,
the corresponding file is named sge-root/util/dl.sh. The files must be sourced into your standard
resource file. As a csh or tcsh user,
include the following line in your .cshrc file:
source sge-root/util/dl.csh
As an sh or ksh user, include
the following line in your .profile file:
. sge-root/util/dl.sh
As soon as you log out and log in again, you can use the
following command to set a debug level:
% dl level
If level is greater than 0,
starting a grid engine system command forces the command to write trace output
to STDOUT. The trace output can contain warning
messages, status messages, and error messages, as well as the names
of the program modules that are called internally. The messages also
include line number information, which is helpful for error reporting,
depending on the debug level you specify.
Note –
To watch a debug trace, you should use a window with a
large scroll-line buffer. For example, you might use a scroll-line
buffer of 1000 lines.
Note –
If your window is an xterm, you might
want to use the xterm logging mechanism to examine
the trace output later on.
If you run one of the grid engine system daemons in debug mode, the
daemons keep their terminal connection to write the trace output.
You can abort the terminal connections by typing the interrupt character
of the terminal emulation you use. For example, you might use Control-C.
To switch off debug mode, set the debug level back to 0.
Setting the dbwriter Debug Level
The sgedbwriter script starts the dbwriter program. The script is located in sge_root/dbwriter/bin/sgedbwriter. The sgedbwriter script reads the dbwriter configuration
file, dbwriter.conf. This configuration file is
located in sge_root/cell/common/dbwriter.conf. This configuration
file sets the debug level of dbwriter. For example:
#
# Debug level
# Valid values: WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL
#
DBWRITER_DEBUG=INFO
You can use the -debug option of the dbwriter command to change the number of messages that the dbwriter produces. In general, you should use the default
debug level, which is info. If you use a more verbose
debug level, you substantially increase the amount of data output
by dbwriter.
You can specify the following debug levels:
-
warning
Displays only severe errors and warnings.
-
info
Adds a number of informational messages. info is
the default debug level.
-
config
Gives additional information that is related to dbwriter configuration, for example, about the processing
of rules.
-
fine
Produces more information. If you choose this debug
level, all SQL statements run by dbwriter are output.
-
finer
For debugging.
-
finest
For debugging.
-
all
Displays information for all levels. For debugging.
Diagnosing Problems
The grid engine system offers several reporting methods to help you
diagnose problems. The following sections outline their uses.
Pending Jobs Not Being Dispatched
Sometimes a pending job is obviously capable of being run, but
the job does not get dispatched. To diagnose the reason, the grid engine system offers
a pair of utilities and options, qstat -j job-id and qalter -w v job-id.
-
qstat -j job-id
When enabled, qstat
-j job-id provides a list of
reasons why a certain job was not dispatched in the last scheduling
run. This monitoring can be enabled or disabled. You might want to
disable monitoring because it can cause undesired communication overhead
between the sge_schedd daemon and sge_qmaster. See schedd_job_info in the sched_conf(5) man page. The following example shows output for a job
with the ID 242059:
% qstat -j 242059
scheduling info: queue "fangorn.q" dropped because it is temporarily not available
queue "lolek.q" dropped because it is temporarily not available
queue "balrog.q" dropped because it is temporarily not available
queue "saruman.q" dropped because it is full
cannot run in queue "bilbur.q" because it is not contained in its hard queuelist (-q)
cannot run in queue "dwain.q" because it is not contained in its hard queue list (-q)
has no permission for host "ori"
This information is generated directly by the sge_schedd daemon
and takes the current usage of the
cluster into account. Sometimes this information does not provide
what you are looking for. For example, if all queue slots are already
occupied by jobs of other users, no detailed message is generated
for the job you are interested in.
-
qalter -w v job-id
This command lists the reasons why a job is not dispatchable
in principle. For this purpose, a dry scheduling run is performed.
All consumable resources, as well as all slots, are considered to
be fully available for this job. Similarly, all load values are ignored
because these values vary.
Job or Queue Reported in Error State E
Job or queue errors are indicated by an uppercase E in
the qstat output.
A job enters the error state when the grid engine system tries to run
a job but fails for a reason that is specific to the job.
A queue enters the error state when the grid engine system tries to
run a job but fails for a reason that is specific to the queue.
The grid engine system offers a set of possibilities for users and administrators
to gather diagnosis information in case of job execution errors. Both
the queue and the job error states result from a failed job execution.
Therefore the diagnosis possibilities are applicable to both types
of error states.
-
User
abort mail. If jobs are submitted with the qsub
-m a command, abort mail is sent to the address specified
with the -M user[@host] option. The abort mail contains diagnosis information about
job errors. Abort mail is the recommended source of information for
users.
-
qacct accounting. If no abort mail is available,
the user can run the qacct -j command. This command
gets information about the job error from the grid engine system's job accounting
function.
-
Administrator abort mail. An
administrator can order administrator mails about job execution problems
by specifying an appropriate email address. See administrator_mail in the sge_conf(5) man page. Administrator
mail contains more detailed diagnosis information than user abort
mail. Administrator mail is the recommended method in case of frequent
job execution errors.
-
Messages files. If
no administrator mail is available, you should investigate the qmaster messages file first. You can find
entries that are related to a certain job by searching for the appropriate
job ID. In the default installation, the sge_qmaster messages file is sge-root/cell/spool/qmaster/messages.
You
can sometimes find additional information in the messages of the sge_execd daemon from which the job was started. Use qacct
-j job-id to discover the host
from which the job was started, and search in sge-root/cell/spool/host/messages for the job ID.
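Searching a messages file for a job ID can be done with standard tools. The following sketch prints the message type and text of every entry whose text mentions a given job ID. The function name job_messages is hypothetical, and the demo file holds two message lines taken from this chapter rather than a real spool file. The match is a plain substring test, so an ID such as 490.1 also matches 490.10.

```sh
#!/bin/sh
# Print type and text of every message that mentions a given job ID.
# Usage: job_messages FILE JOB_ID
job_messages() {
    awk -F'|' -v id="$2" 'index($5, id) { print $4 ": " $5 }' "$1"
}

# Demo on a throwaway file that stands in for a real messages file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Wed Mar 28 10:57:15 2001|qmaster|masterhost|I|job 490.1 finished on host exechost
Tue Jan 23 21:20:46 2001|execd|meta|W|local configuration meta not defined - using global configuration
EOF

found=$(job_messages "$tmp" 490.1)
printf '%s\n' "$found"
rm -f "$tmp"
```

On a real installation, you would pass the path of the qmaster or execd messages file instead of the temporary file.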
Troubleshooting Common Problems
This section provides information to help you diagnose and respond
to the cause of common problems.
-
Problem —
The output file for your job says, Warning: no access
to tty; thus no job control in this shell....
-
Possible cause —
One or more of your login files contain an stty command.
These commands are useful only if a terminal is present.
-
Possible solution —
No terminal is associated with batch jobs. You must remove all stty commands from your login files, or you must bracket such
commands with an if statement. The if statement
should check for a terminal before processing. The following example
shows an if statement:
/bin/csh:
stty -g # checks terminal status
if ($status == 0) then # succeeds if a terminal is present
    <put all stty commands in here>
endif
-
Problem —
The job standard error log file says `tty`:Ambiguous.
However, no reference to tty exists in the user's
shell that is called in the job script.
-
Possible cause — shell_start_mode is, by default, posix_compliant.
Therefore all job scripts run with the shell that is specified in
the queue definition. The scripts do not run with the shell that is
specified on the first line of the job script.
-
Possible solution —
Use the -S flag to the qsub command,
or change shell_start_mode to unix_behavior.
-
Problem —
You can run your job script from the command line, but the job script
fails when you run it using the qsub command.
-
Possible cause —
Process limits might be set for your job. To test whether limits
are being set, write a test script that runs the limit and limit -h commands. Run the script both interactively at
the shell prompt and through the qsub command, and
compare the results.
-
Possible solution —
Remove any commands in configuration files that set limits in your
shell.
-
Problem —
Execution hosts report a load of 99.99.
-
Possible cause —
The sge_execd daemon is not running on the host.
Possible solution —
As root, start up the sge_execd daemon on the execution
host by running the sge-root/cell/common/sgeexecd script.
-
Possible cause —
A default domain is incorrectly specified.
Possible solution — As the grid engine system administrator,
run the qconf -mconf command and change the default_domain variable to none.
-
Possible cause —
The sge_qmaster host sees the name of the execution
host as different from the name that the execution host sees for itself.
Possible solution —
If you are using DNS to resolve the host names of your compute cluster,
configure /etc/hosts and NIS to return the fully
qualified domain name (FQDN) as the primary host name. Of course,
you can still define and use the short alias name, for example, 168.0.0.1 myhost.dom.com myhost.
If you are not using DNS, make sure that
all of your /etc/hosts files and your NIS table
are consistent, for example, 168.0.0.1 myhost.corp myhost or 168.0.0.1 myhost
-
Problem —
Every 30 seconds a warning that is similar to the following message
is printed to cell/spool/host/messages:
Tue Jan 23 21:20:46 2001|execd|meta|W|local configuration meta not defined - using global configuration
But cell/common/local_conf contains
a file for each host, with FQDN.
-
Possible cause —
Host name resolution on your machine meta returns
the short name, but on your master machine, the FQDN of meta
is returned.
-
Possible solution —
Make sure that all of your /etc/hosts files and
your NIS table are consistent in this respect. In this example, a
line such as the following text could erroneously be included in the /etc/hosts file of the host meta:
168.0.0.1 meta meta.your.domain
The line should instead be:
168.0.0.1 meta.your.domain meta.
-
Problem —
Occasionally you see CHECKSUM ERROR, WRITE
ERROR, or READ ERROR messages in the messages files of the daemons.
-
Problem —
Jobs finish on a particular queue and return the following message
in qmaster/messages:
Wed Mar 28 10:57:15 2001|qmaster|masterhost|I|job 490.1 finished on host exechost
Then you see the following error messages in the execution host's exechost/messages file:
Wed Mar 28 10:57:15 2001|execd|exechost|E|can't find directory "active_jobs/490.1" for reaping job 490.1
Wed Mar 28 10:57:15 2001|execd|exechost|E|can't remove directory "active_jobs/490.1": opendir(active_jobs/490.1) failed: Input/output error
-
Possible cause —
The sge-root directory, which is automounted,
is being unmounted, causing the sge_execd daemon
to lose its current working directory.
-
Possible solution —
Use a local spool directory for your sge_execd host.
Set the parameter execd_spool_dir, using QMON or the qconf command.
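To see which spool directory is currently configured, you can filter the output of qconf -sconf. The following is a minimal sketch: the function name conf_value and the host name exechost are placeholders, and the sketch assumes that each parameter is printed on a single line as a name followed by its value.

```shell
# conf_value NAME: read configuration text on standard input and print
# the value of the parameter NAME.
# Hypothetical helper, not part of the grid engine system.
conf_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Example use on a live cluster (exechost is a placeholder):
#   qconf -sconf exechost | conf_value execd_spool_dir
#   qconf -sconf | conf_value execd_spool_dir
```

If the host-specific configuration prints nothing, the host probably inherits the value from the global configuration, which the second command shows.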
-
Problem —
When submitting interactive jobs with the qrsh utility,
you get the following error message:
% qrsh -l mem_free=1G
error: error: no suitable queues
|
However, queues are available for submitting batch jobs with
the qsub command. These queues can be queried using qhost -l mem_free=1G and qstat -f -l mem_free=1G.
-
Possible cause —
The message error: no suitable queues results
from the -w e submit option, which is active by
default for interactive jobs such as qrsh. Look
for -w e on the qrsh(1) man
page. This option causes the submit command to fail if the sge_qmaster does not know for sure that the job is dispatchable according
to the current cluster configuration. The intention of this mechanism
is to decline job requests in advance, in case the requests can't
be granted.
-
Possible solution —
In this case, mem_free is configured to be a consumable
resource, but you have not specified the amount of memory that is
to be available at each host. The memory load values are deliberately
not considered for this check because memory load values vary. Thus
they can't be seen as part of the cluster configuration. You can do
one of the following:
-
Skip this check by
explicitly overriding the qrsh default option -w e with the -w n option. You can also
put this option into sge-root/cell/common/sge_request.
-
If you intend to manage mem_free as a consumable resource, specify the mem_free capacity for your hosts in complex_values of host_conf by using qconf -me hostname.
-
If you don't intend to manage mem_free as a consumable resource, make it a nonconsumable
resource again in the consumable column of complex(5) by using qconf -mc hostname.
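Before choosing between these options, it can help to check whether a host already defines a capacity for the resource. The following is a minimal sketch: the function name host_has_capacity is hypothetical, and the sketch assumes that qconf -se prints complex_values as one comma-separated list on a single line.

```shell
# host_has_capacity RESOURCE: read the output of "qconf -se hostname"
# on standard input and succeed if complex_values defines RESOURCE.
# Hypothetical helper; entries wrapped over several lines are not handled.
host_has_capacity() {
    awk -v res="$1" '
        $1 == "complex_values" {
            n = split($2, pairs, ",")
            for (i = 1; i <= n; i++) {
                split(pairs[i], kv, "=")
                if (kv[1] == res) found = 1
            }
        }
        END { exit (found ? 0 : 1) }'
}

# Example use on a live cluster (hostname is a placeholder):
#   qconf -se hostname | host_has_capacity mem_free || echo "no mem_free capacity"
```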
-
Problem — qrsh won't dispatch to the same node it is on. From a qsh shell you get a message such as the following:
host2 [49]% qrsh -inherit host2 hostname
error: executing task of job 1 failed:
host2 [50]% qrsh -inherit host4 hostname
host4
|
-
Possible cause — gid_range is not sufficient. gid_range should
be defined as a range, not as a single number. The grid engine system assigns
each job on a host a distinct gid.
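Because each job on a host needs its own gid, the number of gids in gid_range limits how many jobs can run on that host at once. The following minimal sketch counts the gids in a setting. The function name gid_range_size is hypothetical, the sketch assumes that the value is a single number, a range of the form n-m, or a comma-separated list of these, and the values in the comments are placeholders, not a recommendation.

```shell
# gid_range_size VALUE: print how many group IDs VALUE provides.
# Hypothetical helper for illustration only.
gid_range_size() {
    echo "$1" | tr ',' '\n' | awk -F- '
        NF == 2 && $2 >= $1 { total += $2 - $1 + 1 }   # range n-m
        NF == 1             { total += 1 }             # single number
        END { print total + 0 }'
}

# Example use (placeholder values):
#   gid_range_size 30000          prints 1, enough for only one job
#   gid_range_size 30000-30100    prints 101
```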
-
Possible solution —
Adjust the gid_range with the qconf -mconf command
or with QMON. The suggested range is as follows:
-
Problem — qrsh -inherit -V does not work when used inside a parallel
job. You get the following message:
cannot get connection to "qlogin_starter"
|
-
Possible cause —
This problem occurs with nested qrsh calls. The
problem is caused by the -V option. The first qrsh -inherit call sets the environment variable TASK_ID. TASK_ID is the ID of the tightly integrated
task within the parallel job. The second qrsh -inherit call
uses this environment variable for registering its task. The command
fails as it tries to start a task with the same ID as the already-running
first task.
-
Possible solution —
You can either unset TASK_ID before calling qrsh
-inherit, or use the -v option instead
of -V. This option exports only the environment
variables that you really need.
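The first workaround can be wrapped in a small helper so that TASK_ID is removed only for the nested call and stays set in the calling shell. This is a minimal sketch; the function name without_task_id is hypothetical.

```shell
# without_task_id COMMAND [ARGS...]: run COMMAND in a subshell in which
# TASK_ID is unset. The calling shell keeps its own TASK_ID.
# Hypothetical helper for illustration only.
without_task_id() {
    (
        unset TASK_ID
        "$@"
    )
}

# Example use inside a parallel job (host4 is a placeholder):
#   without_task_id qrsh -inherit -V host4 hostname
```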
-
Problem — qrsh does not seem to work at all. Messages like the following
are generated:
host2$ qrsh -verbose hostname
local configuration host2 not defined - using global configuration
waiting for interactive job to be scheduled ...
Your interactive job 88 has been successfully scheduled.
Establishing /share/gridware/utilbin/solaris64/rsh session
to host exehost ...
rcmd: socket: Permission denied
/share/gridware/utilbin/solaris64/rsh exited with exit code 1
reading exit code from shepherd ...
error: error waiting on socket for client to connect:
Interrupted system call
error: error reading return code of remote command
cleaning up after abnormal exit of
/share/gridware/utilbin/solaris64/rsh
host2$
|
-
Possible cause —
Permissions for qrsh are not set properly.
-
Possible solution —
Check the permissions of the following files, which are located in sge-root/utilbin/. Note that rlogin and rsh must be setuid and owned by root.
-r-s--x--x 1 root root 28856 Sep 18 06:00 rlogin*
-r-s--x--x 1 root root 19808 Sep 18 06:00 rsh*
-rwxr-xr-x 1 sgeadmin adm 128160 Sep 18 06:00 rshd*
|
Note –
The sge-root directory also
needs to be NFS-mounted with the setuid option.
If sge-root is mounted with nosuid from
your submit client, qrsh and associated commands
will not work.
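The permission check can be scripted against a long listing like the one shown above. The following is a minimal sketch: the function name not_setuid_root is hypothetical, and the sketch assumes the usual nine-column ls -l layout, with the mode string in the first column and the owner in the third.

```shell
# not_setuid_root: read "ls -l" output on standard input and print the
# name of every file that is not both setuid and owned by root.
# Hypothetical helper for illustration only.
not_setuid_root() {
    # Character 4 of the mode string is the owner-execute slot; a
    # lowercase "s" there means setuid with execute permission.
    awk 'NF >= 9 && (substr($1, 4, 1) != "s" || $3 != "root") { print $NF }'
}

# Example use (the directory below sge-root/utilbin depends on your
# architecture; <arch> is a placeholder):
#   ls -l "$SGE_ROOT"/utilbin/<arch>/rlogin "$SGE_ROOT"/utilbin/<arch>/rsh | not_setuid_root
```

No output for rlogin and rsh means that both files have the required ownership and mode. The check does not detect a nosuid mount, which must be verified separately.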
-
Problem –
When you try to start a distributed make, qmake exits
with the following error message:
qrsh_starter: executing child process
qmake failed: No such file or directory
|
-
Possible cause —
The grid engine system starts an instance of qmake on the
execution host. If the grid engine system environment, especially the PATH variable, is not set up in the user's shell resource file
(.profile or .cshrc), this qmake call fails.
-
Possible solution —
Use the -v option to export the PATH environment
variable to the qmake job. A typical qmake call
is as follows:
qmake -v PATH -cwd -pe make 2-10 --
|
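Exporting PATH helps only if the PATH of the submitting shell itself contains qmake. The following minimal sketch checks this before the call; the function name in_path is hypothetical, and the sketch assumes that the execution host sees the same directories as the submit host.

```shell
# in_path COMMAND: succeed if COMMAND can be found through the current PATH.
# Hypothetical helper for illustration only.
in_path() {
    command -v "$1" >/dev/null 2>&1
}

# Example use:
#   in_path qmake && qmake -v PATH -cwd -pe make 2-10 --
```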
-
Problem —
When using the qmake utility, you get the following
error message:
waiting for interactive job to be scheduled ...timeout (4 s)
expired while waiting on socket fd 5
Your "qrsh" request could not be scheduled, try again later.
|
-
Possible cause —
The ARCH environment variable might be set incorrectly
in the shell from which qmake was called.
-
Possible solution –
Set the ARCH variable correctly to a supported value
that matches an available host in your cluster, or else specify the
correct value at submit time, for example, qmake -v ARCH=solaris64
...
-
Problem — If the following job is assigned
to a SuSE Linux system, the -cwd option does not work
and the cwd-test file is created in the user's home
directory.
qsub -l arch=lx24-x86 -cwd -b y -o /dev/null -e /dev/null -V "hostname | tee -a cwd-test"
|
-
Possible cause —
On SuSE Linux systems, the /etc/csh.login file
includes a cd command that changes the current working
directory to the user's home directory. Shells usually source system-wide
profile files. For (t)csh on Linux, the system-wide file is /etc/csh.login.
-
Possible solution –
Use the -noshell option. In this case, there is no
intermediate shell that can source the profile file, and therefore
neither ~/.login nor ~/.cshrc is
sourced.
When using a batch script, use the PWD environment
variable to change the current working directory inside the batch
script before executing any programs.
Another workaround is for users to use neither tcsh nor csh.
Because the cd command appears only in /etc/csh.login, the problem does not occur with other shells.