Chapter 8 Troubleshooting N1 Grid Engine
This chapter tells you how to use the various alerts and the N1 Grid
Engine daemon logs to troubleshoot a grid.
Using N1 Grid Engine Daemon Logs
You use the N1 Grid Engine Daemon Logs page to see a historical view
of all the messages logged by the various N1 Grid Engine daemons. To see the
log file for a particular host, click its host name. To see the log files
for the system hosting the queue, click on a name in the QMASTER column.
Figure 8–1 Daemon Logs List Page
The log file for a particular host contains fields for a Flag, a Time
Stamp, and a Message. The flag tells you what kind of message was logged.
Flags exist for the following message types:
-
N (notice) – for
informational purposes
-
I (info) – for informational
purposes
-
W (warning)
-
E (error) – An error
condition has been detected
-
C (critical) – A critical
condition has been detected, which can lead to a program abort
Use the loglevel parameter in the cluster configuration
to specify on a global basis or a local basis what message types you want
to log.
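As a rough illustration of scanning a log by flag, the following Python sketch keeps only the messages at or above a chosen severity. It assumes that each log line is pipe-delimited with the flag in the fourth field (for example, timestamp|daemon|host|E|message). That layout, the severity ordering, and the names FLAG_SEVERITY, parse_line, and filter_by_flag are assumptions made for this example, so adjust them to match your own log files.

```python
# Flags from the list above, ranked from least to most severe.
# The relative ranking of N and I is an assumption of this sketch.
FLAG_SEVERITY = {"N": 0, "I": 1, "W": 2, "E": 3, "C": 4}


def parse_line(line, flag_field=3):
    """Return (flag, message) for one log line, or None if it does not fit.

    Assumes pipe-delimited fields with the flag at index `flag_field`
    and the message in the field that follows it.
    """
    parts = line.rstrip("\n").split("|", flag_field + 1)
    if len(parts) < flag_field + 2:
        return None
    flag = parts[flag_field]
    if flag not in FLAG_SEVERITY:
        return None
    return flag, parts[flag_field + 1]


def filter_by_flag(lines, minimum="W"):
    """Keep the (flag, message) pairs whose flag is at least `minimum`."""
    floor = FLAG_SEVERITY[minimum]
    kept = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and FLAG_SEVERITY[parsed[0]] >= floor:
            kept.append(parsed)
    return kept


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real log.
    sample = [
        "05/19/2005 14:23:45|qmaster|host1|I|starting up",
        "05/19/2005 14:24:01|qmaster|host1|E|cannot reach execd",
    ]
    for flag, message in filter_by_flag(sample):
        print(flag, message)
```

Lines that do not match the assumed layout are skipped rather than reported, which keeps the sketch short but would hide malformed entries in a real log.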
Troubleshooting Queues
You can use the information on the Queue Alerts page to troubleshoot
any queue problems. You access this page from the Alerts table on the Overview
page. Queue alerts are generated when the queue resource limit parameters
defined in the queue configuration are exceeded. See the queue_conf man page
for a description of these parameters.
Figure 8–2 Queue Alerts List Page
The three types of queue alerts are:
-
Warnings – When resource
limits are exceeded, a warning can be generated before a queue is disabled.
-
Errors – Errors are
generated when a queue makes an invalid request.
-
Disabled – After
receiving a set number of warnings, queues are aborted after the notification
time defined in the queue configuration parameter notify has passed.
The Queue states are:
-
a (alarm) – At least
one of the load thresholds defined in the load_thresholds list
of the queue configuration is currently exceeded. This state prevents N1GE
from scheduling further jobs to that queue. For more information, see the queue_conf man page.
-
A (Alarm) – At least
one of the suspend thresholds of the queue is currently exceeded. This state
causes jobs running in that queue to be successively suspended until no threshold
is violated. For more information, see the queue_conf man
page.
-
c (configuration ambiguous) –
The queue instance configuration specified using sge_conf is
ambiguous. The state resolves when the configuration becomes unambiguous again.
This state prevents you from scheduling further jobs to that queue instance.
You can find detailed reasons why a queue instance entered this state in the sge_qmaster messages file. You can also see the reasons using the qstat command with -explain. For queue instances
in this state, the cluster queue's default settings are used for the ambiguous
attribute.
-
C (Calendar suspended) –
The queue has been suspended automatically using the N1GE calendar
facility. See the calendar_conf man page for more information.
-
d (disabled) – This
setting is assigned to queues and released using the qmod command.
Disabling a queue prevents further jobs from being dispatched to it. Jobs
that are already running in the queue are allowed to finish.
-
D (Disabled) – The
queue has been disabled automatically using the N1GE calendar
facility. See the calendar_conf man page for more information.
-
E (Error) – This
setting appears when the N1GE daemon (sge_execd) on that
host was unable to locate the sge_shepherd executable
on that host in order to start a job. Check that daemon's error log for information
how to resolve the problem. Enable the queue afterwards using the qmod command
with the -c option.
-
o (orphaned) – The
current cluster queue configuration and host group configuration no longer
need this queue instance. The queue instance is kept because unfinished jobs
are still associated with it. The orphaned state prevents you from scheduling
further jobs to that queue instance. It disappears from qstat output
when these jobs finish. To help resolve an orphaned queue instance associated
with a job, use the qdel command. You can revive an orphaned
queue instance by changing the cluster queue configuration so that the configuration
covers that queue instance.
-
s (suspended) – Assigned
to queues and released using the qmod command. Suspending
a queue suspends all jobs executing in that queue.
-
S (Subordinate) –
The queue has been suspended due to subordination to another queue. See the queue_conf man page for details. When a queue is suspended, regardless of the cause,
all jobs executing in that queue are suspended too.
-
u (unknown) – The
corresponding sge_execd(8) cannot be contacted.
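The state letters above can be kept in a small lookup table when you script around queue listings. The following Python sketch decodes a state string into the descriptions from this list. It assumes that a queue can report several letters at once (for example, au), and the names QUEUE_STATES and decode_queue_state are hypothetical helpers, not part of N1GE.

```python
# Queue state letters and short descriptions, taken from the list above.
QUEUE_STATES = {
    "a": "alarm (load threshold exceeded)",
    "A": "Alarm (suspend threshold exceeded)",
    "c": "configuration ambiguous",
    "C": "Calendar suspended",
    "d": "disabled",
    "D": "Disabled (by calendar)",
    "E": "Error",
    "o": "orphaned",
    "s": "suspended",
    "S": "Subordinate",
    "u": "unknown",
}


def decode_queue_state(state):
    """Translate each letter of a state string into its description.

    Letters that are not in the table are reported rather than dropped,
    so an unexpected state is still visible to the reader.
    """
    return [
        QUEUE_STATES.get(letter, "unrecognized state %r" % letter)
        for letter in state
    ]


if __name__ == "__main__":
    for description in decode_queue_state("au"):
        print(description)
```

The lookup is case-sensitive on purpose, because the list distinguishes pairs such as a and A or d and D.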
Troubleshooting Hosts
You can see potential host problems from the Host Alerts page. This
page is available from the Alerts table on the Overview page.
Figure 8–3 Hosts Alerts List Page
You can set an alarm on each of the following host alert parameters so that
if a parameter passes a specified threshold, an alert is generated and appears
in the Alerts table on the Overview page.
-
Load Per CPU – Shows
how efficiently the host's CPU is being used. This parameter can be any positive
decimal number but is usually between zero and 2 or 3. Ideally, this number
should be close to 1. A smaller number could mean the host is underutilized,
and a larger number could mean the host is overutilized. The ideal value
depends on the workload that is being run, so only the local administrator can
judge what a given value implies.
-
Used Mem. – The percentage
of total memory currently being used to execute jobs. If the used memory is
too close to the total memory, then the host could be in trouble. However,
if the workloads are tuned to fit in the server, then it could be perfectly
fine that the used memory is just under the total memory. This behavior is
tunable. You can set the value at which the difference between these two parameters
triggers an alarm. For example, on one host a difference of less than 100 MB might
trigger a warning, while on another host the limit might be 25 MB.
-
Total Mem. – The
total amount of memory on this host.
-
Swap Used – The amount
of free swap space left on this host, measured in MB. In a well-architected
grid, the free swap space should never drop very far below its initial value.
It is possible that temporary drops in this value can be tolerated depending
on how the grid is architected. If this value goes close to zero, then the
host is in danger of failing completely.
-
Date/Time – The timestamp
for when the alert was generated.
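The threshold logic described for Load Per CPU and Used Mem. can be sketched in a few lines of Python. This is only an illustration of the comparison, not how the product evaluates alerts. The function name host_alerts, its parameters, and the default limits (a load of 2.0 per CPU and 100 MB of memory headroom) are assumptions chosen to match the examples in the text.

```python
def host_alerts(load_per_cpu, used_mem_mb, total_mem_mb,
                max_load_per_cpu=2.0, min_mem_headroom_mb=100):
    """Return the names of the host parameters that pass their thresholds.

    load_per_cpu        -- current Load Per CPU value
    used_mem_mb         -- memory in use, in MB
    total_mem_mb        -- total memory on the host, in MB
    max_load_per_cpu    -- hypothetical upper limit for Load Per CPU
    min_mem_headroom_mb -- hypothetical minimum gap between total and used memory
    """
    alerts = []
    if load_per_cpu > max_load_per_cpu:
        alerts.append("Load Per CPU")
    # The memory alarm is based on the difference between total and used memory.
    if total_mem_mb - used_mem_mb < min_mem_headroom_mb:
        alerts.append("Used Mem.")
    return alerts


if __name__ == "__main__":
    # A host with 46 MB of headroom alarms at the 100 MB limit ...
    print(host_alerts(1.0, 4050, 4096))
    # ... but not when the limit is tuned down to 25 MB.
    print(host_alerts(1.0, 4050, 4096, min_mem_headroom_mb=25))
```

The two calls at the end show why the limit is worth tuning per host: the same memory figures alarm under one setting and pass under the other.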
Troubleshooting Jobs
You can view potential job problems from the Job Alerts page. This page
is available from the Alerts table on the Overview page. The Pending Time
and Deadline job alert parameters can be alarmed so that if the values pass
a specified threshold, an alert will be generated and appear on the Overview
Alerts table.
Figure 8–4 Job Alerts List Page
The Job Alerts page shows the following information:
-
Job ID – The unique
identifier for the job. Clicking on the Job ID brings you to the Job Details
page.
-
Task – The currently
executing task. Some jobs consist of a single task, in which case the task
ID is always 1. However, parallel jobs and array jobs each consist of more
than one task. The tasks are usually numbered in ascending order starting
with 1. Depending on how the job was submitted, the numbers might
skip, as in 1, 3, 5. In a running job, each task runs separately and so has its
own configuration information, environment, and trace. For details about the
task, click the task number to display the Task Details page.
-
Job Name – The name
assigned to the job.
-
Pending time – How
long the job has been waiting to be assigned to a queue.
-
Deadline – The time
by which the job must start. If the job has not started by this time, an alarm is generated.
See the qstat man page for more information about
alarms and thresholds.
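The Pending Time and Deadline checks described above amount to two time comparisons. The following Python sketch shows one way to express them. It is an illustration under stated assumptions: the function job_alerts and its parameters are hypothetical, and the sketch treats a job as alarmed only while it has not yet started.

```python
from datetime import datetime, timedelta


def job_alerts(submitted_at, now, max_pending, deadline=None, started=False):
    """Return the names of the job alert parameters that have been passed.

    submitted_at -- when the job was submitted
    now          -- the time at which the check is made
    max_pending  -- hypothetical Pending Time threshold (a timedelta)
    deadline     -- time by which the job must start, or None if not set
    started      -- True once the job has been assigned to a queue
    """
    alerts = []
    if started:
        return alerts
    if now - submitted_at > max_pending:
        alerts.append("Pending Time")
    if deadline is not None and now > deadline:
        alerts.append("Deadline")
    return alerts


if __name__ == "__main__":
    submitted = datetime(2005, 5, 19, 9, 0)
    check_time = datetime(2005, 5, 19, 12, 0)
    print(job_alerts(submitted, check_time, timedelta(hours=2),
                     deadline=datetime(2005, 5, 19, 11, 0)))
```

Passing the current time as an argument, rather than reading the clock inside the function, keeps the check easy to test with fixed timestamps.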