Chapter 5 Working With N1 Grid Engine
Jobs
Each application running on the grid is considered a job.
The following sections describe how you can check a job's state as well as
its utilization of resources and its scheduling policy. This information
is displayed in different views of a job's data, including an overview, a utilization
view, and an allocation view. You can also see fine-grained information about
each job, including details about each job's component tasks.
Checking a Job's State
Use the Jobs Overview tab as a quick way to check a job's State and see some of the factors that might affect its performance. Clicking a job ID displays
a Job Details page that provides very detailed information.
Figure 5–1 Jobs Overview Tab
The fields on the Jobs Overview tab include:
-
State –
The Job state is indicated by the following letters:
-
d (deletion) – Indicates that a job has been deleted (using qdel(1)).
-
r (running) – Indicates that a job is about to be executed
or is already executing.
-
R (restarted) – Indicates that the job was restarted. This
state can be caused by a job migration or by one of the reasons described
in the -r section of the qsub(1) man page.
-
s (suspended) – Shows that an already running job has been
suspended (using qmod(1)).
-
S (suspended) – Shows that an already running job has been
suspended because the queue that it belongs to has been suspended.
-
t (transferring) – Indicates that a job is about to be executed
or is already executing.
-
T (threshold) – Shows that an already running job has been
suspended because at least one suspend threshold of the corresponding queue
was exceeded (for more information, see the queue_conf(5) man page).
-
w (waiting) – Indicates that the job is suspended pending
the availability of a critical resource or specified condition.
See the qstat(1) man page for a detailed
explanation of these state conditions. For more information, you can also
see Monitoring and Controlling Jobs and Queues in the N1
Grid Engine User manual.
-
ID –
The job ID provides a unique identity for the job and also a method of accessing
the Job Details page.
-
Name –
The name of the job. Assigning names to jobs makes them more comprehensible
and easier to track than just relying on job IDs.
-
User –
The name of the user who submitted the job.
-
Project –
The name of the project to which the job is assigned as specified in the qsub(1)
-P option or by the default project of the submitting user.
-
Department –
The name of the department to which the user belongs. Use the -sul and -su options of the qconf command
to display the current department definitions.
-
Priority –
The dispatch priority of the job determining its position in the pending jobs
list. The dispatch priority is a decimal number with higher values denoting
higher priority. The priority value is determined dynamically based on the
ticket and urgency policy setup.
-
Running Time/Pending Time –
The time that has elapsed since the job
started running or, for jobs that are still in the queue, how long
the job has been waiting to run.
-
Task –
The currently executing task. Some jobs consist of a single task, in which
case the task ID is always 1. However, parallel jobs and array jobs each consist of more
than one task. The tasks are usually numbered in ascending order starting
with 1. Depending upon how the job was submitted, the numbers might
skip (for example, 1, 3, 5). For running jobs, each task runs separately and so has its own
configuration information, environment, and trace. For details about the task,
click the task number to display the Task Details page.
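The state letters listed above lend themselves to a simple lookup table. The following is a minimal sketch in Python, not part of the product: the `STATE_LETTERS` table and the `describe_state` helper are hypothetical names, and the sketch assumes that the State column can show several letters at once (for example, a restarted job that is running again).

```python
# Meanings of the state letters, taken from the list above.
STATE_LETTERS = {
    "d": "deletion",
    "r": "running",
    "R": "restarted",
    "s": "suspended (by qmod)",
    "S": "suspended (queue suspended)",
    "t": "transferring",
    "T": "suspended (queue threshold exceeded)",
    "w": "waiting",
}


def describe_state(state):
    """Return one description per letter in a State string.

    Letters that are not in the table are reported as unknown
    rather than guessed at.
    """
    return [
        STATE_LETTERS.get(letter, "unknown (" + letter + ")")
        for letter in state
    ]


if __name__ == "__main__":
    # Assumed example value: a restarted job that is running again.
    print(describe_state("Rr"))
```

The lookup is case sensitive on purpose, because the list above distinguishes s from S and t from T.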
The job's User, Project, and Department are elements that you
can use in an Entitlement policy (also known as a Ticket policy) to affect
a job's dispatch priority. For example, jobs from one Department can always
be entitled to a higher dispatch priority than those from another Department.
Dispatch priority is computed from three top-level scheduling policies: Entitlement,
Urgency, and Custom (also known as POSIX). For more detailed information
on N1GE scheduling policies and dispatch priority, see the sge_priority man
page and Scheduler Policies for Job Prioritization in
the Sun N1 Grid Engine 6 System (www.sun.com/blueprints/1005/819-4325.html).
Checking Grid Resources
Use the Job Utilization View tab to display information that
is relevant to a job's consumption of grid computing resources as well as
other elements that factor into a job's dispatch priority. Unlike the Overview view, this view shows only running and suspended jobs. In the
Utilization view, the columns are as follows:
Figure 5–2 Job Utilization View Tab
-
State –
The Job State is indicated by the following letters:
-
d (deletion) – Indicates that a job has been deleted (using qdel(1)).
-
r (running) – Indicates that a job is about to be executed
or is already executing.
-
R (restarted) – Indicates that the job was restarted. This
can be caused by a job migration or by one of the reasons described
in the -r section of the qsub(1) man page.
-
s (suspended) – Shows that an already running job has been
suspended (using qmod(1)).
-
S (suspended) – Shows that an already running job has been
suspended because the queue that it belongs to has been suspended.
-
t (transferring) – Indicates that a job is about to be executed
or is already executing.
-
T (threshold) – Shows that an already running job has been suspended because at least
one suspend threshold of the corresponding queue was exceeded (see queue_conf(5)).
-
w (waiting) – Indicates that the job is suspended pending
the availability of a critical resource or specified condition.
See the qstat(1) man page for a detailed explanation
of these state conditions. For more information, you can also see Monitoring
and Controlling Jobs and Queues in the N1 Grid Engine
User manual.
-
ID –
The job ID provides a unique identity and also a method of accessing the Job
Details page.
-
Name –
The name of the job. Assigning names to jobs makes them more comprehensible
and easier to track than just relying on job IDs.
-
Queue –
The queue instance to which the job belongs.
-
CPU –
The amount of CPU time that the job has consumed.
-
Memory –
The amount of memory that the job is using.
-
Share –
The calculated share of the total system to which the job is entitled currently.
-
Run time –
The length of time the job has been running since
it was dispatched.
-
NTickets –
The normalized Ticket priority. You can use the Override component of the
ticket policy to increase the entitlement of a specific User, Project, or
Department. By assigning Override Tickets, you can modify the entitlement
without affecting any prioritization assignments of the Urgency policy.
-
NUrgency –
The normalized Urgency priority. Three factors contribute to this priority:
the deadline contribution, the wait-time contribution, and the resource requirement
contribution.
-
NPOSIX –
The normalized POSIX priority. An administrator can use this value to arbitrarily
increase the priority of certain jobs.
-
Task –
The currently executing task. Some jobs consist of a single task, in which
case the task ID is always 1. However, parallel jobs and array jobs each
consist of more than one task. The tasks are usually numbered in ascending
order starting with 1. Depending upon how the job was submitted, the
numbers might skip (for example, 1, 3, 5). For running jobs, each task runs separately
and so has its own configuration information, environment, and trace. For
details about the task, click the task number to display the Task Details page.
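Skipped task numbers such as 1, 3, 5 typically come from an array job that was submitted with a stepped task range. The following is a minimal sketch in Python, assuming a range written in the n-m:s form that the -t option of qsub accepts. The `expand_task_range` helper is a hypothetical name used only for illustration.

```python
def expand_task_range(spec):
    """Expand a task range 'n', 'n-m', or 'n-m:s' into a list of task IDs."""
    step = 1
    if ":" in spec:
        # Split off the optional step size.
        spec, step_text = spec.split(":", 1)
        step = int(step_text)
    if "-" in spec:
        first_text, last_text = spec.split("-", 1)
        first, last = int(first_text), int(last_text)
    else:
        # A single number describes a range with one task.
        first = last = int(spec)
    if step < 1 or first < 1 or last < first:
        raise ValueError("invalid task range: " + spec)
    return list(range(first, last + 1, step))


if __name__ == "__main__":
    print(expand_task_range("1-5:2"))  # a step of 2 skips every other ID
```

With a step of 1 (or no step at all) the task IDs are consecutive, which matches the usual ascending numbering described above.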
Note –
If the CPU usage or memory usage values are
blank, the usage information for that job has not yet been reported. Check
back at a later time to see if the usage is then reported.
For more information on the meaning of each column, see the qmon man page.
Normalized Priorities
The normalized ticket, urgency, and POSIX priorities correspond to the
three top-level policies used by the N1GE Scheduler to determine a job's dispatch
priority. Each policy calculates a factor that contributes to the overall priority.
In order for these three policy contributions to be added together in a meaningful
way, they are each normalized to a number between 0 and 1.
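The way normalized values can be combined is sketched below in Python. This is an illustration under stated assumptions, not the scheduler's code: the min-max scaling in `normalize`, the value returned when all jobs are equal, and the weights passed to `dispatch_priority` are all assumptions chosen for the example. The actual weighting is configured in the scheduler and described in the sge_priority man page.

```python
def normalize(value, lowest, highest):
    """Scale value into the range 0..1 relative to the lowest and highest
    values observed (an assumed min-max scheme)."""
    if highest == lowest:
        # Assumed convention for the case where every job has the same value.
        return 0.5
    return (value - lowest) / (highest - lowest)


def dispatch_priority(ntickets, nurgency, nposix,
                      w_ticket, w_urgency, w_posix):
    """Combine the three normalized contributions as a weighted sum."""
    return w_ticket * ntickets + w_urgency * nurgency + w_posix * nposix


if __name__ == "__main__":
    # Illustrative raw values and weights only.
    ntickets = normalize(500, 0, 1000)
    nurgency = normalize(200, 0, 1000)
    nposix = normalize(0, -1023, 1024)
    print(dispatch_priority(ntickets, nurgency, nposix,
                            w_ticket=0.01, w_urgency=0.1, w_posix=1.0))
```

Because every input lies between 0 and 1, the weights alone decide how strongly each policy influences the result.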
Checking Scheduling Policies
With the Job Allocation View tab, you can see information about the factors that make up the scheduling
policies that contribute to a job's dispatch priority. You can
use this view to determine whether your priority policies are actually in
effect and to troubleshoot the components that determine a job's overall
priority in the queue.
A job's priority is determined based on three policies:
-
Ticket policy
-
Custom (or POSIX) policy
-
Urgency policy
The first part of the equation, Tickets, shows you the calculations
that the scheduler makes in order to implement the entitlement-oriented
scheduling policy that has been configured. Tickets provide a window into
the inner workings of the scheduler. This feature helps you to verify
that the policy you configured is in fact being obeyed. It also provides
you with a means of diagnosing any problems or unexpected behavior you might
be seeing.
From a high level, the number of tickets assigned to a job
is directly proportional to the job's entitlement. The higher the number,
the greater the entitlement. Jobs with a large entitlement often have a high
priority. However, the overall priority is affected by the other two policies
as well, unless you have deliberately turned off the urgency and custom policies.
In that case, only the entitlement ("tickets") policy is active.
The second part of the priority equation is Custom (also called
POSIX) priority. An administrator can use this value to arbitrarily increase
the priority of certain jobs.
The third part of the priority equation, Urgency, accounts
for only the job's individual characteristics, not its owner. The urgency
value is derived from the sum of three contributions: the deadline contribution,
the wait-time contribution, and the resource requirement contribution.
For more detailed information on N1GE scheduling policies
and dispatch priority, see the sge_priority man page and Scheduler Policies for Job Prioritization in the Sun N1 Grid Engine 6 System (www.sun.com/blueprints/1005/819-4325.html).
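Because the urgency value is described as the sum of three contributions, you can cross-check the values reported for a job. The following is a minimal Python sketch. The `UrgencyParts` class, the `matches_reported` helper, the tolerance, and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class UrgencyParts:
    res: float   # resource requirement contribution
    wait: float  # wait-time contribution
    ddln: float  # deadline contribution

    def total(self) -> float:
        """Raw urgency as the sum of the three contributions."""
        return self.res + self.wait + self.ddln


def matches_reported(parts, reported_urgency, tolerance=0.5):
    """Check whether the summed parts agree with a reported urgency value.

    The tolerance is an assumption that allows for rounding in displayed
    values.
    """
    return abs(parts.total() - reported_urgency) <= tolerance


if __name__ == "__main__":
    parts = UrgencyParts(res=1000.0, wait=25.0, ddln=0.0)
    print(parts.total(), matches_reported(parts, 1025.0))
```

A mismatch between the sum and the reported value would suggest that the values were read at different times, since the wait-time contribution grows while a job is pending.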
Figure 5–3 Job Allocation View Tab
The Job Allocation View page displays the following information:
-
State –
The Job State is indicated by letters, specifically:
-
d (deletion) – Indicates that a job has been deleted (using qdel(1)).
-
r (running) – Indicates that a job is about to be executed
or is already executing.
-
R (restarted) – Indicates that the job was restarted. This
can be caused by a job migration or by one of the reasons described
in the -r section of the qsub(1) man page.
-
s (suspended) – Shows that an already running job has been
suspended (using qmod(1)).
-
S (suspended) – Shows that an already running job has been
suspended because the queue that it belongs to has been suspended.
-
t (transferring) – Indicates that a job is about to be executed
or is already executing.
-
T (threshold) – Shows that an already running job has been
suspended because at least one suspend threshold of the corresponding queue
was exceeded (see queue_conf(5)).
-
w (waiting) – Indicates that the job is suspended pending
the availability of a critical resource or specified condition.
See the qstat(1) man page for a detailed explanation
of these state conditions. For more information, you can also see Monitoring
and Controlling Jobs and Queues in the N1 Grid Engine
User manual.
-
ID –
The job ID provides a unique identity and also a method of accessing the Job
Details page.
-
Name –
The name of the job. Assigning names to jobs makes them more comprehensible
and easier to track than just relying on job IDs.
-
Tickets –
The total number of tickets for the job. The more tickets a job has assigned
to it, the higher that job's priority. This value is the “raw”
number before it is normalized.
-
Override –
The number of Override tickets. By assigning Override tickets, you can modify
the entitlement without affecting any prioritization assignments of the Urgency
policy.
-
Func –
The number of functional tickets.
-
Tree –
The number of share tree tickets. The share tree defines the long-term resource
entitlements of users/projects and of a hierarchy of arbitrary groups made
up of them.
-
Posix –
The POSIX priority. This feature provides a way to increase a job's priority.
This is the “raw” number before it is normalized.
-
Urgency –
The total urgency for the job made up of the deadline contribution, the wait-time
contribution, and the resource requirement contribution. This is the “raw”
number before it is normalized.
-
Res –
The resource contribution to the urgency.
-
Wait –
The waiting time contribution to the urgency.
-
Ddln –
The deadline contribution to the urgency.
-
Task –
The currently executing task. Some jobs consist of a single task, in which case the task ID
is always 1. However, parallel jobs and array jobs each consist of more than
one task. The tasks are usually numbered in ascending order starting with
1. Depending upon how the job was submitted, the numbers might skip
(for example, 1, 3, 5). For running jobs, each task runs separately and so has its own
configuration information, environment, and trace. For details about the task,
click the task number to display the Task Details page.
Note –
You can see the normalized values for Tickets,
POSIX, and Urgency using the Job Utilization View tab.
For more information on the meaning of each column, see the qmon man page.
Seeing Detailed Job Information
You can see complete details about a job by selecting the job ID on any of the job view tabs. The
Job Details page that appears presents this information in three tables: General,
Usage Details, and Schedule Details.
The General table provides details including various properties
related to the job's environment, resource requests, submit options, and so
forth.
Figure 5–4 Job Details Page
The Usage Details table shows the current resource utilization
for that job. If this information is not available, for example, because the
job started too recently or the job is still pending, then this table is empty.
For jobs with multiple tasks, the usage of each task appears on a separate
line.
The Schedule Details table shows the scheduling information
for that job.
Most of the fields on this page are self-explanatory. For
more information, see the qstat man page.
Seeing Detailed Task Information
The Task Details page contains four tables that provide detailed
information about the selected task. The same details page is used
for every task that appears in the three job view tabs. The information
on this page is useful for diagnosing jobs that are experiencing
problems.
Figure 5–5 Task Details Page
Each table on the Task Details page
corresponds to a different file in the job spool directory. For more information
about the contents of the job spool directory, see the N1 Grid Engine 6 Administration
manual. The tables are:
-
Task Summary
-
Configuration
-
Environment
-
Trace
Task Summary Table
The Task Summary table tells you basic information about the
job task.
-
Add Group ID –
Contains one line with the additional group ID
used to control and monitor the job.
-
PE Hostfile –
A file describing the host setup of a parallel job, which
contains each involved host, the queues the job was spooled into, and the
number of reserved slots (tasks) per host.
-
Error –
Contains an error message in the case of severe errors during the startup
of a job. For example, Execd cannot start shepherd.
-
Shepherd PID –
The process ID of the shepherd.
-
Job PID –
The process ID of the job (the shepherd's child process).
-
Exit Status –
The numeric exit code of the job in a single line.
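The PE Hostfile content can be summarized with a few lines of code. The following Python sketch assumes a layout of one line per host with whitespace-separated fields in the order host name, slot count, queue, with any further fields ignored. That layout, the `HostSlot` type, the `parse_pe_hostfile` helper, and the sample host and queue names are assumptions for illustration, so check them against the file shown for your own task.

```python
from typing import NamedTuple


class HostSlot(NamedTuple):
    host: str
    slots: int
    queue: str


def parse_pe_hostfile(text):
    """Parse PE hostfile text into one HostSlot per host line."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            # Skip blank or incomplete lines instead of guessing.
            continue
        entries.append(HostSlot(fields[0], int(fields[1]), fields[2]))
    return entries


if __name__ == "__main__":
    # Hypothetical sample content.
    sample = (
        "node01 4 all.q@node01 UNDEFINED\n"
        "node02 2 all.q@node02 UNDEFINED\n"
    )
    entries = parse_pe_hostfile(sample)
    print(sum(entry.slots for entry in entries), "slots on",
          len(entries), "hosts")
```

Summing the slot counts gives the number of tasks reserved for the parallel job across all of its hosts.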