Chapter 1 Grid Engine Management Module
This manual contains information about accessing and using GEMM as well as managing
N1GE versions.
Using GEMM
The Grid Engine Management Module (GEMM) for the Sun™ Control Station
allows you to install and set up a grid. It also allows you to monitor the performance
of the hosts in the grid. This document explains the features and services available
through the Grid Engine control module.
GEMM supports the following operating systems and hardware platforms:
-
Solaris 9 and 10 on SPARC, x86, and x64 hardware platforms
-
Red Hat Linux 7.3, Red Hat Linux 8.0, and Red Hat Linux 9 on x86 hardware platforms
-
Fedora Linux 1, Fedora Linux 2, and Fedora Linux 3 on x86 and x64 hardware
platforms
-
Red Hat Linux 2.1 WS, Red Hat Linux 2.1 AS, and Red Hat Linux 2.1 ES on x86 hardware
platforms
-
Red Hat Linux 3 WS, Red Hat Linux 3 AS, and Red Hat Linux 3 ES on x86 and x64
hardware platforms
-
SuSE Linux 9.0 on x86 and x64 hardware platforms
-
JDS 1, JDS 2 on x86 hardware
The GEMM module allows you to:
-
manage different versions of N1 Grid Engine
-
install a master host on the grid
-
install additional compute hosts on the grid
-
monitor and diagnose the performance of the grid
-
uninstall the Grid Engine components from selected compute and access
hosts
-
uninstall the Grid Engine components from the master host and all
compute and access hosts
-
check the status of Grid Engine (from Station Settings > Active Monitor
on the SCS)
-
configure the monitoring settings for the grid
The following sections describe each of these functions.
Note –
In Grid Engine terminology, compute hosts are called execution hosts.
Access hosts are called submit hosts.
Installing GEMM
GEMM is part of the Grid Engine Update 4 distribution and is included in the Gemm/tar/n1ge-6_0u4-gemm.tar.gz package. Use the following steps to install
GEMM.
-
Unpack the GEMM archive file.
-
Add GEMM to a Sun Control Station 2.2 system. See the Sun Control
Station documentation for instructions on adding a new management module.
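The unpacking step can be done with any tool that reads .tar.gz archives. The following is a minimal sketch in Python, not part of GEMM itself; the function name and the destination directory are hypothetical, and only the archive path comes from the text above.

```python
import tarfile
from pathlib import Path


def unpack_gemm_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        # Use the safer "data" extraction filter where this Python supports it.
        extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest, **extra)
    return names
```

For example, unpack_gemm_archive("Gemm/tar/n1ge-6_0u4-gemm.tar.gz", "gemm-unpacked") extracts the module files into a directory named gemm-unpacked (an illustrative name). You then add the unpacked module to the Sun Control Station as described in its documentation.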
Accessing GEMM
You access the GEMM features by clicking on the Monitor menu item on the Sun
Control Station main screen as shown in the following figure.
Figure 1–1 Monitor Main Page
Note –
In most of the short procedures in this chapter, the first step is to
click the Grid Engine item in the left menu bar and the second step is to click on
a sub-menu item. To reduce the number of steps in each procedure, the menu commands
are grouped together and shown in Initial Caps. Right-angle brackets separate the
individual items. For example, select Grid Engine > Settings means
to click Grid Engine in the left menu bar and then click the Settings sub-menu item.
Task Progress Dialog
When you launch a task (for example, when installing a master host or uninstalling
a host), a Task Progress dialog appears in the user interface (UI). This dialog has
a Status field indicating the current status of the task and a progress bar. When
the progress bar displays 100%, the task has completed.
Figure 1–2 Task Progress Dialog
If you want to perform another task in the UI while the current task is underway,
you can put the Task Progress dialog in the background. Simply click the Run Task
In Background button located below the progress bar.
To return to the Task Progress dialog, select Administration > Tasks on the left. The Task table appears. If the task is still underway, a status
message displays in the Duration column. Click on the progress-bar icon
in this column to re-display the Task Progress dialog for this task.
Once the task is complete and the progress bar displays 100%, two buttons appear
below the Task Progress dialog: Done and View Events.
-
To view the list of events associated with the completed task, click View Events. The Events For <Task> table appears.
If you then click the up-arrow icon in the top-right corner,
the Tasks table appears.
-
To return to the previous screen, click Done.
Managing Versions
GEMM allows you to upload one or more versions of N1 Grid Engine, and choose
which one to deploy on the grid. To do these tasks, click the Versions menu
item to display the Version Management page.
Figure 1–3 Version List Page
This page is where you can upload, modify and manage different versions of N1
Grid Engine software.
Note –
You can only deploy one version at any given time. If you wish to deploy
another version, you must first uninstall the grid from all the hosts.
Versions List Icons
In the Version list, each version has three icons:
-
The Minus icon removes a version and all its files.
-
The Modify icon lets you rename a version.
-
The Inspect icon lets you add or remove individual files from a version.
Adding Grid Engine Versions
The Version Management main page displays a list of versions currently available
on the SCS server. Initially, no versions are defined.
To define a version
Steps
-
Click the Add button, which opens a dialog where you name the version.
The version name can be anything, as long as it does not contain any
whitespace, or any punctuation characters other than “-” and “_”.
-
After you name the version, click the Submit button.
Once you submit the name, the Version list displays again with the newly created version
present. You can add more versions at any time.
Figure 1–4 Add Version Dialog
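If you script the creation of versions, you can check a candidate name before submitting it. The following sketch reads the naming rule as "no whitespace, and no punctuation other than - and _"; the function name is hypothetical, and GEMM's own validation may differ in detail.

```python
import string

# The only punctuation characters the naming rule permits.
ALLOWED_PUNCTUATION = {"-", "_"}


def is_valid_version_name(name):
    """Return True if name has no whitespace and no punctuation except - and _."""
    if not name:
        return False
    for ch in name:
        if ch.isspace():
            return False
        if ch in string.punctuation and ch not in ALLOWED_PUNCTUATION:
            return False
    return True
```

Under this reading, is_valid_version_name("n1ge-6_0u4") returns True, while names such as "n1ge 6" or "n1ge.6" are rejected.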
Adding Files to a Version
You must first add files to a version before you deploy it to the grid. Adding
files consists of adding all N1 Grid Engine package files that are part of the given
version.
Package File Criteria
The following criteria apply to the package files:
-
GEMM requires N1 Grid Engine 6 Update 4 or later.
-
All package files must be in .tar.gz format.
Although N1 Grid Engine currently is made available in .pkg format
for Solaris, the .pkg format files cannot be used by GEMM.
-
For any given version, there must be the “common” package
for that version, as well as all the “bin” (binary) packages that support
the kinds of hosts in your grid. For example, if your grid consists of Solaris 9 SPARC
hosts as well as Solaris 10 x64 hosts, then you must include these files in the version:
-
<name>-bin-solaris-sparcv9.tar.gz
-
<name>-bin-solaris-x64.tar.gz
-
<name>-common.tar.gz
where <name> is the name given to that version, for example n1ge-6_0.
-
In any given version, you cannot mix different update levels of N1
Grid Engine. All packages associated with a version must belong to the same update
level, for example n1ge-6_0. The only exception is when you
deploy N1GE 6 patches, which is described later in this chapter.
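The package criteria above can be sketched as a completeness check. This is an illustrative helper, not a GEMM tool: the function names are hypothetical, and the architecture strings are the two shown in the file list above.

```python
def required_packages(name, architectures):
    """List the package files a version needs for the given host architectures."""
    files = [f"{name}-bin-{arch}.tar.gz" for arch in architectures]
    files.append(f"{name}-common.tar.gz")
    return files


def missing_packages(name, architectures, uploaded_files):
    """Return the required package files that have not been uploaded yet."""
    uploaded = set(uploaded_files)
    return [f for f in required_packages(name, architectures) if f not in uploaded]
```

For a grid with SPARC and x64 Solaris hosts, required_packages("n1ge-6_0", ["solaris-sparcv9", "solaris-x64"]) yields the three file names listed above. An empty result from missing_packages suggests the version is ready to deploy.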
To Add Files
Steps
-
Click the Inspect icon in the version list. A list of the files
currently contained in that version appears.
-
Click the Add button to open a dialog box where you can load version
files one at a time. You can upload files either from the local machine, using the File
browser, or from a remote URL.
Figure 1–5 Add Files Dialog
Note –
When you upload files from a remote URL, you can only specify a URL which
can be accessed from the SCS server directly without going through a proxy server.
You cannot specify a proxy server when using the Version Management web dialog. Please
see the documentation for the command-line equivalent of Version Management, gemmVersionMgmt.pl, to learn how to upload files using a web proxy.
To Remove Files
Steps
-
Select a file from the list of files.
-
Click on the Minus icon for that file.
Deploying a Patched Version of N1GE 6 With GEMM
N1GE software updates are made available through the mechanism of patch files.
You cannot use an N1GE patch alone; you must use it in conjunction with a full distribution
of N1GE software. When you install a patch, it replaces various files in the existing
full version.
There are two ways you can install N1GE patches:
-
Install patch files on a live N1GE grid already running an existing
full installation of N1GE 6. This procedure is described in the patch documentation
but is not supported by GEMM.
-
Install patch files at the same time as you install a fresh installation
of an original, full, N1GE software distribution. You can use this technique when
you are creating a new grid and want to install it with the latest N1GE updates. You
also can use this approach when you want to use N1GE with the latest updates and don't
mind getting rid of your old setup entirely (without worrying about saving old configurations
or maintaining jobs currently in the systems). GEMM can handle this procedure automatically
as described here.
-
Create a new version in the Version Manager of GEMM. Populate this
version with N1GE files from an original full version, just as if you were going to
deploy this version.
-
Get the desired patch files. When Sun Microsystems creates and releases
patch updates, these files are made available on the SunSolve website (http://sunsolve.sun.com). For each patch release, there is
one patch for the N1GE "common" package, as well as one patch for each architecture-specific
package. Get all the patch files necessary for your particular environment.
Patch files are distributed in both .pkg and .tar.gz
formats. Make sure to obtain only the .tar.gz form of the patches. These patch files are themselves contained in a ZIP
archive; be sure to unzip the archive to extract the .tar.gz files.
-
Put these .tar.gz patch files into the previously
created version, using either the Version Manager web UI or the command line.
You can now use GEMM to deploy this version onto any Grid host, just as with an original,
unpatched version of N1GE.
Be sure that only one patch level of N1GE is
deployed across the grid. Take care to avoid mixing different patch
levels in the same distribution. Also, do not use patch files for only some
of the architecture-specific packages required for your environment; patch all of them.
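Extracting the .tar.gz patch files from the downloaded ZIP archive can be sketched as follows. The function name and paths are hypothetical, and the sketch assumes the archive may also contain other files (such as .pkg packages) that should be skipped.

```python
import zipfile
from pathlib import Path


def extract_patch_tarballs(zip_path, dest_dir):
    """Copy only the .tar.gz members of a ZIP archive into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.endswith(".tar.gz"):
                continue
            # Flatten any directory structure inside the ZIP archive.
            target = dest / Path(info.filename).name
            target.write_bytes(archive.read(info))
            extracted.append(target.name)
    return sorted(extracted)
```

The returned list names the files that you then add to the version, one patch for the "common" package and one for each architecture in your grid.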
Installing Grid Engine Hosts
To set up a compute grid, you must first select one of the managed hosts to
be the master host. You can then set up additional compute hosts.
Note –
After you install the N1GE6 software on a server and add hosts to the
grid using GEMM, all the N1GE daemons on the hosts will be running, but you must
submit jobs separately.
For documentation on the N1GE6 software, refer to the user manuals at the following
URL:
http://docs.sun.com/db/coll/1017.3
Installing a Master Host
You can configure only one managed host to be the master host. If you have already
configured a master and you select the Install Master sub-menu item, a message appears
that you have already configured a master for the compute grid.
The Grid Engine module deploys only a dedicated N1GE6 master
host. Unless you plan to have relatively low job throughput on your grid, you should
not have the N1GE6 master host also act as a compute host. To add a host as a master
host in the compute grid, you must first import the host into the SCS 2.2 framework.
For more information, see “About Adding Managed Hosts” in the SCS 2.2
Release Notes.
Note –
The SCS server cannot serve as an N1 Grid Engine (N1GE) Master Host or
Compute Host, since only SCS clients can have those roles. An SCS server cannot also
be an SCS client at the same time. Thus, the SCS server has to be a different host
than either the N1GE Master Host or Compute Hosts.
To install a master host for the grid:
Steps
-
Select Grid Engine > Install Master.
The selector appears, displaying the list of managed hosts; see Installing Grid Engine Hosts.
-
Click to highlight the managed host that you want to configure as the
master host in the compute grid.
-
Pick a version from the list presented.
The version you pick at this step will be installed on the master host
as well as on all the other hosts in the grid.
-
Click Install in the bottom right corner.
The Task Progress dialog appears.
Figure 1–6 Install Master Dialog
Installing Compute and Access Hosts
Once you have configured one of the managed hosts as the master host, you can
add additional hosts to act as compute hosts or access hosts
in the grid.
Note –
To add a host as a compute host in the grid, you must first import the
host into the Sun Control Station framework. For more information, see “About
Adding Managed Hosts” in the SCS 2.2 Product Notes.
Note –
Before you can add a compute host to a grid, you must first designate
a master host. If you have not yet designated a master host, the system instructs
you to do so. For more information, see Installing Grid Engine Hosts.
Steps
-
Select Grid Engine > Install Compute Host.
The selector appears, displaying the list of managed hosts; see the previous
figure.
-
Click to highlight one or more hosts. You can also click Select All at the top to choose all hosts in the list.
You can install the selected hosts as either compute hosts or access hosts. Click the
corresponding button at the bottom of the page, labelled “Install Compute Hosts” or “Install
Access Hosts”.
Figure 1–7 Install Compute Hosts Dialog
The Task Progress dialog appears. When the installation completes, a new dialog
box appears which allows you to either finish the installation or view the installation
events. If you choose View Events, a dialog similar
to the following appears.
Figure 1–8 View Events Dialog
-
When you are finished installing hosts, click Done.
Monitoring the Grid
When you click the Monitor Grid menu item, a page with a high-level overview
of the state of the grid appears.
This page has tables that allow you to:
Buttons on the main page let you go to pages where you can:
-
View Job Details
-
View Queue Details
-
View Host Details
-
Examine Daemon Log files
Also available from the SCS menu is the ability to quickly see the state of
the Grid by choosing Station Settings > Active Monitor.
Viewing Summary Status
Figure 1–9 Summary Status Table
The Summary Status table shows the total number of jobs in various states (pending,
running, suspended, and so forth). It also shows the load averaged across all compute
hosts and the total amount of used and installed memory summed over all compute hosts.
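The aggregation described above can be sketched as follows. This is only an illustration of the arithmetic: the field names load, mem_used, and mem_total are hypothetical and do not reflect GEMM's internal data model.

```python
def summarize_hosts(hosts):
    """Average the load and sum the memory figures over all compute hosts."""
    if not hosts:
        return {"avg_load": 0.0, "mem_used": 0, "mem_total": 0}
    return {
        "avg_load": sum(h["load"] for h in hosts) / len(hosts),
        "mem_used": sum(h["mem_used"] for h in hosts),
        "mem_total": sum(h["mem_total"] for h in hosts),
    }
```

Note that the load is an average over hosts, while the memory values are totals, so adding a host changes the two kinds of figures differently.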
Updating Data
The subheading of this table contains a timestamp for when the data was obtained.
By default, most monitoring data is automatically refreshed every minute. To display
the most up-to-date database information in the tables, click the Monitor Grid menu
item again. You can also reload the browser window. If the monitoring is not working
properly for any reason, the subheading displays a warning and displays the timestamp
for when the data was most recently obtained. This timestamp applies to all monitoring
information displayed in GEMM, not just the Summary Status table.
Above this table is the Update button. Clicking this button retrieves the data
immediately instead of waiting for the next one-minute interval. A progress bar shows
the progress of the update. When the update completes, click the Done button to return
to the main Monitor Grid page with the new data and updated timestamp.
If an update of the monitor is already in progress when you click the button,
a message indicates this situation. As soon as the update in progress completes, the
Update button will again be available to force a new update.
Viewing Jobs
You access the Jobs details page by clicking the Jobs button in the Summary
Status table on the main Monitor page.
This page has a table which shows a summary of all current jobs in the system
including jobs which are pending, running, suspended, held, or in an error state.
Completed jobs are not listed. The top row of three buttons lets you see the list
of jobs according to three different views: Overview, Utilization, and Allocation.
The initial view is always the Overview. Clicking any of the other buttons displays
the other corresponding views. In all views, the back button on the table leads back
to the main page.
Also present in all views at the bottom of the frame is the Filter, which you
can use to limit the jobs displayed by providing configured criteria. Finally, the
three buttons corresponding to the three different views are always shown at the top
of each view, allowing you to move directly among the three views.
Using the Overview View
The Overview view shows an overall summary of the jobs.
Figure 1–10 Jobs Overview Page
The columns in this table provide the:
-
Job state, indicated by one or more letters plus a colored circle
and icon
-
Job ID
-
Job name
-
User who submitted the job
-
Project under which the job was submitted
-
Department of the submitter
-
Priority of the job
-
Job time, either the time spent pending, or for running jobs the time
spent running
-
Job task ID; for pending jobs, all task IDs are grouped together
Interpreting the Data
The icon scheme for the job state is:
-
A Gray Icon means the job is pending
-
A Green Icon means the job is running
-
A Yellow icon indicates the job is suspended
-
A Red Icon indicates that the job is in an error state
The letters shown for the job state are the same letters used by N1 Grid Engine to
indicate the job state when you run the qstat command. For more information, see the N1 Grid Engine Administration manual.
Sorting Rows
Jobs display ten rows at a time. You can see the entire list by using the pagination
controls at the bottom of the table. By default, rows are displayed numerically by
job ID, but you can use any column whose header is white to change the ordering of
the rows. Clicking on a column header sorts the rows according to the values in that
column. Clicking again on the column header reverses the sort. The sorting is preserved
across pages if you click on a pagination button.
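The interaction between sorting and pagination can be sketched as follows: the whole list is ordered first, and each page is then a ten-row slice of that ordering, which is why the sort carries across pages. The function and column names are hypothetical.

```python
PAGE_SIZE = 10  # Rows shown per page, as described above.


def sorted_page(rows, column="job_id", descending=False, page=1, page_size=PAGE_SIZE):
    """Sort all rows by one column, then return the requested page of rows."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=descending)
    start = (page - 1) * page_size
    return ordered[start:start + page_size]
```

Calling the function again with descending=True models a second click on the same column header, which reverses the sort.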
Job Details
Clicking the Inspect icon next to the ID of each job retrieves details about
the job. A progress bar indicates the progress of this process. When the Done button
appears, clicking it leads to a page with the details displayed for the chosen job.
These details appear in three tables.
Figure 1–11 Job Details Page
-
The first table shows the job details, including various properties
related to the job's environment, resource requests, submit options, and so forth.
-
The second table shows the current resource utilization for that job.
If this information is not available, for example, because the job started too recently
or the job is still pending, then this table is empty. For jobs with multiple tasks,
the usage of each task appears on a separate line.
-
The third table shows the scheduling information for that job.
The information displayed in these three tables corresponds directly to the
output from the N1 Grid Engine 6 qstat -j command. For more information
on job details, see the N1 Grid Engine 6 Administration manual.
Clicking the Back button of the first table returns you to the Overview page.
Using the Utilization View
You access the Utilization view of the job by clicking the Utilization button
on the Jobs page.
Figure 1–12 Jobs Utilization View
Unlike the Overview view, only running and suspended jobs appear. In the Utilization
view, the columns are the:
-
Job state, indicated by a colored circle and icon
-
Job ID
-
Job Name
-
Queue instance where the job is running
-
CPU utilization of the job
-
Memory utilization of the job
-
Calculated share
-
Run time
-
Normalized Ticket priority
-
Normalized Urgency priority
-
Normalized POSIX priority
-
Job task ID; tasks belonging to the same job are never grouped
Note –
If the CPU usage or memory usage values are blank, the usage information
for that job has not yet been reported. Check back at a later time to see if the usage
is then reported.
The description for the Overview page regarding the meaning of the icons for
the job state is the same for this view, except that no letters are shown. The pagination
of the table and the sorting based upon different columns all apply similarly to the
Utilization View.
Job Diagnostics
An Inspect icon for the job Task ID is displayed for all jobs above the final
column. Clicking this icon retrieves the current diagnostic information for that job.
This diagnostic information corresponds to the data found in the job spool files in
the job's spool directory. A progress bar indicates the progress of this process. When
the Done button appears, clicking it leads to a page with the status information displayed
for the chosen job as in the following figure.
Figure 1–13 Job Diagnostic Details
Note –
You can only obtain job diagnostic information if the job is running on
a compute host that was deployed by GEMM. If the host on which the job is running
was not deployed by GEMM, then clicking the Inspect icon results in an error message;
clicking Done leads back to the Utilization view.
The Job Diagnostic details given in these tables include:
-
addgrpid
-
config
-
environment
-
error
-
exit status
-
job_pid
-
pe_hostfile
-
pid
-
trace
Interpreting the Tables
Each table corresponds to a different file from the job spool directory. For more
information on the contents of the job spool directory, see the N1 Grid Engine 6
Administration manual.
Clicking the back button of the addgrpid table returns
you to the Utilization view.
Note –
If a job has already completed by the time you click the Inspect button,
or if the job completes during the information retrieval process, the information
is lost and cannot be displayed. In this case, the progress bar will indicate a failure
and clicking on the Done button leads back to the Utilization view.
Using the Allocation View
Clicking the Allocation button switches to the Allocation view of the jobs.
Figure 1–14 Jobs Allocation View
In this view, information is presented for all jobs and the columns provide
details for the:
-
Job state, indicated by a colored circle and icon
-
Job ID
-
Job name
-
Total number of tickets for the job
-
Number of override tickets
-
Number of functional tickets
-
Number of share tree tickets
-
POSIX priority
-
Total urgency for the job
-
Resource contribution to the urgency
-
Deadline contribution to the urgency
-
Waiting time contribution to the urgency
The description of the icons for the job state on the Overview page applies here also,
except that no letters are shown. The pagination of the table and the sorting based upon
different columns apply similarly to the Allocation View.
For more information on the meaning of each column, see the N1 Grid Engine
Administration manual.
Filtering Jobs
In each of the three views (Overview, Utilization, and Allocation), the Filter
option appears below the job table.
Figure 1–15 Filter Dialog
You use the filter to limit the jobs displayed to those matching a specified
search condition. The filter lets you choose a column on which to filter, a search
type to use, and a value on which to search.
You select the column and search type from a drop-down table, while you type
the value into a text entry box. The drop-down table for column changes with each
view depending on which columns are being displayed. The type of search can be one
of: equals, not equals, less than, less than or equal to, greater than, and greater
than or equal to.
You can define up to three filters at one time; the effects of multiple filters
are combined to provide the final result. After you set up the desired filter,
click the Filter button to redisplay the current view with the filter applied. Pagination
is still active and maintains the filter across pages. Clicking the Clear button
restores the unfiltered view.
The following figure shows you how a sorted jobs utilization page would look.
Figure 1–16 Filter Sorted Page
Note –
When you choose the Job State as a search column, the search value is
compared against the job status letter code as displayed in the Overview view, even
though these letters are not displayed for the Utilization and Allocation view.
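The filter behavior can be sketched as follows. This sketch assumes that multiple filters are combined with a logical AND, so a job is shown only if it matches every filter; the text does not state the combination rule explicitly. The function and column names are hypothetical.

```python
import operator

# The six search types offered by the filter.
SEARCH_TYPES = {
    "equals": operator.eq,
    "not equals": operator.ne,
    "less than": operator.lt,
    "less than or equal to": operator.le,
    "greater than": operator.gt,
    "greater than or equal to": operator.ge,
}
MAX_FILTERS = 3


def apply_filters(rows, filters):
    """Keep rows matching every (column, search_type, value) filter (assumed AND)."""
    if len(filters) > MAX_FILTERS:
        raise ValueError("at most three filters can be defined at one time")

    def matches(row):
        return all(SEARCH_TYPES[kind](row[col], value) for col, kind, value in filters)

    return [row for row in rows if matches(row)]
```

An empty filter list returns every row, which corresponds to clicking the Clear button.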
Viewing Queue Details
You access the Queue Details page by clicking the Queue button in the Summary
Status table on the main Monitor page.
Figure 1–17 Queue Details Page
Note that this table provides information on all queue instances on the currently
selected master host, including instances on hosts that were not added by the GEMM framework.
The information appears in groups of ten rows at a time, with the ability to page
back and forth between the rows.
Interpreting Data
For each queue instance, there are columns for the Queue instance name, the
status, the total number of slots and number of used slots. The status is indicated
by a colored circle and icon similar to the Job Alerts previously described. The only
additional feature is a green icon to indicate queue instances that have no alert
conditions. Clicking the Back icon in the table header returns you to the Monitor
Grid main page.
Sorting Data
By default, rows display alphabetically by queue instance name but you can use
any column whose header is written in white to change the ordering of the rows. Clicking
on a column header sorts the rows according to the values in that column; clicking
again on the column header reverses the sort. The sorting is preserved across pages
if you click a pagination button.
Viewing Additional Details
The final column of each row has an Inspect icon. Clicking on this icon displays
a table with the full details for that queue instance. The final entry in this table
shows the timestamp when the data was obtained. For information on the meaning of
the other table entries, consult the N1 Grid Engine 6 Administration manual. Clicking
on the Back icon for this table returns you to the Queue Details page.
Viewing Host Details
You access the Host Details page by clicking the Host button in the Summary
Status table on the main Monitor page.
Figure 1–18 Host Details View
This page displays a table with the state of all the compute hosts that are
members of the grid. The title of the table also indicates which host is currently
chosen as the Proxy Host.
Note that this table has information on all compute hosts reporting to the currently-chosen
master host, including those that were not added by the GEMM framework.
Interpreting Data
The information appears in groups of ten rows at a time, with the ability to
page back and forth between the rows. For each host, there are columns for the Hostname,
Architecture, Load per CPU, Memory in use, Total Memory, and Swap Space in use. The
status is also indicated by a colored circle and icon similar to the Host Alerts table
with an additional green icon to indicate hosts that have no alert conditions. Clicking
the Back icon in the table header returns you to the Monitor Grid main page.
Sorting Data
By default, rows display alphabetically but you can use any column whose header
is white to change the ordering of the rows. Clicking on a column header sorts the
rows according to the values in that column; clicking again on the column header reverses
the sort. The sorting is preserved across pages if you click a pagination button.
Seeing Additional Details
The final column of each row has an Inspect icon. Clicking on this icon displays
a table where full details for that host appear. The final entry in this table shows
the timestamp when the data was obtained. For information on the meaning of the other
table entries, please consult the N1 Grid Engine 6 Administration manual. Clicking
the Back icon on this table returns you to the Host Details page.
Viewing Grid Engine Daemon Logs
You access the Grid Engine Daemon Logs page by clicking the Daemons button
in the Summary Status table on the main Monitor page.
Figure 1–19 Grid Engine Daemons Log View
The Logs page contains a table which displays the names of all compute hosts
that were deployed by GEMM, plus the name of the master host if it was deployed by GEMM.
Two additional columns are also shown. The first column, labeled Master, contains
an Inspect icon for the master host. The second column, labeled execd, contains an
Inspect icon for each compute host. Clicking these icons lets you retrieve the actual
log message files.
Note –
If the master host was not deployed by GEMM, no host in the table will
have the Inspect icon for the Qmaster column. Similarly, if there are compute hosts
that were not deployed by GEMM, these hosts will not appear in this table. Clicking
the Back icon in the table header returns you to the Monitor Grid main page.
Retrieving Log Message Files
Figure 1–20 Example Log Message File
Clicking an Inspect icon retrieves and displays the qmaster or execd daemon messages file for the corresponding host. A progress bar indicates
the progress of this process. When the Done button appears, clicking it displays the
contents of the chosen messages file with each line appearing in its own row in a
table. Rows display 25 at a time with the ability to page through them.
The rows display in reverse chronological order, so that the most recent message
appears at the top of the list. Clicking on the Back icon for this table returns you
to the Grid Engine Daemon Logs page. For more information on daemon messages, see
the N1 Grid Engine 6 Administration manual.
Interpreting Messages
The first column of this table shows a colored circle and icon to indicate the
severity of that message. A green circle indicates a message of type Info. A yellow
circle indicates a message of type Warning or Critical. A red circle indicates a message
of type Error. The second column shows the time stamp for the message and the third
column shows the actual text of the message.
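The mapping from message type to color, together with the reverse chronological ordering, can be sketched as follows. The sketch assumes a pipe-delimited messages line of the form date time|daemon|host|type|text, with the single-letter type codes I, W, C, and E; verify both assumptions against the messages files on your own hosts. The function name is hypothetical.

```python
from datetime import datetime

# Colors per message type, following the description above.
# The single-letter codes are an assumption about the file format.
SEVERITY_COLORS = {"I": "green", "W": "yellow", "C": "yellow", "E": "red"}


def parse_messages(text):
    """Return (color, timestamp, message) rows, most recent message first."""
    rows = []
    for line in text.splitlines():
        parts = line.split("|", 4)
        if len(parts) != 5:
            continue  # Skip lines that do not match the assumed layout.
        stamp, _daemon, _host, level, message = parts
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
        rows.append((SEVERITY_COLORS.get(level.strip(), "unknown"), when, message))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

Splitting with a maximum of four separators keeps any "|" characters that appear inside the message text itself.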
Viewing Cluster Queues
Figure 1–21 Cluster Queues Page
This table shows a summary of the state of all the cluster queues configured
on the grid, indicating the numbers of slots in various states. For information on
cluster queues, see the N1GE 6 Administration Guide.
Viewing Host Alerts
Figure 1–22 Host Alerts Page
This table shows all hosts where the threshold for either the load or memory
has been crossed. There are two types of alerts each indicated by a different colored
circle and icon.
A warning alert is indicated by a yellow icon. This alert displays if the load
goes above the load warning threshold or the memory goes below the memory warning
threshold.
A critical alert is indicated by a red icon. This alert displays if the load
goes above the load critical threshold or the memory goes below the memory critical
threshold.
The Host Alerts table is empty if no hosts have crossed any threshold. You configure
the values for the load and memory warning and critical thresholds on the Settings
page.
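The alert rules above can be sketched as a small classification function. It uses the units given for the Settings page (system load divided by the number of CPUs, and megabytes of free virtual memory). The function name and threshold keys are hypothetical, and the choice to check the critical thresholds first is an assumption.

```python
def host_alert(load, cpu_count, free_memory_mb, thresholds):
    """Return "critical", "warning", or None for one host."""
    load_per_cpu = load / cpu_count
    # Assumption: a critical condition takes precedence over a warning.
    if (load_per_cpu > thresholds["load_critical"]
            or free_memory_mb < thresholds["memory_critical"]):
        return "critical"  # Shown with a red icon.
    if (load_per_cpu > thresholds["load_warning"]
            or free_memory_mb < thresholds["memory_warning"]):
        return "warning"  # Shown with a yellow icon.
    return None  # The host does not appear in the Host Alerts table.
```

Because the load is scaled by the CPU count, a four-CPU host with a load of 6.0 is compared against the thresholds as 1.5.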
Viewing Queue Alerts
Figure 1–23 Queue Alerts Page
This table shows queue instances that are not in the usual running state. There
are three types of alerts each indicated by a different colored circle and icon.
-
A red icon indicates the queue instance is in either the Unknown or
Error state.
-
A yellow icon indicates the queue instance is in either an Alarm or
Suspended state.
-
A gray icon indicates the queue instance is in a Disabled state.
The exact state of the queue instance is also given in the Status column. For
more information on queue instance states, see the N1 Grid Engine 6 Administration
Manual.
Viewing Job Alerts
Figure 1–24 Job Alerts Page
This table displays grid jobs which are not in the usual running state. There
are two types of alerts each indicated by a different colored circle and icon.
-
A red icon indicates the job is in an Error state.
-
A yellow icon indicates the job's pending time has exceeded the pending
time threshold.
You configure the values for the pending time threshold
on the Settings page. For more information on job states, see the N1 Grid Engine 6
Administration manual.
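The two job alert types can be sketched as follows. The sketch assumes that an error state shows up as the letter E in the qstat-style state code and that an error takes precedence over a pending-time alert; the threshold is expressed in hours, as on the Settings page. The function name is hypothetical.

```python
def job_alert(state_letters, pending_hours, max_pending_hours):
    """Return "red", "yellow", or None for one job.

    pending_hours is None for jobs that are not pending.
    """
    # Assumption: "E" in the state code marks a job in an error state.
    if "E" in state_letters:
        return "red"
    if pending_hours is not None and pending_hours > max_pending_hours:
        return "yellow"
    return None  # The job does not appear in the Job Alerts table.
```

With a Maximum Job Pending Time of 24 hours, a job that has been pending for 30 hours is flagged yellow, while one pending for 2 hours is not listed.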
Using Grid Active Monitor
You can quickly see the status of the Grid by using the SCS Active Monitor feature.
Choose Station Settings > Active Monitor and scroll down the page to the Base Services
table shown in the following figure.
Figure 1–25 Grid Active Monitor Table
When the status of the grid changes due to an event like a queue alert, the
button next to the Grid Engine entry changes color in the following way:
-
Green: N1GE is up and running fine.
-
Yellow: the SCS cannot contact the proxy host or cannot obtain monitoring
information from it but it is still possible that the master is running.
-
Red: the proxy host indicates that the master is down.
-
Grey: N1GE is not installed anywhere.
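The color rules above amount to a small decision procedure. The following sketch assumes three boolean inputs whose names are hypothetical; it only restates the list and does not describe how the SCS obtains these values.

```python
def grid_status_color(installed: bool, proxy_reachable: bool,
                      master_up: bool) -> str:
    """Return the Active Monitor button color for the Grid Engine entry."""
    if not installed:
        # N1GE is not installed anywhere.
        return "grey"
    if not proxy_reachable:
        # The SCS cannot contact the proxy host or get monitoring data;
        # the master may still be running.
        return "yellow"
    # The proxy host answered, so its report about the master decides.
    return "green" if master_up else "red"
```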
Viewing Settings
When you click the Settings menu item, a table displays all the configurable
settings available in GEMM.
Figure 1–26 Settings Page
The parameters are grouped in four categories: Monitor Alert settings, N1GE
settings, NFS mount settings and Proxy settings.
Changing Monitor Alert Settings
These settings affect the display of alerts in the GEMM Monitor. All these parameters
must be set using decimal numbers. Any other type of input produces a formatting error.
Load Warning -- You use this parameter to specify the load warning threshold.
If this threshold is exceeded, a load warning alert appears in the Monitor. The value
is in terms of system load, as reported by the OS, divided by the number of CPUs.
Note –
Certain microprocessors with special features such as hyperthreading may
be registered as having more than one CPU per physical CPU socket, depending upon
factors such as the BIOS or PROM configuration.
Load Critical -- You use this parameter to specify the load critical threshold.
If this threshold is exceeded, a load critical alert appears in the Monitor. Similar
to the Load Warning parameter, you set this parameter in terms of the system load
scaled by number of CPUs.
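To make the scaling concrete, the sketch below divides the reported system load by the CPU count and compares the result with the two load thresholds. The function names and the threshold values used in the usage note are illustrative assumptions, not GEMM defaults.

```python
from typing import Optional


def scaled_load(system_load: float, cpu_count: int) -> float:
    """Return the system load divided by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return system_load / cpu_count


def load_alert(system_load: float, cpu_count: int,
               warning: float, critical: float) -> Optional[str]:
    """Classify a host as 'critical', 'warning', or None (no alert)."""
    value = scaled_load(system_load, cpu_count)
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return None
```

For example, with a warning threshold of 1.0 and a critical threshold of 2.0, a four-CPU host reporting a system load of 6.0 has a scaled load of 1.5 and would raise a load warning alert.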
Memory Warning -- You use this parameter to set the memory warning threshold.
If the value drops below this threshold, a memory warning alert appears in the Monitor.
You set the parameter value in terms of megabytes of free virtual memory.
Memory Critical -- You use this parameter to set the memory critical threshold.
If the value drops below this threshold, a memory critical alert appears in the
Monitor. You set the value in terms of megabytes of free virtual memory.
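Unlike the load thresholds, the memory thresholds trigger when the measured value falls below them. A minimal sketch, with hypothetical function and parameter names:

```python
from typing import Optional


def memory_alert(free_virtual_mb: float, warning_mb: float,
                 critical_mb: float) -> Optional[str]:
    """Classify free virtual memory (in MB) against the two thresholds."""
    if free_virtual_mb < critical_mb:
        return "critical"
    if free_virtual_mb < warning_mb:
        return "warning"
    return None
```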
Maximum Job Pending Time -- You use this parameter to specify the amount of
time that a job spends pending after which a Job Pending alert appears in the Monitor.
You set the value in hours.
Note –
It is important that you set these five parameters to sensible values,
according to the characteristics of your particular grid. Otherwise, an excessive
number of alerts will appear on the Monitor main page, cluttering the display.
Changing N1GE Settings
The N1GE settings affect the way N1GE is installed onto the master, compute
and access hosts. The N1GE administrator must determine the various parameter values
suited to their local Grid environment.
Factors you should determine include the local namespace for users, TCP services,
file directory structure, operating system, and so forth. The values have default
options which are suitable for a generic installation. You should be familiar with
the N1GE 6 product before changing any of these values. If you wish to change more
advanced configuration settings, please see Chapter 3, Using the Setup
configuration file.
Once you deploy the master host, you cannot edit these values, which remain in
effect for all further deployments of compute and access hosts. You can only edit
the values again if you uninstall the master host. The following section describes
each setting.
SGE Root -- This setting is the root directory under which the N1GE files will
be installed. Note that the files will be installed on all hosts in this directory.
SGE Cell -- This setting is the N1GE cell name used for the deployment.
Qmaster TCP Port -- This setting is the TCP port to use for the N1GE qmaster
daemon.
Execd TCP Port -- This setting is the TCP port to use for the N1GE execd daemon.
Admin Username -- This setting is the username of the N1GE admin user.
Admin UID -- This setting is the UID of the N1GE admin user.
Grid Engine Version -- This parameter indicates the version of N1 Grid Engine
that will be deployed on the compute and access hosts.
Changing NFS Settings
These settings affect the way the N1GE “common” directory for the
chosen cell name is mounted on all access and compute hosts. The settings are described
as follows.
NFS Server Name -- The name of the NFS server from which all compute and access
hosts will mount the N1GE “common” directory. When you deploy the master
host using GEMM, this parameter is set automatically to the master host. Once you
deploy the master host you cannot edit this value and it remains in effect for all
further deployments of compute and access hosts. You can only edit the setting again
if you uninstall the master host.
NFS Mount Point -- The directory which is mounted from the NFS server for the
N1GE “common” directory. When deploying the master host using GEMM, this
is set automatically to <SGE_Root>/<SGE_Cell>/common,
where <SGE_Root> and <SGE_Cell> are the
values specified above. Once you deploy the master host you cannot edit this value
and it remains in effect for all further deployments of compute and access hosts.
You can only edit the setting again if you uninstall the master host.
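The automatically derived mount point is simply the SGE Root and SGE Cell values joined with the fixed directory name common. The sketch below shows the composition; the function name and the example values in the usage note are assumptions.

```python
import posixpath


def common_mount_point(sge_root: str, sge_cell: str) -> str:
    """Build <SGE_Root>/<SGE_Cell>/common from the two N1GE settings."""
    return posixpath.join(sge_root, sge_cell, "common")
```

With a hypothetical SGE Root of /opt/n1ge and a cell named default, this yields /opt/n1ge/default/common.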
Linux NFS Mount Options -- This setting specifies the options used when mounting the “common”
directory onto a Linux compute or access host. The value in this field is inserted
into the Linux /etc/fstab file on each host as:
<Servername>:<Mountpoint> <Mountpoint> nfs <Mountoptions> 0 0
where <Servername> and <Mountpoint> are
the values specified above and <Mountoptions> are the specified
Linux NFS mount options.
Note –
This parameter cannot contain any spaces.
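The following sketch assembles the /etc/fstab entry in the form shown above and rejects option strings that contain whitespace. It illustrates the format only; the function name and the example server, path, and options in the usage note are hypothetical.

```python
def linux_fstab_line(server: str, mount_point: str, options: str) -> str:
    """Format the Linux /etc/fstab entry for the N1GE common directory."""
    if not options or any(ch.isspace() for ch in options):
        # The mount options field must be a single token without spaces.
        raise ValueError("mount options must be non-empty and contain no spaces")
    return f"{server}:{mount_point} {mount_point} nfs {options} 0 0"
```

For example, linux_fstab_line("nfs1", "/opt/n1ge/default/common", "rw,hard") produces the line nfs1:/opt/n1ge/default/common /opt/n1ge/default/common nfs rw,hard 0 0.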
Solaris NFS Mount Options -- This setting specifies the options used when mounting
the “common” directory onto a Solaris compute or access host. The value
in this field is inserted into the Solaris /etc/vfstab file on
each host as:
<Servername>:<Mountpoint> - <Mountpoint> nfs - yes <Mountoptions>
where <Servername> and <Mountpoint> are the values specified above and <Mountoptions> are the specified Solaris NFS mount options.
Note –
This parameter cannot contain any spaces.
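A comparable sketch for Solaris follows. It assumes the standard seven-field /etc/vfstab layout (device to mount, device to fsck, mount point, file system type, fsck pass, mount at boot, mount options), with "-" for the two fields that do not apply to NFS and "yes" for mount at boot. The function name is hypothetical.

```python
def solaris_vfstab_line(server: str, mount_point: str, options: str) -> str:
    """Format the Solaris /etc/vfstab entry for the N1GE common directory."""
    if not options or any(ch.isspace() for ch in options):
        # The mount options field must be a single token without spaces.
        raise ValueError("mount options must be non-empty and contain no spaces")
    fields = [
        f"{server}:{mount_point}",  # device to mount
        "-",                        # device to fsck (not used for NFS)
        mount_point,                # mount point
        "nfs",                      # file system type
        "-",                        # fsck pass (not used for NFS)
        "yes",                      # mount at boot
        options,                    # mount options
    ]
    return " ".join(fields)
```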
Changing the Proxy Host
Figure 1–27 Change Proxy Host Page
Currently, there is only one proxy setting, which indicates the host on which
monitoring commands are executed. If the master host has been previously deployed
using GEMM, then the proxy host is set to this host and cannot be changed until the
master is uninstalled. To choose the proxy host, click the Choose Proxy button at
the bottom of the page. A table appears listing all the hosts on which the GEMM framework
has been installed. Select one host from this table.
Note –
The host you choose must be an N1GE admin host; otherwise, installation and
uninstallation of other hosts, as well as monitoring, could fail.
Changing the N1GE Version
Figure 1–28 Change N1GE Version Page
To set the N1GE version parameter, click the Choose Version button at the bottom
of the page. This action presents a table from which you select a version by clicking
its Inspect icon. The available versions are those uploaded in the GEMM Version management
page. If you deployed the master host previously using GEMM, the version chosen at
that time is displayed. Manual changes to this parameter are not allowed until you
uninstall the master host.
Special Consideration: External Master Host
You can use GEMM for deployment and monitoring even with an N1GE master host
not configured by GEMM. Possible scenarios include:
-
There is an already-existing N1GE installation.
-
You wish to deploy the master host on a platform not supported by
the Sun Control Station framework.
-
You need to install the master host in a configuration unsupported
by GEMM, such as with a shadow host, or with high-availability cluster via Sun Cluster
software.
If you have an externally-configured master host, you can still use
GEMM to deploy compute and access hosts, as well as for monitoring. However, you need
to follow these steps:
-
Collect the N1GE and NFS settings
-
Establish a Proxy Host
-
Deploy the Chosen Proxy Host as a Compute or Access Host
Collecting the N1GE and NFS settings
Once you have configured the master host and ensured that it is up and running
properly, take note of all the values for the N1GE settings as well as the NFS settings.
These settings are essentially the parameters you would use if you were to install
an execution host manually and associate it with the master host, including the choice
of NFS options for mounting the N1GE common directory. For example, you might mount
the common directory from the master host or you may need to mount it from a separate
file server system or appliance. Note that the correct choice of the NFS settings
for the N1GE common directory is a critical step, since the common directory contains
a file which tells the compute and access hosts where to find the master host.
Part of this step is to ensure that the exact same version of N1 Grid Engine
which is running on the master host has been uploaded to GEMM using the Version page.
N1 Grid Engine will not function properly unless the same version, including update
level, is used on the master host and all compute and access hosts.
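A quick consistency check along these lines can help before deploying. The sketch compares version strings exactly, so a differing update level counts as a mismatch. The version string format, host names, and function name are assumptions for illustration; GEMM does not expose this function.

```python
from typing import Dict, List


def hosts_with_mismatched_version(master_version: str,
                                  host_versions: Dict[str, str]) -> List[str]:
    """Return the hosts whose N1GE version differs from the master's.

    Versions are compared as exact strings, so the update level
    must match as well as the release number.
    """
    return sorted(host for host, version in host_versions.items()
                  if version != master_version)
```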
Once you have determined and set the N1GE and NFS settings, it is important
not to modify them again. Otherwise, further compute and access host deployments could
be corrupted and will not work.
Establishing a Proxy Host
In order for GEMM to deploy additional compute and access hosts and perform
monitoring, you must choose a host from the Sun Control Station as an N1GE Admin Host.
This host must remain as an N1GE admin host as long as GEMM is in use. You may choose
a system which will be a compute host as well or you may choose a system which will
only be an access host. This choice is determined by factors such as:
-
Security concerns about a compute host having admin privileges: this
factor depends upon your established policy for using N1GE.
-
Concerns about monitoring being impacted by compute tasks: by default,
the monitoring command runs once a minute on the chosen host, which probably will
not have a large impact unless the host is running a very resource-intensive job.
-
Permanence: the host you choose must be one that you do not expect
ever to take down while GEMM is in use; otherwise, monitoring and deployments
will not work.
Once you have decided which host to make the admin host,
perform these steps:
-
Set this host as the proxy host as previously described.
-
On the master host, add this host to the list of admin hosts. You
can add the host using the N1GE GUI or add it from the command line by using the N1GE qconf -ah command.
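If you script this step, the command line can be assembled as in the sketch below. Only the qconf -ah option comes from this manual; the function name and the host name in the usage note are hypothetical.

```python
from typing import List


def qconf_add_admin_host_cmd(hostname: str) -> List[str]:
    """Build the argument list for adding a host to the N1GE admin host list."""
    if not hostname or any(ch.isspace() for ch in hostname):
        raise ValueError("hostname must be non-empty and contain no whitespace")
    return ["qconf", "-ah", hostname]
```

On the master host, with the N1GE environment set up, the resulting list could be passed to subprocess.run(qconf_add_admin_host_cmd("proxy1"), check=True).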
Deploying the Chosen Proxy Host as a Compute or Access Host
At this point, click the Install Host menu item, select the chosen proxy, and
install it as a compute or access host. You must select only the chosen proxy host
and no other host in this step. You must wait to deploy additional hosts until the
proxy host has been successfully established.
Uninstalling Hosts
From the Grid Engine main page you have two uninstall choices. You can uninstall
a particular host or hosts or you can uninstall everything.
To Uninstall Hosts
You can remove one or more compute hosts from the compute grid.
When you uninstall a compute host, the N1GE software is shut down and removed
from the selected hosts. The N1GE master host is instructed to remove those compute
hosts from the N1GE compute grid.
Note –
Before you start the uninstall procedure, ensure that no jobs are running
on the compute hosts that you want to uninstall. Any jobs that are currently running
on these hosts will be terminated. If the jobs are marked as “re-runnable”,
they are automatically resubmitted to the N1GE compute grid for execution on another
compute host. However, if they are marked as “not re-runnable,” then
they are not rescheduled and are not automatically executed elsewhere.
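Before uninstalling, it can be useful to work out which running jobs would be resubmitted and which would be lost. The sketch below assumes you already have, for each job on the affected hosts, a flag saying whether it is re-runnable; the job identifiers and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def split_jobs_on_uninstall(rerunnable_by_job: Dict[str, bool]
                            ) -> Tuple[List[str], List[str]]:
    """Split running jobs into (resubmitted, lost) lists.

    Re-runnable jobs are resubmitted to the grid after termination;
    jobs that are not re-runnable are terminated and not rescheduled.
    """
    resubmitted = sorted(job for job, ok in rerunnable_by_job.items() if ok)
    lost = sorted(job for job, ok in rerunnable_by_job.items() if not ok)
    return resubmitted, lost
```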
Steps
-
Select Grid Engine > Uninstall Host.
The selector appears, displaying the list of hosts currently in the compute
grid; see Uninstalling Hosts.
-
Click to highlight one or more hosts. You can also click Select All at the top to choose all hosts in the list.
-
Click Uninstall Selected Nodes in the
bottom right corner.
The Uninstall Task Progress Dialog appears.
Figure 1–29 Select Nodes to Uninstall Page
To Uninstall Everything
You can remove all components of the Grid Engine module from the master host
and all compute hosts.
Before you uninstall everything, be aware that:
-
all jobs (both running and suspended) are killed
-
all pending jobs are lost
-
all configurations and all records of previously run jobs are lost.
To uninstall everything:
Steps
-
Select Grid Engine > Uninstall Everything.
A screen appears, explaining the Uninstall Everything feature.
Figure 1–30 Uninstall Completely Dialog
-
Click Uninstall Master Host and ALL Compute Hosts.
The Task Progress dialog appears.