SPARCcluster HA Server Software Administration Guide

Monitoring the Solstice HA Servers

3

This chapter tells how to use the Solstice HA and Solstice DiskSuite commands to monitor the behavior of a Solstice HA configuration.
Use the following list to locate specific information in this chapter.
  • Overview of Solstice HA Monitoring
  • Monitoring the Solstice HA Configuration Status
  • Monitoring the Load of the Solstice HA Servers
  • Monitoring Metadevice Actions
  • Monitoring Metadevice State Database Replicas
  • Checking Message Files
  • Using SunNet Manager to Monitor Solstice HA Servers

3.1 Overview of Solstice HA Monitoring

You will use five utilities in addition to the /var/adm/messages files when monitoring the behavior of a Solstice HA configuration. The utilities you use include hastat(1M), haload(1M), metastat(1M), metadb(1M), and metatool(1M).

3.2 Monitoring the Solstice HA Configuration Status

hastat displays the current state of the Solstice HA configuration. The program displays status information about the hosts, logical hosts, private networks, public networks, data services, local disks, and disksets, along with the most recent error messages.
An example of output from hastat is shown below:

  # hastat  
  Configuration State: Stable  
  logicalhost1 - Owned by host1  
  logicalhost2 - Owned by host2  
  
  host1 -   1:56pm  up 2 day(s),  4:54,  2 users,  load average: 0.12, 0.09, 0.07  
  host2 -   1:56pm  up 2 day(s), 5 hr(s),  0 users,  load average: 0.11, 0.09, 0.12  
  
  Data Service HA-NFS: logicalhost1 - Unknown; logicalhost2 - Ok  
  
  Local metadevices: host1 - (none); host2 - (none)  
  Local metadb replicas: host1 - Ok; host2 - Ok  
  Diskset logicalhost1: Ok; MetaDB replicas in logicalhost1: Ok  
  Diskset logicalhost2: Ok; MetaDB replicas in logicalhost2: Ok  
  
  Private nets: Ok  
  Public nets: host1 - Ok; host2 - Ok  
  
  Extract of Message Log (examine /var/adm/messages for the full context):  
  Aug  9 09:11:14 host1 hadf: ERROR: nfs_mounttouchfile: Failed for host2:/palmer/root/whynot  
  with exit status 1  
  . . .  
  #  

Figure 3-1 Example hastat Output
The status is reported as follows:
  • Ok - The component's status is okay.
  • Not Ok - The component is not functioning. For instance, no public networks are responding.
  • Degraded - The component is working well enough to provide partial service to some clients, but needs some repair.
  • Unknown - There is not enough information about the component to determine the status. For instance, when the sibling host is down, the remaining host will list the private nets as Unknown.
The following list explains the output displayed:
  • Solstice HA configuration state - Either Down, Reconfiguring, or Stable. Down indicates that the Solstice HA configuration is not functioning. Reconfiguring is displayed while the Solstice HA configuration is in transition from one state to another because of a takeover or switchover. Stable indicates that the configuration is functioning as expected.
  • Logical hosts - The names of the logical hosts associated with the two disksets along with the name of the current owner, or the string Maintenance mode if the logical host is currently in the Maintenance state (taken down by an administrator).
  • Physical servers - The names of both physical servers in the Solstice HA configuration are displayed with the current time, the length of time the server has been up (in days and hours), the number of users, and the load average over the past 1, 5, and 15 minutes.
  • Status of data services - The data services and the logical hosts on which each is running. For HA-NFS the status is represented as Ok, Not Ok, or Degraded. For HA-ORACLE the status of each database is reported as running, maintenance, not configured correctly, or stopped. If a data service is not running on a logical host, that logical host is not listed for that data service. Not Ok indicates the data service has failed. If the status is Not Ok or Degraded, check the Message Log or the messages file (/var/adm/messages) to see if an error has been reported.
  • Local metadevices - The status of local Solstice DiskSuite metadevices, reported as Ok, Not Ok, or Unknown. If the status is Not Ok, you should first check the Message Log or messages file (/var/adm/messages) to see if an error has been reported. If one has not, run the metastat(1M) command to discover the problem. If the local file systems are not on metadevices, this field displays a status of none.
  • Local metadb replicas - The status of the metadevice state database replicas on the local disks, reported as Ok or Not Ok. If the status is Not Ok, one or more database replicas are inactive. Run the metadb(1M) command for additional information.
  • Disksets - The status of the multi-host disksets, reported as Ok, Not Ok, or Unknown. If the status is Not Ok, you should first check the Message Log or messages file (/var/adm/messages) to see if an error has been reported. If one has not, run the metastat -s diskset command to discover the problem. If hastat cannot determine the status, it is reported as Unknown.
  • Private networks - The status of private networks, displayed as Ok, Not Ok, Degraded, or Unknown. A status of Not Ok or Degraded indicates a problem with one or both of the private network interfaces. You can check the Message Log or messages file (/var/adm/messages) for additional information, or troubleshoot the interface directly for hardware or software faults by using commands such as ping(1M), or by swapping cables or controllers.
  • Public networks - The status of public networks, displayed as Ok, Not Ok, Degraded, or Unknown. If the status is Not Ok or Degraded, check the Message Log or messages file (/var/adm/messages) for additional information.
  • Recent error messages - The message log at the bottom of the display lists the last few messages from the /var/adm/messages file.

Note - Because the recent error messages list is a filtered extract of the log messages, the context of some messages may be lost. You should directly examine the /var/adm/messages file for a complete list of the messages.
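The status words described above lend themselves to a scripted check. The following is a minimal sketch, not part of Solstice HA, that assumes the hastat output has been captured as text in the layout of Figure 3-1. The function name find_problems is hypothetical. It reports each line whose status is Not Ok, Degraded, or Unknown, flags a configuration state other than Stable, and stops at the message log extract because that text is free-form.

```python
import re

# Status words taken from the list in this section; "Ok" is the healthy value.
PROBLEM_STATES = ("Not Ok", "Degraded", "Unknown")


def find_problems(hastat_text):
    """Return (line, state) pairs for hastat lines that report a problem.

    Assumes the layout shown in Figure 3-1; other layouts are not handled.
    """
    problems = []
    for raw in hastat_text.splitlines():
        line = raw.strip()
        # The message log extract is free-form text, so stop parsing there.
        if line.startswith("Extract of Message Log"):
            break
        if line.startswith("Configuration State:"):
            state = line.split(":", 1)[1].strip()
            if state != "Stable":
                problems.append((line, state))
            continue
        for state in PROBLEM_STATES:
            # A status value follows a "-" or ":" separator on the line.
            pattern = r"[-:]\s*" + re.escape(state) + r"\b"
            if re.search(pattern, line, re.IGNORECASE):
                problems.append((line, state))
                break
    return problems
```

Applied to the output in Figure 3-1, a check of this kind would single out the HA-NFS line, where logicalhost1 is reported as Unknown. Only the first problem state found on a line is recorded, so the full line should still be read.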

3.3 Monitoring the Load of the Solstice HA Servers

haload is used to monitor the load on the pair of Solstice HA servers. Monitoring is necessary because there must be some excess capacity between the two servers. If there is no excess capacity and a takeover occurs, the remaining server may be unable to take care of the combined workload.
haload monitors both servers and logs occurrences of an overload. The administrator should take corrective actions to eliminate the possibility of an overload.
If an overload occurs, haload will exit with the special exit code 99.
haload may be invoked either automatically by Solstice HA or by the system administrator.
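When haload is invoked from an administrator's own script, the exit code can be tested directly. The following is a minimal sketch under stated assumptions: this chapter does not list the haload arguments, so the command line is passed in by the caller (see haload(1M)), and the function name check_overload is hypothetical. The call blocks until the command exits.

```python
import subprocess

# Exit code that this section associates with an overload.
OVERLOAD_EXIT_CODE = 99


def check_overload(command):
    """Run `command` (an argument list) and return True if it exits with 99.

    The command line itself is an assumption supplied by the caller;
    consult haload(1M) for the actual arguments.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == OVERLOAD_EXIT_CODE
```

Any other exit code, including a failure unrelated to load, returns False, so a production script would probably want to distinguish those cases as well.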

3.4 Monitoring Metadevice Actions

Metadevices can be monitored using the metastat command or the DiskSuite Tool (metatool(1M)). Complete details about the two commands can be found in the Solstice DiskSuite 4.0 Administration Guide and Solstice DiskSuite Tool 4.0 User's Guide.
By default, metastat prints information to the screen about all metadevices and hot spare pools that are in the local diskset on the local host. If you want to view diskset status, you must run the command on the server that owns the diskset. An example of the metastat command follows:

  # metastat -s logicalhost1  
  logicalhost1/d0: Trans  
      State: Okay  
      Size: 14182560 blocks  
      Master Device: logicalhost1/d125  
      Logging Device: logicalhost1/d122  
  
  logicalhost1/d125: Mirror  
      Submirror 0: logicalhost1/d127  
        State: Okay  
      Submirror 1: logicalhost1/d126  
        State: Okay  
      Pass: 1  
      Read option: roundrobin (default)  
      Write option: parallel (default)  
      Size: 14182560 blocks  
  
  logicalhost1/d127: Submirror of logicalhost1/d125  
      State: Okay  
      Hot spare pool: logicalhost1/hsp000  
      Size: 14182560 blocks  
      Stripe 0:  
          Device              Start Block  Dbase State        Hot Spare  
          c1t0d0s0                   0     No    Okay  
      Stripe 1:  
          Device              Start Block  Dbase State        Hot Spare  
          c1t1d0s0                   0     No    Okay  
      Stripe 2:  
          Device              Start Block  Dbase State        Hot Spare  
          c1t1d1s0                   0     No    Okay  
   ...  

Individual metadevice status can be viewed by specifying the name of the metadevice on the metastat command line. For instance:

  # metastat -s logicalhost1 d0  

DiskSuite Tool displays the status of metadevices and hot spares in several ways. The problem list window of DiskSuite Tool contains a scrolling list of current metadevice problems (but not a history of problems). The list is updated each time DiskSuite Tool learns of a change in status. Each item on the list is given a time stamp.
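For scripted checks, the metastat output shown earlier in this section can also be scanned for any State value other than Okay. The following is a minimal sketch that assumes the output layout of the example above. The function name find_bad_states is hypothetical, and the per-slice State column in the stripe tables is deliberately ignored.

```python
import re

# Metadevice names as they appear in the example, e.g. d0 or logicalhost1/d125.
DEVICE_RE = re.compile(r"^(?:\S+/)?d\d+$")
SUBMIRROR_RE = re.compile(r"^Submirror \d+$")


def find_bad_states(metastat_text):
    """Return (device, state) pairs whose "State:" line is not "Okay".

    Only "key: value" lines are examined; stripe table rows are skipped.
    """
    bad = []
    current = None  # metadevice that the next State line belongs to
    for raw in metastat_text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # not a "key: value" line (for example, a stripe row)
        value = value.strip()
        if DEVICE_RE.match(key):
            current = key          # header such as "logicalhost1/d0: Trans"
        elif SUBMIRROR_RE.match(key):
            current = value        # "Submirror 0: logicalhost1/d127"
        elif key == "State" and value != "Okay":
            bad.append((current, value))
    return bad
```

An empty result means every State line read Okay. It does not replace a review of the full metastat output or of the DiskSuite Tool problem list.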

3.5 Monitoring Metadevice State Database Replicas

Use the metadb command to monitor the status of the metadevice state database replicas that reside on both local disks and in disksets.
To display the status of replicas that reside on local disks, execute metadb on the server where the disks are connected.
Complete details about the metadb command are in the Solstice DiskSuite 4.0 Administration Guide.
You can also use the metatool utility to check the status of metadevice state database replicas. Refer to Chapter 10 of the Solstice DiskSuite Tool 4.0 User's Guide for details.
To display the status of replicas that reside on disks in a diskset, execute the command shown below. The -i option prints the information message at the bottom of the output. The setname used as an argument to metadb is the name of the logical host.

  # metadb -i -s setname  
        flags           first blk      block count  
       a m     luo        16              1034            /dev/dsk/c1t0d0s7  
       a       luo        1050            1034            /dev/dsk/c1t0d0s7  
       a       luo        16              1034            /dev/dsk/c1t1d0s7  
       a       luo        1050            1034            /dev/dsk/c1t1d0s7  
       a       luo        16              1034            /dev/dsk/c1t2d0s7  
       a       luo        1050            1034            /dev/dsk/c1t2d0s7  
       a       luo        16              1034            /dev/dsk/c1t3d0s7  
       a       luo        1050            1034            /dev/dsk/c1t3d0s7  
   o - replica active prior to last mddb configuration change  
   u - replica is up to date  
   l - locator for this replica was read successfully  
   c - replica's location was in /etc/opt/SUNWmd/mddb.cf  
   p - replica's location was patched in kernel  
   m - replica is master, this is replica selected as input  
   W - replica has device write errors  
   a - replica is active, commits are occurring to this replica  
   M - replica had problem with master blocks  
   D - replica had problem with data blocks  
   F - replica had format problems  
   S - replica is too small to hold current data base  
   R - replica had device read errors  
  #  
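
The flag legend printed by metadb -i can be applied mechanically. The following is a minimal sketch that assumes the column layout of the example above (flags, first block, block count, device path). The function names parse_replicas and suspect_replicas are hypothetical. A replica is treated as suspect if it lacks the a (active) flag or carries any of the uppercase flags that the legend describes as problems.

```python
# Uppercase flags that the legend describes as errors or problems.
ERROR_FLAGS = set("WMDFSR")


def parse_replicas(metadb_text):
    """Parse replica lines from captured metadb output into dictionaries."""
    replicas = []
    for line in metadb_text.splitlines():
        tokens = line.split()
        # A replica line ends with: first blk, block count, device path.
        if len(tokens) < 3 or not tokens[-1].startswith("/dev/"):
            continue
        if not (tokens[-3].isdigit() and tokens[-2].isdigit()):
            continue
        replicas.append({
            "flags": "".join(tokens[:-3]),   # e.g. "amluo"
            "first_blk": int(tokens[-3]),
            "block_count": int(tokens[-2]),
            "device": tokens[-1],
        })
    return replicas


def suspect_replicas(replicas):
    """Return replicas that are not active or that carry an error flag."""
    return [
        r for r in replicas
        if "a" not in r["flags"] or ERROR_FLAGS & set(r["flags"])
    ]
```

For the example output above, every replica carries the a flag and no uppercase flags, so a check of this kind would report nothing.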

3.6 Checking Message Files

The Solstice HA software writes messages to the /var/adm/messages files in addition to reporting these to the console. The following is an example of the messages reported when a disk error occurs.

  ...  
  Jun 1 16:15:26 host1 unix: WARNING: /io-unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49):  
  Jun 1 16:15:26 host1 unix: Error for command 'write(I))' Err  
  Jun 1 16:15:27 host1 unix: or Level: Fatal  
  Jun 1 16:15:27 host1 unix: Requested Block 144004, Error Block: 715559  
  Jun 1 16:15:27 host1 unix: Sense Key: Media Error  
  Jun 1 16:15:27 host1 unix: Vendor 'CONNER':  
  Jun 1 16:15:27 host1 unix: ASC=0x10(ID CRC or ECC error),ASCQ=0x0,FRU=0x15  
  ...  


Note - Because Solaris and Solstice HA error messages are written to the /var/adm/messages file, the /var directory may become full. Refer to "Maintenance of the /var File System" on page 9-10 for the procedure to correct this problem.
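How full the /var file system is can be checked before it becomes a problem. The following is a minimal sketch, assuming a Python interpreter is available on the host being checked. The function names and the 90 percent threshold are illustrative assumptions, not values from the Solstice HA documentation.

```python
import shutil


def percent_used(path="/var"):
    """Return the percentage of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on unusual file systems
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/var", threshold=90.0):
    """Return True if usage is at or above `threshold` percent.

    The default threshold is an arbitrary illustrative value.
    """
    return percent_used(path) >= threshold
```

A check like this only reports the condition. The procedure for correcting it is the one referenced in the note above.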

3.7 Using SunNet Manager to Monitor Solstice HA Servers

You can use SunNet Manager(TM) and its agents to monitor Solstice HA configurations. SunNet Manager enables you to set up procedures to get information such as:
  • Ownership change
  • Status of private links
  • Host and network performance
This information can be presented in the following ways:
  • Graphically, using SunNet Manager
  • Through custom scripts
  • Event monitors that watch SunNet Manager data for significant changes

Note - Some of the SunNet Manager agents may have an adverse effect on the Solstice HA services.

The SunNet Manager agents that have been identified as useful and safe to use in a Solstice HA configuration include ping, hostif, hostmem, hostperf, and traffic.
Refer to the SunNet Manager documentation set for instructions on setting up the agents.

3.7.1 SunNet Manager Requirements

The following requirements apply to the use of SunNet Manager in Solstice HA configurations:
  • SunNet Manager should be installed on the workstation that you will be using to monitor the HA Servers.
  • The SunNet Manager libraries and agents should be installed on the local disks of the Solstice HA servers so the activities can be monitored.

Note - You can choose to have a SunNet Manager Console window that is closed to an icon open automatically when an event is received. You specify this in the Console's Properties window, available by clicking SELECT on the Props button in the Console's control area.

You can be notified of events by the blinking and coloring of the glyph. You can also be notified either by mail(1) or by having the output sent to your customized script.