SPARCcluster HA Server Service Manual

2 Troubleshooting Overview

2.1 SPARCcluster Architecture

A SPARCcluster HA Server comprises redundant, on-line components that allow the system to continue operating through the failure, repair, and relocation of one assembly or device. To maintain a high level of availability, replace failed components as soon as possible. Take service precautions to keep the cluster operating while maintenance is performed. See Section 2.2, "Maintenance Authorization."

2.2 Maintenance Authorization

The site system administrator must be contacted to prepare a node for maintenance and, after maintenance, to return the node to cluster membership. The procedures in this manual note the points where the system administrator must be contacted. However, the equipment owner's administrative requirements supersede the procedures contained herein.

2.3 Troubleshooting a Remote Site

Use telnet to communicate with either node in a cluster.
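
Before opening a telnet session from a remote site, it can help to confirm that each node answers on the telnet port. The following is a minimal sketch in Python and is not part of the SPARCcluster software. The node names in the usage comment are hypothetical placeholders, and port 23 is assumed because it is the standard telnet port; a site may route console access differently.

```python
import socket

TELNET_PORT = 23  # standard telnet port; assumed, not taken from this manual


def is_reachable(host, port=TELNET_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def check_nodes(nodes, port=TELNET_PORT, timeout=5.0):
    """Map each node name to its reachability on the given port."""
    return {node: is_reachable(node, port, timeout) for node in nodes}


# Usage (hypothetical node names; substitute the names used at the site):
#   check_nodes(["node-a", "node-b"])
```

A node that does not answer here is not necessarily down; the check only shows whether a telnet session is likely to connect.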

2.4 Troubleshooting Flow

2.4.1 Takeover

The Solstice HA software allows one node to take over when a critical hardware or software failure is detected. When a failure is detected, an error message is written to the system console and, if required, the service provider is notified (depending on the system maintenance contract). When a takeover occurs, the node assuming control becomes the I/O master for the disksets on the failed node and redirects the clients of the failed node to itself. The troubleshooting flow for a takeover is depicted in Figure 2-1.
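
The effect of a takeover on diskset ownership and client routing can be pictured with a toy model. This sketch is illustrative only: it is not how the Solstice HA software is implemented, and the class, function, node, diskset, and client names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    disksets: set = field(default_factory=set)  # disksets this node masters
    clients: set = field(default_factory=set)   # clients this node serves
    up: bool = True


def takeover(failed: Node, survivor: Node) -> None:
    """Move I/O mastery and clients from the failed node to the survivor."""
    if not survivor.up:
        raise RuntimeError("no healthy node available to take over")
    failed.up = False
    survivor.disksets |= failed.disksets  # survivor becomes I/O master
    survivor.clients |= failed.clients    # clients are redirected
    failed.disksets = set()
    failed.clients = set()


# Usage with hypothetical names:
#   a = Node("node-a", {"diskset-a"}, {"client-1"})
#   b = Node("node-b", {"diskset-b"}, {"client-2"})
#   takeover(failed=a, survivor=b)   # b now masters both disksets
```

After the call, the surviving node holds every diskset and client, which is why a failed node should be repaired promptly: the cluster has no remaining redundancy until it rejoins.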

2.4.2 Switchover

Administrators can manually direct one node to take over the data services of the other node. This is referred to as a switchover (refer to the SPARCcluster High Availability Server Software Administration Guide).

2.4.3 Failures Where There is no Takeover

For noncritical failures, there is no software takeover. However, to continue to provide HA data services, perform troubleshooting in the following order:

Warning - Do not plug a keyboard directly into a node system board. A keyboard plugged into a system board becomes the default for console input, which prevents input from the system administration workstation/terminal concentrator serial port. In addition, plugging a keyboard directly into a node system board while power is applied to the node sends a break signal to the Solaris operating system, just as if you had typed L1-A on the console.

  1. You will be contacted by the system administrator to replace a defective part, or to further isolate a system defect to a failed part.

  2. Request that the system administrator prepare the applicable assembly containing the defective part for service.

  3. Isolate the fault to the smallest replaceable part.

  4. Shut down the specific assembly containing the defective part.

  5. Replace the defective part.

  6. Contact the system administrator to return the repaired assembly to the cluster.


Figure 2-1

2.5 Fault Classes/Principal Assemblies

SPARCcluster HA troubleshooting depends on several principal assemblies and classes of faults. The fault classes and their associated assemblies are:
  • SPARCstorage Array faults

    · Data disk

    · Controller

    · Fibre Channel Optical Module

    · Fibre Channel SBus card

    · Fiber optic cables/interfaces

  • Node (SPARCcenter 2000 or SPARCserver 1000) faults

    · Boot disk

    · System board

    · Control board

    · Fibre Channel Optical Module

    · Fibre Channel SBus card

    · Fiber optic cables/interfaces

    · Client net SBus card

    · Client net/connections

    · SunFastEthernet SBus card/interfaces (SunFastEthernet)

  • Terminal concentrator/serial connections faults
  • Software faults

    · Application program died

    · System crash (panic)

    · Hung system (lock up)

    · Cluster-wide failures

All troubleshooting begins at the system console. The console should be checked regularly, as should any other source of operator information. For example, the output of the hastat command should be checked regularly. For more information on the hastat command, refer to the SPARCcluster High Availability Software Administration Guide.
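
Regular checks are easier to keep up when each report is saved with a timestamp. The following is a minimal sketch, assuming a host where Python 3 is available and hastat is on the PATH and can be run without options. It makes no assumption about the format of the hastat output and only stores the text as printed. The function names and the log path in the usage comment are hypothetical.

```python
import subprocess
from datetime import datetime


def capture_status(command=("hastat",), timeout=60):
    """Run a status command and return (timestamp, exit code, output text)."""
    started = datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(
        list(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep error text together with the report
        text=True,
        timeout=timeout,
    )
    return started, result.returncode, result.stdout


def append_to_log(path, entry):
    """Append one captured status report to a plain-text log file."""
    started, code, output = entry
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"=== {started} (exit {code}) ===\n{output}\n")


# Usage (hypothetical log path):
#   append_to_log("/var/tmp/hastat.log", capture_status())
```

Comparing consecutive entries in the log shows when a status change first appeared, which narrows down where to start reading the console messages.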

2.6 Error Messages/Symptoms

Table 2-1 lists error messages or symptoms together with the probable cause and troubleshooting reference.
Table 2-1 Error Message/Symptom

Error Message/Symptom | Probable Cause | Cluster Service Reference | Troubleshooting Reference

Nodes
...eboots; ...e; ...esponse | SPARCcenter 2000 or SPARCserver 1000 | Section 3.3, "Node Failures... | SPARCcenter 2000/SPARCserver 1000 System Service Manual

Private Net
...eboots; ...e; ...esponse | SPARCcenter 2000 or SPARCserver 1000 | Section 3.3, "Node Failures... | SPARCcenter 2000/SPARCserver 1000 System Service Manual
... | SunFastEthernet | Section 3.4.1, "Private Net Failure (SunFastEthernet)" | SunFastEthernet Adapter User Guide

Client Network
... | Client net | As applicable | Refer to your client network documentation

Public Network
... | Cable | Not applicable | (See Chapters 9 and 10 of the SPARCcluster HA Hardware Planning and Installation Manual for cable detail.)

SPARCstorage Array
soc.link.5010 soc#: port: # Fibre Channel is OFFLINE; c2t4d8s2 failed (see Appendix D for additional messages) | Disk | Section 3.2, "SPARCstorage Array/Optical Connections Faults" | SPARCstorage Array Model 100 Series Service Manual

Terminal Concentrator
No messages for one of the nodes on the system console; no messages from either node on the system console | Terminal concentrator | Section 3.5, "Terminal Concentrator/Serial Connection Faults" | ...
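
The disk named in the SPARCstorage Array message in Table 2-1 (c2t4d8s2) follows the Solaris cNtNdNsN naming convention: controller, target, disk, and slice. When locating a failed drive, it can be convenient to split such a name into its fields. The following is a minimal sketch that assumes names of exactly that form; the function names are illustrative and not part of any Sun tool.

```python
import re

# cNtNdNsN: controller, target, disk, slice (Solaris disk naming convention)
_DISK_NAME = re.compile(r"\bc(\d+)t(\d+)d(\d+)s(\d+)\b")


def parse_disk_name(name):
    """Split a Solaris disk name such as 'c2t4d8s2' into its numeric fields."""
    match = _DISK_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a cNtNdNsN disk name: {name!r}")
    controller, target, disk, slice_ = (int(g) for g in match.groups())
    return {"controller": controller, "target": target,
            "disk": disk, "slice": slice_}


def disk_names_in(message):
    """Return every cNtNdNsN name that appears in a console message."""
    return [m.group(0) for m in _DISK_NAME.finditer(message)]


# Usage:
#   parse_disk_name("c2t4d8s2")
#   -> {"controller": 2, "target": 4, "disk": 8, "slice": 2}
```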

2.7 Device to Troubleshooting Cross Reference

Table 2-2 cross-references each device or trouble area to the appropriate troubleshooting manual.
Table 2-2

Device/Trouble Area | Reference | Part Number
Array Controller/Fibre Optic Connector/Fibre Channel Optical Module | SPARCstorage Array Model 100 Series Service Manual (Chapter 2 "Troubleshooting") | 801-2206
Disk drive | SPARCstorage Array Model 100 Series Service Manual | 801-2206
Terminal concentrator | SPARCcluster HA Server Service Manual (Section 3.5, "Terminal Concentrator/Serial Connection Faults") | 802-3512
SPARCcenter 2000 | SPARCcenter 2000 Service Manual (Chapter 2, "Troubleshooting Overview") | 801-2007
SPARCserver 1000 | SPARCserver 1000 System Service Manual (Chapter 2, "Troubleshooting Overview") | 801-2895
SunFastEthernet adapter | SunFastEthernet Adapter User Guide (Appendix C, "Running Diagnostics") | 801-6109
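
For a site that keeps its service notes in scripts, the entries of Table 2-2 can be held in a small lookup structure. The following is a minimal sketch: the manual titles and part numbers are transcribed from Table 2-2, while the dictionary layout, the lowercase keys, and the function name are illustrative choices.

```python
# Device/trouble area -> (manual, part number), transcribed from Table 2-2.
_SSA_MANUAL = "SPARCstorage Array Model 100 Series Service Manual"

MANUALS = {
    "array controller": (_SSA_MANUAL, "801-2206"),
    "fibre optic connector": (_SSA_MANUAL, "801-2206"),
    "fibre channel optical module": (_SSA_MANUAL, "801-2206"),
    "disk drive": (_SSA_MANUAL, "801-2206"),
    "terminal concentrator": ("SPARCcluster HA Server Service Manual", "802-3512"),
    "sparccenter 2000": ("SPARCcenter 2000 Service Manual", "801-2007"),
    "sparcserver 1000": ("SPARCserver 1000 System Service Manual", "801-2895"),
    "sunfastethernet adapter": ("SunFastEthernet Adapter User Guide", "801-6109"),
}


def manual_for(device):
    """Return (manual title, part number) for a device or trouble area."""
    key = device.strip().lower()
    if key not in MANUALS:
        raise KeyError(f"no Table 2-2 entry for {device!r}")
    return MANUALS[key]


# Usage:
#   manual_for("Terminal concentrator")
#   -> ("SPARCcluster HA Server Service Manual", "802-3512")
```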

2.8 Device Replacement Cross Reference

Table 2-3 cross-references each device to the manual that contains its replacement procedure.
Table 2-3

Device | Cross Reference | Part Number (SPARCserver 1000) | Part Number (SPARCcenter 2000)
Disk drive | Disk Drive Installation Manual for the SPARCstorage Array | 801-2207 | 801-2207
Optical Module | Fibre Channel Optical Module Installation Manual | 801-6326 | 801-6326
SunFastEthernet | SunFastEthernet Adapter User Guide | 801-6109 | 801-6109
System board, control board, power supply, SPARC module, boot disk | SPARCcenter 2000 or SPARCserver 1000 System Service Manual | 801-2895 | 801-2007