4 Hardware Replacement and Repair
- This chapter provides the necessary software instructions to use when replacing or repairing hardware components such as disks and cables.
- Use the SPARCcluster High Availability Server Service Manual, SPARCstorage Array Model 100 Service Manual, and the Solstice DiskSuite 4.0 Administration Guide with the information in this chapter.
- Use the following table to locate specific information in this chapter.
-
4.1 Recovering From a Power Loss
- Maintenance of SPARCcluster 1000 or SPARCcluster 2000 configurations includes handling such failures as power loss.
4.1.1 SPARCcluster 1000 Configuration Power Loss
- In SPARCcluster 1000 configurations there are two types of power loss scenarios that can occur. They are:
-
- The SPARCcluster 1000 configuration has a single power cord and a failure takes down both Solstice HA servers.
- The two SPARCserver 1000s and the three SPARCstorage Arrays have separate power cords.
4.1.1.1 Single Power Cord
- When power is lost to a SPARCcluster 1000 configuration in a single 56-inch data center expansion cabinet, the entire configuration will go down if power is supplied by a single cord. The two SPARCserver 1000 servers in the cabinet will reboot when power is restored.
- You should immediately run hastat and metastat to look for any error conditions that may have happened due to the power outage.
- When the reboot happens, one of the servers is likely to boot faster than the other and take ownership of both disksets if you have a symmetric configuration. You must run haswitch(1M) to reset the default diskset ownership.
- The terminal concentrator will boot more slowly than the SPARCserver 1000s, which means you may miss messages that appear when the servers boot. The terminal concentrator can be powered from a separate outlet to ensure it is available as early in the power-up process as possible.
- If the server's power is not cabled correctly, one of the servers may reboot before the SPARCstorage Arrays and some disks may be invisible to Solaris 2.4. In this event, the server may reboot again and see the SPARCstorage Array when it comes up.
- You may need to use the instructions provided in Section 4.3.1, "Recovering From Power Loss," on page 4-20 for returning the multi-host disks to service.
-
Note - If any SPARCstorage Array is not ready at Solaris boot time, the associated disks will not be accessible. If this occurs, one or both servers must be rebooted.
4.1.1.2 Separate Power Cords
- If separate power cords are used on the two servers and the three SPARCstorage Arrays and you lose power to only one of the servers, the other server will detect the failure and initiate a takeover.
- When power is restored to the server that failed, it will boot, wait for the membership state to become stable, and rejoin the configuration. Both disksets will be owned by the server that did not fail. Perform a manual switchover to restore the default diskset ownership.
- If you lose power to one of the SPARCstorage Arrays, Solstice DiskSuite will detect errors on the affected disks and place the slices in error state. The SPARCstorage Array drivers will attempt to retry connections for up to one minute before reporting an error. Solstice DiskSuite mirroring will mask this failure from the Solstice HA fault monitoring. No switchover or takeover will occur.
- When power is returned to the SPARCstorage Array, you must perform the procedure documented in Section 4.3.1, "Recovering From Power Loss."
4.1.2 SPARCcluster 2000 Configuration Power Loss
- In SPARCcluster 2000 configurations there are several types of power loss scenarios that can occur. These include:
-
- The power to both SPARCcenter 2000s fails, taking down the entire configuration.
- The power to one SPARCcenter 2000 fails, taking down the server and one SPARCstorage Array.
- The power to one SPARCcenter 2000 fails, taking down the server, two SPARCstorage Arrays, and the terminal concentrator.
4.1.2.1 Total Configuration Failure
- If power to both servers in a SPARCcluster 2000 configuration fails, one of the servers may reboot faster than the other. If you have a symmetric configuration, the first server to reboot will take ownership of both disksets. In this event, you must return one of the disksets to the default master by using haswitch.
- You should immediately run hastat and metastat to look for any error conditions that may have happened due to the power outage.
4.1.2.2 Failure of a Server and One SPARCstorage Array
- If power is lost to one of the SPARCcenter 2000s and the SPARCstorage Array that is installed in the same cabinet, the other server will immediately initiate a takeover.
- When the power is restored, the server will reboot, rejoin the configuration and begin monitoring activity. You must manually run haswitch to give ownership of the diskset back to the server that had lost power.
- After the diskset ownership has been returned to the default master, any multi-host disks (submirrors, hot spares, and metadevice state database replicas) that reported errors must be returned to service. Use the instructions provided in Section 4.3.1, "Recovering From Power Loss," on page 4-20 for returning the multi-host disks to service.
4.1.2.3 Failure of a Server, Two SPARCstorage Arrays, and the Terminal Concentrator
- If power is lost to one of the SPARCcenter 2000s and the two SPARCstorage Arrays that are installed in the same cabinet, either a Solstice DiskSuite panic will occur because there is a minority of metadevice state database replicas or the Solstice HA software will cause a panic.
- When any I/O is done to the disks in either of the two SPARCstorage Arrays, the problem will be noticed by Solstice DiskSuite. Briefly, DiskSuite will retry the I/O, then it will initiate a replica minority panic when it attempts to record the error status of the affected submirrors and discovers it has only a minority of replicas accessible.
- Alternatively, the HA-NFS fault probing may observe the problem as slow response before Solstice DiskSuite actually receives the disk I/O error. In this case a takeover may be initiated, and a panic will occur during diskset takeover because only a minority of the replicas are accessible.
- The console message may not be visible if the terminal concentrator is also down.
- When power is restored, the SPARCcenter 2000 may reboot before the terminal concentrator. Thus, any errors reported when the SPARCcenter 2000 is rebooting must be viewed using dmesg(1M) or by looking in /var/adm/messages. Depending on the specifics of your configuration, manual intervention may be required to return the Solstice HA configuration to service.
4.2 Replacement of Failed Disks
- As part of standard Solstice HA administration, you should monitor the status of the configuration. See Chapter 3, "Monitoring the Solstice HA Servers," for the monitoring methods.
- During the monitoring process you may discover problems with local and multi-host disks. The following subsections provide instructions for correcting these problems.
4.2.1 Overview of Multi-host Disk Replacement
- The procedures in the following subsection describe a method for replacing a multi-host disk without interrupting Solstice HA services (online replacement). Consult the Solstice DiskSuite 4.0 Administration Guide for offline replacement procedures.
- If a disk in a SPARCstorage Array must be replaced, you must first stop all I/O to the SPARCstorage Array tray containing the disk to be replaced. This is required so the tray can be spun down in preparation for drive replacement.
-
Caution - Before deleting the replicas and hot spares, you must make a record of the location (slice), number of replicas, and the hot spare information (names of the devices and all containing hot spare pools) so the actions can be reversed following the disk replacement.
- You must delete any metadevice state database replicas from the affected tray to prevent replica I/O operations during the replacement procedure. You must also offline or detach submirrors on the affected tray to stop their I/O activity. Finally, available hot spare devices must be deleted from hot spare pools to prevent them from being brought into service during the disk replacement procedure.
- When the I/O operations have been stopped, the drives on the SPARCstorage Array tray can be spun down and the tray removed. The disk can be replaced and the tray returned to the array.
- Before the replacement disk drive is used, it must be partitioned to match the partitioning of the replaced disk. This can be done using format(1M) or fmthard(1M).
- Metadevice state database replicas are added back to the tray in the same locations and with the same counts using the metadb(1M) command. Offline mirrors are brought back online with metaonline(1M), which brings them up to date by resyncing only the dirty regions of the submirrors (optimized resync). Detached mirrors are attached with metattach(1M), which brings them up to date with a full submirror resync (this is expensive, but depending on the specifics of the submirror configuration it may be the only safe method). Finally, the deleted hot spare devices are returned to their original hot spare pools using metahs(1M).
· How to Replace a Failed Multi-host Disk
- Replacement of a failed multi-host disk is a complex procedure. You should read Section 4.2.1, "Overview of Multi-host Disk Replacement," before you begin.
-
Note - This procedure can be used if a submirror component is in maintenance state, has been replaced by a hot spare, or is generating intermittent errors.
- When metastat(1M) reports that a device is in maintenance state or some of the components have been replaced by hot spares, you must locate and replace the device. An example metastat output that shows device c3t3d4s0 is in maintenance state follows:
-
host1# metastat -s logicalhost1
...
d50:Submirror of logicalhost1/d40
State: Needs Maintenance
Stripe 0:
Device Start Block Dbase State Hot Spare
c3t3d4s0 0 No Okay c3t5d4s0
...
|
- To locate and replace the disk, perform the following steps:
-
-
Identify the disk to be replaced by examining /var/adm/messages and metastat output.
-
host1# tail -f /var/adm/messages
...
Jun 1 16:15:26 host1 unix: WARNING: /io-unit@f,e1200000/sbi@0.0/SUNW,pln@a0000000,741022/ssd@3,4(ssd49):
Jun 1 16:15:26 host1 unix: Error for command 'write(I))' Err
Jun 1 16:15:27 host1 unix: or Level: Fatal
Jun 1 16:15:27 host1 unix: Requested Block 144004, Error Block: 715559
Jun 1 16:15:27 host1 unix: Sense Key: Media Error
Jun 1 16:15:27 host1 unix: Vendor 'CONNER':
Jun 1 16:15:27 host1 unix: ASC=0x10(ID CRC or ECC error),ASCQ=0x0,FRU=0x15
...
|
- Based on the above information and metastat output, it is determined that drive c3t3d4 must be replaced.
-
-
Locate the diskset that contains the affected drive.
Locate drive c3t3d4 by entering the following commands. Note that no output was displayed when the command was run with logicalhost2, but logicalhost1 reported that the name was present. In the reported output, the yes field indicates that the disk contains a metadevice state database replica.
-
host1# metaset -s logicalhost2 | grep c3t3d4
host1# metaset -s logicalhost1 | grep c3t3d4
c3t3d4 yes
|
-
-
Switch ownership of both logical hosts to one Solstice HA server using a command similar to the following:
-
host1# haswitch host1 logicalhost1 logicalhost2
|
- The SPARCstorage Array tray that contains c3t3d4 (the disk with the problem in this example) may also contain disks from both disksets. If this is the case, you must switch ownership of both disksets to the server where the ssaadm(1M) command will be used to spin down the disks.
-
-
Determine the location of the problem disk.
To find the SPARCstorage Array tray where the problem disk resides, run the ssaadm command.
-
host1# ssaadm display c3
SPARCstorage Array Configuration
Controller path: /devices/io-unit@f,e1200000/sbi@0.0/SUNW,soc@0,0/SUNW,pln@a0000000,741022:ctlr
DEVICE STATUS
TRAY1 TRAY2 TRAY3
Slot
1 Drive:0,0 Drive:2,0 Drive:4,0
2 Drive:0,1 Drive:2,1 Drive:4,1
3 Drive:0,2 Drive:2,2 Drive:4,2
4 Drive:0,3 Drive:2,3 Drive:4,3
5 Drive:0,4 Drive:2,4 Drive:4,4
6 Drive:1,0 Drive:3,0 Drive:5,0
7 Drive:1,1 Drive:3,1 Drive:5,1
8 Drive:1,2 Drive:3,2 Drive:5,2
9 Drive:1,3 Drive:3,3 Drive:5,3
10 Drive:1,4 Drive:3,4 Drive:5,4
CONTROLLER STATUS
Vendor: SUNW
Product ID: SSA100
Product Rev: 1.0
Firmware Rev: 2.3
Serial Num: 000000741022
Accumulate performance Statistics: Enabled
|
- The ssaadm output for controller (c3) shows that Drive 3,4 (c3t3d4) is the closest to you when you pull out the middle tray.
-
-
Delete all hot spares that have Available status and are in the same tray as the problem disk.
This includes all hot spares, regardless of their logical host assignment. In the following example, metahs reports the hot spares on logicalhost1, but none are present on logicalhost2. You should record all the information about the hot spares so they can be added back to the hot spare pools following the replacement procedure.
-
host1# metahs -s logicalhost1 -i
logicalhost1:hsp000 2 hot spares
c1t4d0s0 Available 2026080 blocks
c3t2d5s0 Available 2026080 blocks
host1# metahs -s logicalhost1 -d hsp000 c3t2d5s0
host1# metahs -s logicalhost2 -i
host1#
|
-
-
Delete any metadevice state database replicas that are on disks in the tray that must be pulled. You must keep track of this information because you must replace these replicas in Step 18.
There may be multiple replicas on the same disk. Make sure you record the number of replicas deleted from each slice.
-
host1# metadb -s logicalhost1
This command reports the replicas in diskset logicalhost1
host1# metadb -s logicalhost2
This command reports the replicas in diskset logicalhost2
host1# metadb -s logicalhost1 -d replicas_in_tray
host1# metadb -s logicalhost2 -d replicas_in_tray
|
-
-
Locate the submirrors that are using components that reside in tray 2.
One method is to use the metastat command to create temporary files that contain the names of all metadevices. For instance:
-
host1# metastat -s logicalhost1 > /tmp/logicalhost1.stat
host1# metastat -s logicalhost2 > /tmp/logicalhost2.stat
|
- Search the temporary files for the c3t3dn and c3t2dn components. If you used the hasetup(1M) defaults (two non-reserved user slices per disk), there will be a maximum of 20 components (10 disks * 2 slices).
- The information in the temporary files will look like:
-
...
logicalhost1/d35: Submirror of logicalhost1/d15
State: Okay
Hot Spare pool: logicalhost1/hsp100
Size: 2026080 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c3t3d3s0 0 No Okay
logicalhost1/d54: Submirror of logicalhost1/d24
State: Okay
Hot Spare pool: logicalhost1/hsp106
Size: 21168 blocks
Stripe 0:
Device Start Block Dbase State Hot Spare
c3t3d3s6 0 No Okay
...
|
-
-
Detach all submirrors with components on the disk that is being replaced.
If you are detaching a submirror that has an errored component you must force the detach using the metadetach -f option.
-
host1# metadetach -s logicalhost1 d40 d50
|
-
-
Take all other submirrors that have components in tray 2 offline.
Using the output from the temporary files in Step 7, run the metaoffline command on all submirrors in tray 2.
-
host1# metaoffline -s logicalhost1 d15 d35
host1# metaoffline -s logicalhost1 d24 d54
...
|
- Run metaoffline as many times as necessary (maybe up to 20 times) to take all the submirrors offline. This forces Solstice DiskSuite to stop using the submirror components in tray 2 so that the spin down command can be issued.
-
-
Spin down all disks in tray 2 of the SPARCstorage Array.
-
host1# ssaadm stop -t 2 c3
|
-
Caution - The SPARCstorage Array tray should not be removed as long as the LED on the tray is illuminated. Also, you should not run any Solstice DiskSuite command while the tray is spun down as these may have the side effect of spinning up some or all of the drives in the tray.
-
-
Pull tray 2 and replace the bad disk.
Instructions for the hardware procedure are found in the SPARCstorage Array Model 100 Series Service Manual (part number 801-2206) and the SPARCcluster High Availability Server Service Manual.
-
Make sure all disks in tray 2 of the SPARCstorage Array spin up.
The disks in the SPARCstorage Array tray should automatically spin up following the hardware replacement procedure. If the tray fails to spin up automatically within two minutes, force the action by using the following command.
-
host1# ssaadm start -t 2 c3
|
-
-
Use format(1M) or fmthard(1M) to repartition the new disk. Make sure you partition the new disk exactly as the replaced disk was partitioned. Saving the disk format information was recommended in Chapter 2, "Preparing for Administration."
-
-
Bring all submirrors that were taken offline in Step 9 back online.
-
host1# metaonline -s logicalhost1 d15 d35
host1# metaonline -s logicalhost1 d24 d54
...
|
- Running metastat at this time would show that all the metadevices with components that reside in the second tray need maintenance.
- When the submirrors are brought back online, Solstice DiskSuite will automatically perform resyncs on all the submirrors, bringing all the data back up to date.
- Run metaonline as many times as necessary (maybe up to 20 times) to bring all the submirrors online.
-
-
Attach submirrors that were detached in Step 8.
-
host1# metattach -s logicalhost1 d40 d50
|
-
-
Replace any hot spares in use in the submirrors attached in Step 15.
If a submirror had a hot spare replacement in use before you detached the submirror, this hot spare replacement will be in effect after the submirror is reattached. This step returns the hot spare to Available status.
-
host1# metareplace -s logicalhost1 -e d40 c3t3d4s0
|
-
-
Add all hot spares that were deleted in Step 5.
-
host1# metahs -s logicalhost1 -a hsp000 c3t2d5s0
|
-
-
Add all metadevice state database replicas that were deleted from disks on tray 2.
Use the information saved from Step 6 to replace the metadevice state database replicas.
-
host1# metadb -s logicalhost1 -a deleted_replicas
|
-
-
Switch each logical host back to its default master.
-
host1# haswitch host2 logicalhost2
|
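Steps 7 and 9 require finding every submirror that has a component in the affected tray. The following is a minimal sketch of that search, not part of the Solstice HA software. The function name submirrors_in_tray is hypothetical, and the tray-to-target mapping (tray N holds targets 2(N-1) and 2(N-1)+1) is an assumption taken from the ssaadm display in Step 4, where tray 2 holds drives 2,x and 3,x.

```sh
# Hypothetical helper: list submirrors that have a component in a given tray.
# Reads metastat output (for example the temporary files from Step 7) on stdin.
# Assumption: tray N holds targets 2(N-1) and 2(N-1)+1, as in the ssaadm
# display shown in Step 4.
submirrors_in_tray() {
    ctrl=$1                      # controller name, for example c3
    tray=$2                      # tray number, 1 to 3
    lo=$(( (tray - 1) * 2 ))     # first target in the tray
    hi=$(( lo + 1 ))             # second target in the tray
    awk -v pat="^${ctrl}t(${lo}|${hi})d[0-9]+s[0-9]+\$" '
        # Remember the most recent "<name>: Submirror of <mirror>" header.
        /Submirror of/ { name = $1; sub(/:.*$/, "", name) }
        # Print the submirror and the slice when the slice is in the tray.
        $1 ~ pat && name != "" { print name, $1 }
    '
}
```

For example, running `submirrors_in_tray c3 2 < /tmp/logicalhost1.stat` against the sample output in Step 7 would print logicalhost1/d35 and logicalhost1/d54 with their c3t3d3 slices. The sketch only tracks submirror headers, so check its results against the full metastat output before detaching or offlining anything.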
4.2.2 Overview of Local Disk Replacement
- In both the SPARCcluster 2000 and SPARCcluster 1000 configurations there are at least two local disks. One of the local disks is the boot disk which contains the Solaris operating environment. The other local disk contains your other local data.
- Replacement of the boot disk is a difficult procedure and is detailed in "How to Replace a Failed Local Boot Disk" on page 4-13. The procedure in that section covers the replacement of the failed local disk that contains the Solaris operating environment. In that procedure, the local disk on host1 has failed.
-
Note - Depending on the severity of your boot disk failure, you may not be able to perform all the steps. If a server is already down, you may omit steps 1, 2, and 3.
- The procedure for replacing a local disk that is not the boot disk is covered in "How to Replace a Failed Local Non-Boot Disk" on page 4-17.
· How to Replace a Failed Local Boot Disk
-
Note - This procedure expects that you installed and configured the two Solstice HA servers identically. This allows you to copy the various configuration files from the sibling rather than using backup tapes.
-
-
Switch ownership of both logical hosts to one Solstice HA server using a command similar to the one shown.
If a takeover has been initiated by the sibling, it will not be necessary to run this command.
-
host2# haswitch host2 logicalhost1 logicalhost2
|
-
-
Shut down the Solstice HA services on the host with the failed local disk.
-
host1# /etc/init.d/SUNWhadf stop
|
-
-
Halt the server that has the failed local disk.
-
-
-
Delete the server from the disksets on the sibling server.
Each of these commands may take several minutes to execute.
-
host2# metaset -s logicalhost1 -f -d -h host1
host2# metaset -s logicalhost2 -f -d -h host1
|
-
-
Perform the disk replacement using the procedure in the SPARCcluster High Availability Server Service Manual.
-
Install the Solaris operating environment using the instructions in the SPARCcluster High Availability Software Planning and Installation Guide. Make sure you install the same software clusters (packages) that are installed on the sibling server.
-
Note - Select the Do Not Reboot option during installation. This will allow you to restore some files to the new root slice in /a before doing the reconfiguration reboot.
- During installation, numerous Disk Reserved messages will be displayed because the sibling host owns all the SPARCstorage Array disks. These messages can safely be ignored.
-
-
Edit the /a/etc/nsswitch.conf file, changing the hosts line to specify files first.
Later you can copy the /etc/nsswitch.conf file from the sibling, but the hosts line must specify files first for the procedure you are performing here to work.
-
-
Restore a copy of the /etc/path_to_inst file and place it in /a/etc/path_to_inst.
Because the two servers were installed identically, the file can be copied from the sibling.
-
Restore a copy of the /etc/system file and place it in /a/etc/system. This file contains modifications for Solstice HA operation.
-
-
(Optional) Edit the /a/etc/default/login file to allow root logins on terminals other than the console.
-
-
Add all host names, all private host names, and all logical host names to the /a/etc/hosts file.
Refer to Chapter 7 of the SPARCcluster High Availability Software Planning and Installation Guide for instructions.
-
Configure the private networks. Run the ifconfig command as shown below.
-
host1# ifconfig be0 plumb
host1# ifconfig be0 host1-priv1 netmask + broadcast + -trailers up
|
- Refer to Chapter 7 of the SPARCcluster High Availability Software Planning and Installation Guide for additional information.
-
-
Copy the configuration files from the sibling.
Use the following commands to copy the network configuration files from the sibling.
-
host1# rcp -p host2-priv1:/etc/nsswitch.conf /a/etc
host1# rcp -p host2-priv1:/etc/syslog.conf /a/etc
host1# rcp -p host2-priv1:/etc/netmasks /a/etc
host1# rcp -p host2-priv1:/kernel/drv/md.conf /a/kernel/drv
host1# rcp -p host2-priv1:/.rhosts /a/.rhosts
host1# rcp -p host2-priv1:/.profile /a (optional command)
|
-
-
(Optional) Copy the appropriate entries from the old vfstab file from tape and make mount points for formerly mounted file systems.
-
(Optional) Restore the crontab file from backup tape.
-
(Optional) Enable core dumps in /a/etc/init.d/sysetup.
-
-
(Optional) Restore /a/etc/resolv.conf (if you are using DNS).
-
Reboot the server.
-
-
-
Install the Solstice DiskSuite and Solstice HA packages and recommended patches.
-
Restore the /etc/hostname.* files for both private and secondary public networks.
If these files do not exist, you must re-create them by using the instructions in Chapter 7 of the SPARCcluster High Availability Software Planning and Installation Guide.
-
Execute a reconfiguration reboot.
The reconfiguration reboot builds the appropriate device special inodes for Solstice DiskSuite. To execute a reconfiguration reboot, enter the following:
-
-
-
Add the three replicas back on slice 4 of the new boot disk.
In this example /dev/dsk/c0t0d0s4 is used.
-
host1# metadb -afc 3 /dev/dsk/c0t0d0s4
|
-
-
Add the server to the disksets.
Note that the following commands are executed on host2.
-
host2# metaset -s logicalhost1 -a -h host1
host2# metaset -s logicalhost2 -a -h host1
|
-
-
Copy the Solstice HA configuration files from the sibling.
Enter the following commands from the sibling host (host2) to copy the appropriate files.
-
host2# cd /etc/opt/SUNWhadf/hadf
host2# rcp -p cmm_confcdb hadfconfig host1-priv1:/etc/opt/SUNWhadf/hadf
host2# rcp -p hafmconfig vfstab.* host1-priv1:/etc/opt/SUNWhadf/hadf
host2# cd /etc/opt/SUNWhadf/nfs
host2# rcp -p dfstab.* host1-priv1:/etc/opt/SUNWhadf/nfs
|
-
-
Run the hacheck(1M) command.
Instructions for using this command can be found in Chapter 11 of the SPARCcluster High Availability Software Planning and Installation Guide.
-
Create a hard link from /etc/rc3.d/S20SUNWhadf to /etc/init.d/SUNWhadf.
This link automatically starts Solstice HA when the server is brought up in multi-user mode. It will not automatically start Solstice HA when the server is brought up in single user mode.
-
host1# ln /etc/init.d/SUNWhadf /etc/rc3.d/S20SUNWhadf
|
-
-
Start the Solstice HA services.
-
host1# /etc/init.d/SUNWhadf start
|
-
-
Switch the logical host back to its default master.
-
host1# haswitch host1 logicalhost1
|
· How to Replace a Failed Local Non-Boot Disk
- This procedure covers the replacement of the failed local disk that does not contain the Solaris operating environment. In this example, host2 has the disk that failed.
-
-
Switch ownership of both logical hosts to the server that is not experiencing problems. Use a command similar to the following:
-
host1# haswitch host1 logicalhost1 logicalhost2
|
-
-
Shut down the Solstice HA services on the server that is having problems.
-
host2# /etc/init.d/SUNWhadf stop
|
-
-
Locate any local metadevice state database replicas that may have been placed on the problem disk. Use the metadb command to find the replicas.
Errors may be reported for the replicas located on the failed disk. In this example, c0t1d0 is the problem device.
-
host2# metadb
flags first blk block count
a m u 16 1034 /dev/dsk/c0t0d0s4
a u 1050 1034 /dev/dsk/c0t0d0s4
a u 2084 1034 /dev/dsk/c0t0d0s4
W pc luo 16 1034 /dev/dsk/c0t1d0s4
W pc luo 1050 1034 /dev/dsk/c0t1d0s4
W pc luo 2084 1034 /dev/dsk/c0t1d0s4
host2#
|
- The output above shows three metadevice state database replicas on slice 4 of each of the local disks, c0t0d0s4 and c0t1d0s4. The W in the flags field of the c0t1d0s4 slice indicates the device has write errors.
-
-
Make a record of the slice name where the replicas reside and the number of replicas, then delete the metadevice state databases.
The number of replicas is obtained by counting the number of appearances of a slice in metadb output in Step 3. In this example, we are deleting the three replicas that exist on c0t1d0s4.
-
host2# metadb -d /dev/dsk/c0t1d0s4
|
-
-
Shut down Solaris and turn off the server.
-
-
-
Perform the disk replacement using the procedure in the SPARCcluster High Availability Server Service Manual.
-
Turn the server on and reboot it in single user mode.
-
-
-
Repartition the new disk with the same slice information as the failed disk.
-
Run newfs(1M) on the new slices to create file systems.
-
-
-
Mount the appropriate file systems.
-
-
Restore data from backup tapes.
-
If you deleted replicas in Step 4, add the same number back to the appropriate slice.
In this example, /dev/dsk/c0t1d0s4 is used.
-
host2# metadb -ac 3 /dev/dsk/c0t1d0s4
|
-
-
Reboot the server.
-
-
-
When the host has rejoined the Solstice HA configuration (this usually takes about one minute), switch the logical host back to its default master.
-
host2# haswitch host2 logicalhost2
|
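Step 4 of this procedure obtains the number of replicas by counting how often each slice appears in the metadb output. The following is a minimal sketch of that count; the function name count_replicas is hypothetical, and it assumes the local metadb output format shown in Step 3, where each replica line ends with a /dev/dsk path.

```sh
# Hypothetical helper: count replicas per slice in metadb output read on stdin.
count_replicas() {
    awk '
        # A replica line ends with the slice path; the header line does not.
        $NF ~ /^\/dev\/dsk\// { n[$NF]++ }
        # Print one "slice count" pair per slice.
        END { for (s in n) print s, n[s] }
    ' | sort
}
```

Running `metadb | count_replicas` before Step 4 and saving the result gives a record of the slice names and counts to restore later in the procedure.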
4.3 SPARCstorage Array Maintenance
- Maintenance of the SPARCstorage Arrays in a SPARCcluster 1000 or SPARCcluster 2000 configuration involves the following:
-
- Recovering from power loss
- Repairing a lost connection
- Replacing a failed SPARCstorage Array (changing the World Wide Name)
- Removing a SPARCstorage Array tray
- Replacing a SPARCstorage Array tray
- Replacing failed SPARCstorage Array components (disk, battery, backplane, controller, optical module, fan tray, or fibre channel cable)
4.3.1 Recovering From Power Loss
- When power is lost to one SPARCstorage Array, I/O operations to the submirrors, hot spares, and metadevice state database replicas will generate Solstice DiskSuite errors. The errors are reported at the slice level rather than the drive level. Errors are not reported until I/O operations are made to the disk. Hot spare activity may be initiated if affected devices have assigned hot spares.
- You must monitor the configuration for these events using hastat(1M) and metastat(1M) as explained in Chapter 3, "Monitoring the Solstice HA Servers."
- When power is restored, you will use the metastat command to identify the errored devices. Errored devices are returned to service using the command:
-
# metareplace -s logicalhost -e metamirror component
|
- The -e option transitions the state of component to the available state and resyncs the failed component.
-
Note - Components that have been replaced by a hot spare should be the last devices replaced using the metareplace command. If the hot spare is replaced first, it could replace another errored submirror as soon as it becomes available.
- A resync can be performed on only one component of a submirror (metadevice) at a time. If all components of a submirror were affected by the power outage, each component must be replaced separately. It takes approximately 10 minutes for a resync to be performed on a 1.05-Gbyte disk.
- If both disksets in a symmetric configuration were affected by the power outage, a resync can be run on the affected submirrors concurrently by logging into each host separately and running metareplace.
- Depending on the number of submirrors and the number of components in these submirrors, the resync actions can require a considerable amount of time. At roughly 10 minutes per component, a single submirror made up of 30 1.05-Gbyte drives might take about five hours (30 x 10 minutes) to complete. A more realistic configuration made up of five-component submirrors might take only 50 minutes (5 x 10 minutes) to complete.
- After the loss of power, all metadevice state database replicas on the affected SPARCstorage Array chassis will enter an errored state. While these will be reclaimed at the next takeover (haswitch or reboot(1M)) you may want to manually return them to service by first deleting and then adding them back as metadevices. Because metadevice state database replica recovery is not automatic, it is safest to manually perform the recovery immediately after the SPARCstorage Array returns to service. Otherwise, a new failure may cause a majority of replicas to be out of service and cause a kernel panic. This is the expected behavior of Solstice DiskSuite when too few replicas are available.
-
Note - Make sure you add back the same number of replicas that were deleted on each slice. Multiple replicas can be deleted with a single metadb command, but it may require multiple invocations of metadb -a to add back the replicas deleted by a single metadb -d. This is because multiple replicas on one slice must be added in a single invocation of metadb using the -c flag.
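Because each slice needs its own metadb -a invocation with the right -c count, it can help to generate the commands from the record made before the replicas were deleted. The following is a minimal sketch under stated assumptions: the function name readd_commands and the record format (one "slice count" pair per line) are hypothetical, and the function only prints the commands so they can be reviewed before being run.

```sh
# Hypothetical helper: print one "metadb -a -c" command per recorded slice.
# Reads "slice count" pairs on stdin; an optional first argument names the
# diskset and is passed with -s, as in the diskset examples in this chapter.
readd_commands() {
    setname=${1:-}
    while read -r slice count; do
        # Skip blank lines in the record.
        [ -n "$slice" ] || continue
        if [ -n "$setname" ]; then
            echo "metadb -s $setname -a -c $count $slice"
        else
            echo "metadb -a -c $count $slice"
        fi
    done
}
```

For a record line such as `/dev/dsk/c0t1d0s4 3`, the function prints `metadb -a -c 3 /dev/dsk/c0t1d0s4`. Printing rather than executing keeps the operator in control of when the replicas are actually added.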
4.3.2 Repairing a Lost Connection
- When a connection from a SPARCstorage Array to one of the hosts fails, the failure is probably due to either a fiber optic cable or an SBus FC/S or FC/OM card.
- In either event, the host on which the failure occurred will begin generating errors when the failure is discovered. This takes about one minute. Later accesses to the SPARCstorage Array will generate additional errors. The host will exhibit the same behavior as though power had been lost to the SPARCstorage Array.
- In symmetric configurations, I/O operations from the other host to the SPARCstorage Array are unaffected by this type of failure.
- To diagnose the failure, inspect the SPARCstorage Array's display. The display will show whether the A or B connection has been lost.
- To replace the cable, use the following procedure. In this example, the connection to host2 from one SPARCstorage Array must be replaced.
· How to Repair a Lost Connection
-
-
Replace the failed cable.
Refer to the SPARCstorage Array Model 100 Series Service Manual for detailed instructions.
-
Recover from Solstice DiskSuite errors as described in Section 4.3.1, "Recovering From Power Loss."
Solstice DiskSuite cannot distinguish a lost connection to a SPARCstorage Array from a loss of power to it, so the recovery procedure is the same.
4.3.3 Changing a SPARCstorage Array World Wide Name
- Some SPARCstorage Array failures may make it necessary to replace the entire chassis.
- These failures can be caused by a faulty controller or other reasons. To guard against this type of failure, all metadevices have been set up with only one submirror of a mirror on a SPARCstorage Array chassis. Thus, loss of data will not occur with this type of failure.
- The SPARCstorage Array controller has a unique identifier known as the World Wide Name (WWN). The WWN is like the host ID stored in the host IDPROM of a desktop SPARCstation. The last four digits of the SPARCstorage Array WWN are displayed on the LCD panel of the chassis. The WWN is part of the /devices path associated with the SPARCstorage Array and its component drives.
- When you replace the SPARCstorage Array chassis in a Solstice HA configuration, you can change the WWN of the replacement chassis to be that of the chassis you are replacing. This may be easier than reconfiguring Solstice DiskSuite.
- If the SPARCstorage Array controller or the entire chassis must be replaced, the Solstice HA servers will discover the new WWN when they are rebooted. This confuses the identity of disks within a diskset. To avoid this potential confusion, the WWN of the new controller can be changed to the WWN of the old controller. (This is similar to swapping the IDPROM when replacing a System Board in a desktop SPARCstation.)
· How to Change a SPARCstorage Array World Wide Name
-
-
Determine the symbolic link value of the SPARCstorage Array.
Assuming that controller c1 failed, enter the following ls(1) command. The command will report the symbolic link value of the SPARCstorage Array.
-
# ls -l /dev/rdsk/c1t0d0s0
lrwxrwxrwx 1 root root 92 Jun 25 12:11 /dev/rdsk/c1t0d0s0 -> ../../devices/io-unit@f,e0200000/sbi@0,0/SUNW,soc@3,0/SUNW,pln@a0000000,7412bf/ssd@0,0:a,raw
|
- Another way to discover the WWN is with the ssaadm command. When you run ssaadm(1M) with the display option and specify a controller, all the information about the SPARCstorage Array is displayed. The serial number reported by ssaadm is the WWN.
-
-
Obtain the WWN from the pln path.
The WWN is the last 12 hexadecimal digits of the path component containing the characters pln from the symbolic link value of the SPARCstorage Array (not including commas).
-
-
-
Change the WWN.
Use the ssaadm command to change the WWN. For example, the following command would change the WWN to 0000007411f3.
-
# ssacli -s -w 0000007411f3 download c1
|
-
Note - The leading zeros must be entered as part of the WWN to make a total of 12 digits.
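The rule for obtaining the WWN from the pln path (take the pln component, drop the commas, keep the last 12 hexadecimal digits) can be sketched as plain text processing. The function name is hypothetical, the sample path is the symbolic link value shown in the ls output of this procedure, and a POSIX shell with sed and tr is assumed.

```shell
# Extract the WWN from a /devices path (text processing only).
wwn_from_path() {
    # Isolate the text between "pln@" and the next "/", drop the
    # commas, then keep the last 12 characters.
    printf '%s\n' "$1" |
        sed -n 's/.*pln@\([0-9a-fA-F,]*\)\/.*/\1/p' |
        tr -d ',' |
        sed 's/.*\(.\{12\}\)$/\1/'
}

wwn_from_path '../../devices/io-unit@f,e0200000/sbi@0,0/SUNW,soc@3,0/SUNW,pln@a0000000,7412bf/ssd@0,0:a,raw'
```

For the sample path this prints 0000007412bf, which already includes the leading zeros required to make 12 digits.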
4.3.4 Removing a SPARCstorage Array Tray
- Before removing a SPARCstorage Array tray, you must halt all I/O and spin down all drives in the tray. The drives automatically spin up if I/O requests are made. Thus, it is necessary to stop all I/O before the drives are spun down.
- Stop Solstice DiskSuite I/O activity with the metaoffline(1M) command, which takes the submirror offline. (The metadetach(1M) command could be used to stop the I/O, but the resync cost is greater.) When the submirrors on a tray are taken offline, the corresponding mirrors will only provide one-way mirroring (that is, there will be no data redundancy). When the mirror is brought back online, an automatic resync occurs.
- Use the metastat(1M) command to identify all submirrors containing slices on the tray to be removed. Also, use the metadb(1M) command to identify any replicas on the tray. Any available hot spare devices must also be identified and the associated submirror identified using the metahs(1M) command.
- With all affected submirrors offline, I/O to the tray will be stopped.
- The ssaadm command is used to spin down the tray. When the tray lock light is out, the tray may be removed and the required task performed.
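The order of operations for tray removal (take the affected submirrors offline, then spin down the tray) can be sketched as a dry-run helper that only prints commands. Everything named here is hypothetical: the function, the mirror and submirror names d10/d11 and d20/d21, tray 1, and controller c1. The ssaadm stop -t form is an assumption about the ssaadm syntax, so check ssaadm(1M) before using it. Handling of replicas and hot spares is left out, and a POSIX shell is assumed.

```shell
# Dry-run sketch: print the commands for taking a tray out of service.
# Mirror/submirror names, tray number, and controller are hypothetical.
tray_removal_cmds() {
    tray=$1
    ctlr=$2
    shift 2
    # Remaining arguments are mirror:submirror pairs with slices on the tray.
    for pair in "$@"; do
        echo "metaoffline ${pair%%:*} ${pair##*:}"
    done
    # Assumed ssaadm syntax; verify against ssaadm(1M).
    echo "ssaadm stop -t $tray $ctlr"
}

tray_removal_cmds 1 c1 d10:d11 d20:d21
```

Printing the spin-down command last mirrors the requirement that all I/O be stopped before the drives are spun down.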
4.3.5 Replacing a SPARCstorage Array Tray
- When you have completed work on a SPARCstorage Array tray, replace the tray in the chassis. The disks will automatically spin up. However, if the disks fail to spin up, you can use the ssaadm command to manually spin up the entire tray. There is a short delay (several seconds) between starting drives in the SPARCstorage Array.
- After the disks have spun up, you must place online all the submirrors that were taken offline. When the metaonline(1M) command is run, an optimized resync operation automatically brings the submirrors up to date. The optimized resync copies only the regions of the disk that were modified while the submirror was offline. This is typically a very small fraction of the submirror capacity. You must also replace all metadevice state database replicas (metadb(1M)) and add back hot spares (metahs(1M)).
-
Note - If you used metadetach(1M) to detach the submirror rather than metaoffline, the entire submirror must be resynced. This typically takes about 10 minutes per Gbyte of data.
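As a rough illustration of the full-resync cost, the figure of about 10 minutes per Gbyte can be turned into a quick estimate. The function name and the 2-Gbyte size are hypothetical, and the result is only as accurate as the rule of thumb itself. A POSIX shell is assumed.

```shell
# Rough full-resync time using the 10 minutes per Gbyte figure above.
full_resync_minutes() {
    gbytes=$1    # submirror size in Gbytes (whole number)
    echo $(( gbytes * 10 ))
}

full_resync_minutes 2    # a hypothetical 2-Gbyte submirror
```

For a 2-Gbyte submirror the estimate is 20 minutes, compared with the much shorter optimized resync that follows metaoffline and metaonline.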
4.3.6 Replacing SPARCstorage Array Components
- The SPARCstorage Array components that can be replaced include the disks, fan tray, battery, tray, power supply, backplane, controller, optical module, and fibre channel cable.
- Some of the SPARCstorage Array components can be replaced without powering down the SPARCstorage Array. Other components require the SPARCstorage Array to be powered off. Consult the SPARCstorage Array documentation for details.
- To replace SPARCstorage Array components that require power off without interrupting Solstice HA services, perform the tray removal steps for all three trays in the SPARCstorage Array before turning off the power. This includes taking submirrors offline, deleting hot spare devices from hot spare pools, deleting metadevice state database replicas from drives, and spinning down the three trays.
- After these preparations, the SPARCstorage Array can be powered down and the components replaced.
-
Note - Because the SPARCstorage Array controller contains a unique World Wide Name, which identifies it to Solaris, special procedures apply for SPARCstorage Array controller replacement. Contact your service provider for assistance.
- After component replacement and power on, follow the tray replacement procedures for all three trays.
4.4 Replacing Network Cables and Interfaces
- There are three types of failures that require the replacement of network cables and interfaces. These include:
-
- Public or client Ethernet cable failure
- Private network cable failure
- Public or private Ethernet interface failure
- The following procedures provide instructions for these replacements.
· How to Replace a Public or Client Ethernet Cable
-
-
Switch ownership of both logical hosts to the Solstice HA server that does not need an Ethernet cable replaced.
For instance, if the cable is being replaced on host1, enter the following:
-
host1# haswitch host2 logicalhost1 logicalhost2
|
-
-
Replace the cable using the appropriate hardware instructions in the SPARCcluster High Availability Server Service Manual.
-
-
Switch ownership of the logical hosts back to the appropriate default master.
For instance:
-
host1# haswitch host1 logicalhost1
|
· How to Replace a Private Network Cable
- When a private network cable fails, both servers will be aware that a private network connection is not working. The Solstice HA services should not be affected because of the second private network cable.
-
-
Unplug the faulty Ethernet cable and replace it with a new one.
You can use either Sun Microsystems' replacement parts number 530-2149 or 530-2150. If you are not using standard Sun parts, be sure the replacement Ethernet cable has the pairs crossed. Refer to Appendix B of the SPARCcluster High Availability Server Service Manual for cable information.
· How to Replace a Public or Private Ethernet Interface
-
-
Switch ownership of both logical hosts to the Solstice HA server that does not need an Ethernet interface replaced.
For instance, if the interface is being replaced on host1, enter the following:
-
host1# haswitch host2 logicalhost1 logicalhost2
|
-
-
Shut down the Solstice HA software on host1.
-
host1# /etc/init.d/SUNWhadf stop
|
-
-
Halt the server.
-
-
-
Power off the server.
-
-
Replace the appropriate public or private Ethernet interface using the instructions in the SPARCcluster High Availability Server Service Manual.
-
Power on the server.
The server will automatically rejoin the Solstice HA configuration.
-
Switch ownership of the logical hosts back to the appropriate default master.
For example:
-
host1# haswitch host1 logicalhost1
|
4.5 System Board Replacement
- The Solstice DiskSuite component of Solstice HA is sensitive to the device numbering and can become confused if System Boards are moved around.
- When the server is booted initially, the SPARCstorage Array entries in the /dev directory are tied to the connection slot.
- For example, when the server is booted, System Board 0 and SBus slot 1 will be part of the identity of the SPARCstorage Array. If the board or SBus card is shuffled to a new location, Solstice DiskSuite will be confused because Solaris will assign new controller numbers to the SBus controllers when they are in a new location.
-
Note - The SBus cards can be moved as long as the type of SBus card in a slot remains the same.
- Shuffling the fiber cables that lead to the SPARCstorage Arrays can also create problems. The System Boards on each of the Solstice HA servers must be configured identically (that is, the same type of SBus cards in each slot). When SBus cards are switched, you must also reconnect the SPARCstorage Arrays to the same SBus slots they were connected to before the changes.
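One way to check that device numbering survived hardware work is to record where the /dev entries point before the change and compare afterward. The sketch below is an illustration only: the function name and the snapshot file names are hypothetical, and it assumes a POSIX shell with ls, awk, and cmp, and link names without embedded spaces.

```shell
# Print "link target" for each device link given as an argument.
snapshot_links() {
    for link in "$@"; do
        # In ls -l output for a symbolic link, the last three fields
        # are the link name, "->", and the target.
        ls -l "$link" | awk '{ print $(NF-2), $NF }'
    done
}

# Before the hardware work (snapshot file names are hypothetical):
#   snapshot_links /dev/rdsk/c*t0d0s0 > /var/tmp/links.before
# After the server is back up:
#   snapshot_links /dev/rdsk/c*t0d0s0 > /var/tmp/links.after
#   cmp -s /var/tmp/links.before /var/tmp/links.after || echo "device paths changed"
```

If the two snapshots differ, a controller number or /devices path has changed and the cabling and card placement should be rechecked before Solstice HA is restarted.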
4.6 Replacing SBus Cards
- Replacement of SBus cards in Solstice HA servers can be done by switching over the data services to the server that is functioning and performing the hardware procedure to replace the board. The logical hosts should be switched back to the default masters following the procedure.
· How to Replace an SBus Card
-
-
Switch ownership of both logical hosts to the Solstice HA server that does not need an SBus card replaced.
For instance, if the card is being replaced on host2, enter the following:
-
host1# haswitch host1 logicalhost1 logicalhost2
|
-
-
Stop Solstice HA on the affected server.
The SUNWhadf stop command must be run on the host that has the failed SBus card.
-
host2# /etc/init.d/SUNWhadf stop
|
-
-
Halt the affected server.
-
-
-
Power off the server.
-
Perform the hardware replacement procedure.
Refer to the hardware service manual that covers replacement of the specific SBus card.
-
Power on the server.
The server will automatically rejoin the Solstice HA configuration.
-
Switch the logical hosts back to the default masters.
-
host1# haswitch host2 logicalhost2
|