9 Metadevice Creation on Local and Multi-host Disks
- This chapter contains the information necessary for creating metadevices on multi-host and local disks.
- Use the following table to locate specific information in this chapter.
-
9.1 Overview of Tasks
- After configuring the Solstice HA systems and disksets in Chapter 8, "Creating the Configuration," you are ready to create metadevices on the local and multi-host disks.
- Refer to the Solstice DiskSuite 4.0 Administration Guide and Solstice DiskSuite Tool 4.0 User's Guide to perform the procedures described in this chapter. These manuals are delivered as part of the SPARCcluster High Availability documentation set.
-
Note - When hasetup(1M) completed the Solstice HA configuration described in Chapter 8, it released ownership of the disksets. You must take ownership of each diskset using metaset(1M) before running metatool(1M) or the Solstice DiskSuite command line utilities to set up metadevices.
- For instance, if you are on host1 and want to take ownership of the diskset named logicalhost1, you would enter the following:
-
host1# metaset -s logicalhost1 -t
|
- After taking ownership, if you are using the metatool graphical user interface, you would enter the following:
-
host1# metatool -s logicalhost1
|
9.2 Metadevice Considerations
- Before creating the metadevices in your Solstice HA configuration, you should read the following subsections. These subsections offer a discussion of the following considerations:
-
- Diversity
- Availability
- Performance
- Takeover and metadevice state database replica issues
- Managing the metadevice name space
9.2.1 Diversity of the Configuration
- To create a Solstice HA configuration that provides the most highly available service, the configuration must be diverse. To achieve this diversity, consider the points where interconnections can cause a failure. These include:
-
- Connections of the disks to the strings
- Connections from the strings to the SPARCstorage Array chassis
- Connections of the SPARCstorage Array chassis through the fibre channel to the SBus card
- Connections from the SBus card into the System Board
- Connections from the System Board into the processor
- These connections can be considered as a "tree of interconnections" that has the processor at the highest level and the disks at the lowest level.
- A group of components, such as disk slices, could be said to have diversity up to but not including the level where their subtrees intersect. For example, two drives on the same string have drive diversity, but they do not have string diversity.
- Two drives inside two different SPARCstorage Array chassis have drive diversity, string diversity, and chassis diversity. They may or may not have SBus diversity, depending on whether they share the same SBus card. If they are on different SBus cards, they again may or may not have System Board diversity.
- Remember that the same interconnection tree usually defines another relationship related to robustness, that being fault dominance. In particular, a fault on a specific piece of hardware in the tree typically dominates all other pieces of hardware below it in the tree. For example, the piece of hardware representing a SPARCstorage Array chassis dominates the six disk strings within the chassis as well as the disks on those strings. Thus a failure in the SPARCstorage Array chassis (such as a power supply failure) also causes the strings and disks below it to fail. If the server has sufficient availability to tolerate the failure of a dominant component, it will also tolerate the failure of the dominated components.
- You can often focus your attention on the larger components in a configuration such as the SPARCstorage Array chassis. If the failure of that component can be tolerated then in general failure of all components below it in the tree can also be tolerated.
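- The diversity rule above can be checked mechanically. The following is a minimal sketch, not part of Solstice HA or Solstice DiskSuite: it assumes each drive is described by a hypothetical tuple of component identifiers, one per level of the tree, and reports the levels at which two drives are diverse.

```python
# Levels of the interconnection tree, from the processor side down to the drive.
LEVELS = ("system_board", "sbus_card", "chassis", "string", "drive")


def diversity(path_a, path_b):
    """Return the levels at which two drives are diverse.

    Each path is a tuple of hypothetical component identifiers, one per
    entry in LEVELS, ordered from the top of the tree down to the drive.
    """
    for depth, (a, b) in enumerate(zip(path_a, path_b)):
        if a != b:
            # The subtrees separate here, so the drives are diverse at this
            # level and at every level below it.
            return LEVELS[depth:]
    return ()  # identical paths: the same drive, no diversity


# Two drives on the same string: drive diversity, but no string diversity.
same_string = diversity(("sb0", "sbus0", "ssa0", "str0", "d0"),
                        ("sb0", "sbus0", "ssa0", "str0", "d1"))

# Two drives in different chassis that share one SBus card.
two_chassis = diversity(("sb0", "sbus0", "ssa0", "str0", "d0"),
                        ("sb0", "sbus0", "ssa1", "str0", "d0"))

print(same_string)
print(two_chassis)
```

- The identifiers (sb0, sbus0, ssa0, and so on) are placeholders; in practice they would come from your own record of how the hardware is cabled.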
9.2.2 Availability of the Configuration
- In a Solstice HA configuration the primary goal should be availability. The basis of availability is that the data must be mirrored to survive disk failures. This is because disk failures could result in permanent loss of access to the data. The submirrors should be placed in such a way that other failures such as controllers or power supplies, which result in loss of access to the data, will be tolerated.
- A disk failure usually means the data on that piece of media is lost while a controller failure means that only access to the data is lost until the controller is replaced. Data generally becomes available when the access failure is repaired. That is because the controller, string, and other associated hardware do not contain any essential user data. They serve as conduits for the user data. However, this is not true of the drive itself which contains the user data.
- There are situations where a controller failure (ordinarily considered an access failure) results in lost data. For instance, scribbling can occur on the disk. Usually these types of data losses are rare.
- The key to protecting against loss of data is mirroring. The two submirrors should be separated sufficiently in the System Board, SBus, and string space to survive all failures. For access to a mirror to survive the failure of a specific component the submirrors must have diversity at the level of the failed component.
- Sufficient availability should be provided if the submirrors connected to a Solstice HA server are on separate SPARCstorage Array chassis. You could carry the analysis to the SBus card or System Board level, but these components are dominated by power supply failures, either internal or through power loss. You are protected against that type of failure by the failover mechanism.
- It is important to remember that the faults protected by the Solstice DiskSuite software are not usually visible to the applications or the Solstice HA fault monitor. Thus, you must monitor and promptly repair the problems in order to retain availability.
- To assess a mirror, compare the components (slices) of one submirror, including its hot spares, with the components of the other submirror and find their common parents in the diversity tree. A failure at the junction where the two submirrors meet will result in the loss of access to the data. If that common junction is a drive (mirroring two slices on the same drive), then loss of that drive will result in loss of data, not just loss of access. This is the fundamental rule guiding metadevice availability.
- It is important to consider the faults of interest. The easiest are disk failures and SPARCstorage Array chassis failures.
- When a SPARCstorage Array tray fails, the online repair involves removing the tray, which introduces at least an artificial access failure while the tray is removed.
9.2.3 Performance Issues
- You can enhance the disk I/O bandwidth by properly setting up metadevices. Disk striping can also increase performance by distributing sequential I/O operations across several disk devices. The key to performance here is also the "distance" between the disk devices in the System Board, SBus, and string space (see "Diversity of the Configuration" on page 9-2).
- Maximizing the diversity by pushing it to the highest "level" in the tree between the slices within a stripe is the way to improve performance for single-threaded sequential I/O. The result will be that most of the pieces of the server will be working in parallel. This is based on a single-threaded sequential reference model, typically with large transfer sizes. If you know your application's typical transfer length, you should plan your stripes so that the stripe width (the number of slices in the stripe) multiplied by the interleave value roughly equals the typical transfer length. For example, if the typical transfer length is 64 Kbytes, an 8-way stripe with an 8 Kbyte interleave value might be best. Alternatively, you could create a 4-way stripe with a 16 Kbyte interleave value.
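- The sizing rule in the preceding paragraph is simple arithmetic: stripe width (the number of slices in the stripe) times interleave value should roughly equal the typical transfer length. The sketch below enumerates the combinations that match exactly. The list of candidate interleave values is an assumption chosen for illustration, not a list of values required by Solstice DiskSuite.

```python
def stripe_options(transfer_kbytes, interleaves_kbytes=(8, 16, 32, 64)):
    """List (stripe width, interleave in Kbytes) pairs whose product equals
    the typical transfer length. The candidate interleaves are illustrative."""
    options = []
    for interleave in interleaves_kbytes:
        width, remainder = divmod(transfer_kbytes, interleave)
        # A stripe needs at least two slices to spread the transfer.
        if remainder == 0 and width >= 2:
            options.append((width, interleave))
    return options


options = stripe_options(64)
for width, interleave in options:
    print(f"{width}-way stripe, {interleave} Kbyte interleave")
```

- For a 64 Kbyte transfer length the candidates include the 8-way/8 Kbyte and 4-way/16 Kbyte layouts mentioned above. Which one performs best still depends on the hardware and the workload.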
- Naturally, if your site is serious about performance you will perform benchmark studies, where the parameters can be varied and an ideal value for the expected reference pattern could be found.
- If you have multiple threads of reference, as is the case with most time sharing systems like UNIX, you may have some inherent concurrency that operates independently of the striping of the disks. You may end up with several threads of reference, some of which may be sequential, but taken together appear somewhat random over time. Still, for things like UFS, these references do have a typical transfer size that should be factored into the striping parameters. This usually depends on the UFS block size, which is often the page size of the host architecture, and the MAXCONTIG UFS parameter set via the tunefs(1M) command.
- Database applications probably have similar typical transfer lengths that will dictate the striping parameters you use.
- An analysis of the common parent (see "Diversity of the Configuration" on page 9-2) will reveal potential performance bottlenecks. All the hardware components of the tree below the common parent presumably can operate independently. Only at the common parent does interference begin. For example, two drives on the same string can operate independently so far as the disk arm position is concerned. This is because a seek operation on one disk does not affect a seek on another disk, but transfer operations on the string can interfere and affect performance. This interaction may not reduce performance, but reduced performance is the typical result.
9.2.4 Takeover and Metadevice State Database Issues
- The most important issue to consider about metadevice state database replicas is to ensure that enough replicas are available after the various possible faults have occurred. There must be a majority of replicas which survive a particular fault or the disksets cannot be taken over. This applies at boot time also, when a takeover is uncontested. There is also a requirement that a majority of the drives in the diskset are available.
- The easiest way to ensure there are an adequate number of replicas is to evenly distribute the disks across a set of three or more SPARCstorage Array chassis. If that is done for N chassis, then in the limit (N-1)/N of the drives will be available if one of the N chassis should fail. This results in a majority for all N>2. The default configuration of three SPARCstorage Arrays meets this condition.
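- The (N-1)/N argument can be verified with a few lines of arithmetic. This sketch assumes the drives are spread perfectly evenly across the chassis, which is the limiting case described above.

```python
from fractions import Fraction


def surviving_fraction(chassis_count):
    """Fraction of drives still reachable after one of N equally loaded
    chassis fails."""
    return Fraction(chassis_count - 1, chassis_count)


def keeps_majority(chassis_count):
    # A majority means strictly more than half of the drives.
    return surviving_fraction(chassis_count) > Fraction(1, 2)


for n in (2, 3, 4):
    print(n, surviving_fraction(n), keeps_majority(n))
```

- With two chassis exactly half of the drives survive, which is not a majority. With three or more chassis a majority always survives a single chassis failure.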
- When the replicas are created by the metaset(1M) command, they are fairly evenly distributed across the SPARCstorage Array strings, up to a limit of 50 replicas, so the replica survival scenario parallels the drive survival scenario quite closely.
- The three rules that dictate metadevice state database replica operation are:
-
- A server will panic if fewer than one half of the replicas are in service at any point in time.
- A server will continue operation if exactly one half of the replicas are in service. However, a reconfiguration or takeover will fail.
- A successful takeover requires successful disk reservation on the majority of the drives in a diskset.
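- The three rules can be restated as a small decision function, which is convenient when working through "what if this chassis fails" scenarios on paper. This is an illustrative sketch only; the function names and the returned strings are made up for the example and do not correspond to any Solstice DiskSuite interface.

```python
def replica_status(in_service, total):
    """Classify server behavior according to the first two replica rules."""
    if 2 * in_service < total:
        return "panic"
    if 2 * in_service == total:
        return "continues, but reconfiguration or takeover fails"
    return "continues, replica majority present"


def reservation_majority(reserved_drives, total_drives):
    """Third rule: disk reservation must succeed on a majority of the drives."""
    return 2 * reserved_drives > total_drives


print(replica_status(2, 6))
print(replica_status(3, 6))
print(replica_status(4, 6))
print(reservation_majority(20, 30))
```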
9.2.5 Managing the Metadevice Name Space
- If you are configuring a large Solstice HA configuration you should take time to plan the number of metadevices to use. By default there are 128 unique metadevice names provided by Solstice DiskSuite in each diskset.
- Metadevice names begin with "d" and are followed by a number in the range 0 to 127. Each UFS logging (trans) device you create will use at least seven metadevice names. Thus in a large Solstice HA configuration it is possible that you will need additional metadevice names. Refer to Appendix A of the Solstice DiskSuite 4.0 Administration Guide for instructions on changing the number.
- During the planning you should:
-
- Determine the number of metadevice names needed
- Ensure there are enough names for growth
- Develop a naming convention that will allow for easy identification of devices
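- A rough name budget helps with the first two planning steps. The sketch below uses only the two figures given above (128 default names per diskset and at least seven names per trans device); the growth reserve is a hypothetical planning parameter.

```python
DEFAULT_NAMES = 128    # d0 through d127 in each diskset
NAMES_PER_TRANS = 7    # minimum per trans device, from the text above


def names_needed(trans_devices, growth_reserve=0):
    """Lower bound on metadevice names for the given number of trans devices."""
    return trans_devices * NAMES_PER_TRANS + growth_reserve


def fits_default(trans_devices, growth_reserve=0):
    return names_needed(trans_devices, growth_reserve) <= DEFAULT_NAMES


print(names_needed(10, growth_reserve=20))
print(fits_default(18), fits_default(19))
```

- Because seven is a minimum, treat the result as a lower bound. A naming convention that reserves a block of names per trans device uses up the name space faster than this count suggests.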
- Ideally you should create a metadevice naming scheme for your site that will allow you to immediately recognize the slices that are part of a certain trans device. This will help you when you are performing administration on the Solstice HA configuration.
- The following is an example naming convention that could be used for trans devices. In this convention, n can be either an empty string or a number in the range 1 to 12. Larger n could be used if you have edited the /kernel/drv/md.conf file to increase the value of nmd. This convention will allow for easy identification of metadevices.
-
-
dn0 - The trans device
-
dn1 - The UFS master mirror
-
dn2 - First submirror of UFS master mirror
-
dn3 - Second submirror of UFS master mirror
-
dn4 - The UFS log mirror
-
dn5 - First submirror of UFS log mirror
-
dn6 - Second submirror of UFS log mirror
-
dn7 - Not used (available for future mirror use)
-
dn8 - Not used (available for future mirror use)
-
dn9 - Not used (available for future mirror use)
- The example trans naming convention is illustrated in Figure 9-1.

Figure 9-1
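- The trans naming convention above is regular enough to generate and to decode. The following sketch builds the name-to-role table for a group n and recovers the role from a name. It is a helper for illustration only; the function names are hypothetical and nothing in Solstice DiskSuite enforces this convention.

```python
# Last digit of the metadevice number -> role, per the example convention.
TRANS_ROLES = {
    0: "trans device",
    1: "UFS master mirror",
    2: "first submirror of UFS master mirror",
    3: "second submirror of UFS master mirror",
    4: "UFS log mirror",
    5: "first submirror of UFS log mirror",
    6: "second submirror of UFS log mirror",
}


def trans_names(n=""):
    """Map metadevice names to roles for group n ('' or 1 to 12)."""
    return {f"d{n}{suffix}": role for suffix, role in TRANS_ROLES.items()}


def role_of(name):
    """Return (group, role) for a metadevice name such as 'd14'."""
    digits = name[1:]
    group, suffix = digits[:-1], int(digits[-1])
    return group, TRANS_ROLES.get(suffix, "not used")


group1 = trans_names(1)
print(group1)
print(role_of("d14"))
print(role_of("d127"))
```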
- In a configuration that is running the HA-ORACLE data service, the following example naming convention could be used for raw mirrored metadevices. In the following convention, n can be either an empty string or a number in the range 1 to 11. Larger n could be used if you have edited the /kernel/drv/md.conf file to increase the value of nmd. This convention will allow for easy identification of metadevices.
-
-
dn0 - Not used
-
dn1 - First mirror device
-
dn2 - First submirror of first mirror
-
dn3 - Second submirror of first mirror
-
dn4 - Second mirror device
-
dn5 - First submirror of second mirror
-
dn6 - Second submirror of second mirror
-
dn7 - Third mirror device
-
dn8 - First submirror of third mirror
-
dn9 - Second submirror of third mirror
- The example raw mirrored metadevice naming convention is illustrated in Figure 9-2.

Figure 9-2
- Using the suggested metadevice naming convention shown in Figure 9-2, you can create an /etc/opt/SUNWmd/md.tab file to set up the metadevices. The file will be referred to as the md.tab file.
-
Caution - The md.tab file can be used safely to create the initial Solstice HA configuration. However, the md.tab file should not be used when adding metadevices to the configuration. It is also important to note that the md.tab file is not automatically updated when the configuration is changed using Solstice DiskSuite commands. Refer to the Solstice DiskSuite 4.0 Administration Guide for detailed information.
- The ordering of lines in the md.tab file is not important but it is suggested you construct your file in a "top down" fashion following the diagram. The following example defines the metadevices for the diskset named logicalhost1, as required by Solstice HA conventions.
- An example md.tab file would appear as follows. Note that the # character can be used to add annotation to the file.
-
# Example md.tab file
logicalhost1/d10 -t logicalhost1/d11 logicalhost1/d14
logicalhost1/d11 -m logicalhost1/d12 logicalhost1/d13
logicalhost1/d12 1 2 c1t0d0s0 c1t0d1s0
logicalhost1/d13 1 2 c2t0d0s0 c2t0d1s0
logicalhost1/d14 -m logicalhost1/d15 logicalhost1/d16
logicalhost1/d15 1 1 c1t1d0s6
logicalhost1/d16 1 1 c2t1d0s6
|
- The first line defines the trans device d10 to consist of a master (UFS) metadevice d11 and a log device d14. The -t signifies this is a trans device; the master and log devices are implied by their position after the -t flag.
-
logicalhost1/d10 -t logicalhost1/d11 logicalhost1/d14
|
- The second line defines the master device d11 as a mirror of the metadevices d12 and d13. The -m in this definition signifies a mirror device.
-
logicalhost1/d11 -m logicalhost1/d12 logicalhost1/d13
|
- The fifth line similarly defines the log device.
-
logicalhost1/d14 -m logicalhost1/d15 logicalhost1/d16
|
- The third line defines the first submirror of the master device as a two-way stripe on two disks. Note that the usual /dev/dsk or /dev/rdsk prefix may be omitted in the md.tab file.
-
logicalhost1/d12 1 2 c1t0d0s0 c1t0d1s0
|
- The next line defines the other master submirror.
-
logicalhost1/d13 1 2 c2t0d0s0 c2t0d1s0
|
- Finally, the submirrors of the log device are defined. In this example, a simple metadevice is created for each log submirror. No stripes or concatenations are used.
-
logicalhost1/d15 1 1 c1t1d0s6
logicalhost1/d16 1 1 c2t1d0s6
|
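- Because the ordering of lines in the md.tab file is not important, it can be useful to check a hand-written file by listing its definitions in dependency order and confirming that every referenced metadevice is defined. The following sketch does this for the example file. It is a simplified illustration that understands only the three line forms used in the example (-t, -m, and stripe definitions on slices); it is not a full md.tab parser and does not describe how metainit itself works.

```python
EXAMPLE = """\
# Example md.tab file
logicalhost1/d10 -t logicalhost1/d11 logicalhost1/d14
logicalhost1/d11 -m logicalhost1/d12 logicalhost1/d13
logicalhost1/d12 1 2 c1t0d0s0 c1t0d1s0
logicalhost1/d13 1 2 c2t0d0s0 c2t0d1s0
logicalhost1/d14 -m logicalhost1/d15 logicalhost1/d16
logicalhost1/d15 1 1 c1t1d0s6
logicalhost1/d16 1 1 c2t1d0s6
"""


def parse(text):
    """Return {name: [metadevices it is built from]} for each definition."""
    deps = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop annotations
        if not line:
            continue
        name, *rest = line.split()
        if rest[0] in ("-t", "-m"):
            deps[name] = rest[1:]   # trans or mirror: built from metadevices
        else:
            deps[name] = []         # stripe/concat: built directly on slices
    return deps


def dependency_order(deps):
    """List names so each metadevice appears after the ones it is built from.
    Raises KeyError if a line references an undefined metadevice."""
    ordered = []

    def visit(name):
        if name in ordered:
            return
        for child in deps[name]:
            visit(child)
        ordered.append(name)

    for name in deps:
        visit(name)
    return ordered


order = dependency_order(parse(EXAMPLE))
for name in order:
    print(name)
```

- The sketch does not detect circular definitions, so it is only suitable for small files like the example.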
- Assuming the diskset is owned appropriately, the metadevices can now be set up using the metainit(1M) command.
-
# metainit -s logicalhost1 -a
|
- The metainit command will read the md.tab file and construct the metadevices in a "bottom up" fashion. As metainit constructs the mirror devices it will issue the following warning.
-
WARNING: This form of metainit is not recommended.
The submirrors may not have the same data.
Please see ERRORS in metainit(1M) for additional information.
|
- When setting up the initial Solstice HA configuration, the form used is acceptable because we have no existing data on the submirrors. After metainit finishes, you must run newfs(1M) to create a UFS file system before using the disks.
- If you have existing data on the disks that will be used for the submirrors you must dump the data before metadevice setup and restore it onto the mirror.
-
Note - It is possible to create a one-way mirror that contains the existing data and then use the metattach(1M) command to attach the other submirror. This approach ensures that both sides of the mirror contain the preexisting data. This procedure is a multistep operation. The procedure cannot be done with metainit alone. You must also run metattach after metainit completes. Refer to the Solstice DiskSuite 4.0 Administration Guide for additional details.
9.3 Creation of Metadevices on Multi-host Disks
- You can either use the metatool command or the md.tab file to create metadevices on the multi-host disks.
· How to Create Metadevices Using metatool
-
-
Take ownership of the logical host where the metadevices will be set up.
You will use the metaset(1M) command to take ownership of the disksets.
-
# metaset -s logicalhost -t
|
-
-
Bring up the metatool graphical user interface on the Solstice HA server with the display redirected to your workstation.
You may have to use the xhost(1) command on your workstation to enable metatool to display there.
-
host1# metatool -display workstation:0 -s logicalhost
Initializing metatool... Done.
Looking up metaset relo-host1... Done.
Checking metaset ownership... Done.
Discovering drives and slices... Done.
Discovering database replicas... Done.
Discovering hot spare pools... Done.
Discovering concat/stripes...Done.
Discovering RAID devices... Done.
Discovering mirrors... Done.
Discovering trans devices... Done.
Updating mount and swap information... Done.
|
-
-
Use the instructions in the Solstice DiskSuite Tool 4.0 User's Guide to create the desired metadevice configuration on the logicalhost.
· How to Create Metadevices Using the md.tab File
-
-
Create the /etc/opt/SUNWmd/md.tab file.
Use the example in "Managing the Metadevice Name Space" on page 9-6 to create the md.tab file.
-
Take ownership of the logical host where the metadevices will be set up.
You will use the metaset command to take ownership of the disksets.
-
# metaset -s logicalhost -t
|
-
-
Run the metainit command to set up the devices.
The metainit command will use the md.tab file to create the metadevices.
-
# metainit -s logicalhost -a
|
9.4 Logging File Systems on the Local Disk
- If you followed the instructions in Chapter 5, "Solaris 2.4 Installation," and left six Mbytes of space on slice 4, you can now use that space to log all or some of the /usr, /opt, and /var file systems. Logging these file systems is not mandatory.
- You cannot log the root file system or swap partition.
-
Note - When determining whether or not to log these file systems, keep in mind that you must disable the logging for /usr, /var, /opt, or any other file system used during a Solaris upgrade or installation.
- To create the UFS logging for these file systems, you must make a trans device for each file system with the file system as the master and slice 4 as the log. The log will be shared by these file systems. After you have created trans devices for the three file systems you must reboot the machine.
- Unlike the multi-host file systems, the local file systems need not be mirrored to be logged.
- For complete instructions on creating UFS logs, refer to Chapter 6 of the Solstice DiskSuite Tool 4.0 User's Guide.
9.5 Setting Up Slices on Multi-host Disks
- When you created the disksets using hasetup, slice 7 was allocated for a metadevice state database replica, slice 6 was set aside for UFS logging (if you selected this option during configuration), and the remainder of the disk was set aside for slice 0.
- You can use the format(1M) command to repartition the disks; however, do not delete, resize, or move slice 7.
- If you responded to queries from hasetup(1M) during the configuration procedure in Chapter 8, "Creating the Configuration," slice 6 has been set aside for UFS logging.
9.6 Setting Up Metadevices for HA-ORACLE
- HA-ORACLE may be configured to use UFS file system logging or raw mirrored metadevices.
9.7 Setting Up Metadevices for HA-NFS
- If you are running HA-NFS on the Solstice HA systems, you must create one or more trans devices that contain a mirrored log and a mirrored master. The submirrors can consist of either concatenations or stripes.
- If you decide to use metatool(1M) to create the metadevices, refer to the Solstice DiskSuite Tool 4.0 User's Guide for details on using this utility.