Chapter 23 Monitoring and Error Reporting (Tasks)
When Solaris Volume Manager encounters a problem, such as being unable
to write to a volume due to physical errors at the slice level, it changes
the status of the volume so system administrators can stay informed. However,
unless you regularly check the status in the Solaris Volume Manager graphical user
interface through the Solaris Management Console, or by running the metastat command, you might not see these status changes in a timely
fashion.
This chapter provides information about various monitoring tools available
for Solaris Volume Manager, including the Solaris Volume Manager SNMP agent, which is a
subagent of the Solstice Enterprise Agents™ monitoring
software. In addition to configuring the Solaris Volume Manager SNMP agent to report
SNMP traps, you can create a shell script to actively monitor many Solaris Volume Manager
functions. Such a shell script can run as a cron job and
be valuable in identifying issues before they become problems.
Solaris Volume Manager Monitoring and Reporting (Task Map)
The following task map identifies the procedures needed to manage Solaris Volume Manager
error reporting.
Setting the mdmonitord Command for Periodic Error Checking
Solaris Volume Manager includes the /usr/sbin/mdmonitord
daemon, which is a program that checks Solaris Volume Manager volumes for errors.
By default, this program checks all volumes for errors only when an error
is detected (for example, through a write error) on a volume. However, you
can set this program to actively check for errors at an interval you specify.
How to Configure the mdmonitord Command for Periodic Error Checking
The /etc/rc2.d/S95svm.sync script starts the mdmonitord command at boot time. Edit the /etc/rc2.d/S95svm.sync script to add a time interval for periodic checking.
-
Become superuser.
-
Edit the /etc/rc2.d/S95svm.sync script and change
the line that starts the mdmonitord command by adding the -t flag and the number of seconds between checks.
if [ -x $MDMONITORD ]; then
    $MDMONITORD -t 3600
    error=$?
    case $error in
    0)  ;;
    *)  echo "Could not start $MDMONITORD. Error $error."
        ;;
    esac
fi
-
Stop and restart the mdmonitord command to activate
your changes.
# /etc/rc2.d/S95svm.sync stop
# /etc/rc2.d/S95svm.sync start
For more information, see mdmonitord(1M).
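After restarting the daemon, you might want to confirm that it is running with the interval you set. The following is a minimal sketch, not part of the product: the function name mdmonitord_interval is hypothetical, and it assumes that the -t flag and its value appear as separate words in the process listing.

```shell
# Read "ps -ef"-style lines on stdin and print the value that follows
# the -t flag of each mdmonitord process. Prints "none" for an
# mdmonitord process that was started without -t, and prints nothing
# if no mdmonitord process is listed.
mdmonitord_interval() {
    awk '/mdmonitord/ && !/awk/ {
        interval = "none"
        for (i = 1; i < NF; i++)
            if ($i == "-t") interval = $(i + 1)
        print interval
    }'
}

# Usage on a live system (not executed here):
#   ps -ef | mdmonitord_interval
```

With the example edit shown earlier, the function would be expected to print 3600.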
Solaris Volume Manager SNMP Agent Overview
The Solaris Volume Manager SNMP trap agent requires both the core packages SUNWlvmr and SUNWlvma and the Solstice Enterprise
Agent packages. Those packages include the following:
-
SUNWmibii
-
SUNWsacom
-
SUNWsadmi
-
SUNWsasnm
-
SUNWsasnx
These packages are part of the Solaris operating environment and are
normally installed by default unless the package selection was modified at
install time or a minimal set of packages was installed. After you confirm
that all five packages are available (by using the pkginfo pkgname command, as in pkginfo SUNWsasnx), you need to configure the Solaris Volume Manager SNMP agent, as described
in the following section.
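Checking the five packages one at a time can be scripted. The following sketch loops over the package names listed above and reports any that are missing. The names SEA_PKGS and missing_sea_packages are hypothetical, and the sketch assumes that pkginfo -q pkgname exits with a non-zero status when the package is not installed.

```shell
# Packages named in this section as the Solstice Enterprise Agents set.
SEA_PKGS="SUNWmibii SUNWsacom SUNWsadmi SUNWsasnm SUNWsasnx"

# Print the names of any packages in SEA_PKGS that pkginfo does not find.
# Prints an empty line if all packages are installed.
missing_sea_packages() {
    missing=""
    for pkg in $SEA_PKGS; do
        # -q suppresses output; only the exit status is used.
        pkginfo -q "$pkg" 2>/dev/null || missing="$missing $pkg"
    done
    echo $missing
}

# Usage on a live system (not executed here):
#   missing_sea_packages
```

An empty result suggests that all five packages are present and that you can continue with the configuration.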
Configuring the Solaris Volume Manager SNMP Agent
The Solaris Volume Manager SNMP agent is not enabled by default. Use the following
procedure to enable SNMP traps.
How to Configure the Solaris Volume Manager SNMP Agent
-
Become superuser.
-
Move the /etc/snmp/conf/mdlogd.rsrc- configuration
file to /etc/snmp/conf/mdlogd.rsrc.
# mv /etc/snmp/conf/mdlogd.rsrc- /etc/snmp/conf/mdlogd.rsrc
-
Edit the /etc/snmp/conf/mdlogd.acl file to specify
which hosts should receive SNMP traps. Look in the file for the following:
trap = {
    {
        trap-community = SNMP-trap
        hosts = corsair
        {
            enterprise = "Solaris Volume Manager"
            trap-num = 1, 2, 3
        }
Change the line that contains hosts = corsair
to specify the host name that you want to receive Solaris Volume Manager SNMP traps.
For example, to send SNMP traps to lexicon, you would edit
the line to hosts = lexicon. If you want to include multiple
hosts, provide a comma-delimited list of host names, as in hosts
= lexicon, idiom.
-
Also edit the /etc/snmp/conf/snmpdx.acl file to
specify which hosts should receive the SNMP traps.
Find the block that begins with trap = and add the
same list of hosts that you added in the previous step. This section might
be commented out with #'s. If so, you must remove the # at the beginning of
the required lines in this section. Additional lines in the trap section are
also commented out, but you can leave those lines alone or delete them for
clarity. After uncommenting the required lines and updating the hosts line,
this section could look like this:
###################
# trap parameters #
###################
trap = {
    {
        trap-community = SNMP-trap
        hosts = lexicon
        {
            enterprise = "sun"
            trap-num = 0, 1, 2-5, 6-16
        }
#       {
#           enterprise = "3Com"
#           trap-num = 4
#       }
#       {
#           enterprise = "snmp"
#           trap-num = 0, 2, 5
#       }
#   }
#   {
#       trap-community = jerry-trap
#       hosts = jerry, nanak, hubble
#       {
#           enterprise = "sun"
#           trap-num = 1, 3
#       }
#       {
#           enterprise = "snmp"
#           trap-num = 1-3
#       }
    }
}
Note –
Make sure that you have the same number of opening and closing
brackets in the /etc/snmp/conf/snmpdx.acl file.
-
Add a new Solaris Volume Manager section to the /etc/snmp/conf/snmpdx.acl file, inside the section that you uncommented in the previous
step.
        trap-community = SNMP-trap
        hosts = lexicon
        {
            enterprise = "sun"
            trap-num = 0, 1, 2-5, 6-16
        }
        {
            enterprise = "Solaris Volume Manager"
            trap-num = 1, 2, 3
        }
Note that the added four lines are placed
immediately after the enterprise = “sun” block.
-
Append the following line to the /etc/snmp/conf/enterprises.oid file:
"Solaris Volume Manager" "1.3.6.1.4.1.42.104"
-
Stop and restart the Solstice Enterprise Agents server.
# /etc/init.d/init.snmpdx stop
# /etc/init.d/init.snmpdx start
Note –
Whenever you upgrade your Solaris operating environment, you will
probably need to edit the /etc/snmp/conf/enterprises.oid
file and append the line in Step 6 again, then
restart the Solstice Enterprise Agents server.
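Because the line might need to be appended again after an upgrade, it can help to make that step safe to repeat. The following sketch appends the line only if the OID is not already in the file. The function name add_svm_oid is hypothetical, and the file path defaults to the one used in this procedure.

```shell
# Append the Solaris Volume Manager enterprise OID line to the given
# file (default: /etc/snmp/conf/enterprises.oid) unless a line with
# that OID is already there. Prints which action was taken.
add_svm_oid() {
    oid_file=${1:-/etc/snmp/conf/enterprises.oid}
    oid_line='"Solaris Volume Manager" "1.3.6.1.4.1.42.104"'
    if grep '1\.3\.6\.1\.4\.1\.42\.104' "$oid_file" > /dev/null 2>&1; then
        echo "already present"
    else
        echo "$oid_line" >> "$oid_file"
        echo "appended"
    fi
}

# Usage as superuser on a live system (not executed here):
#   add_svm_oid
```

Running the function a second time leaves the file unchanged, so it could be called from a post-upgrade checklist without duplicating the entry. You would still need to restart the agent server afterward.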
After you have completed this procedure, your system will issue SNMP
traps to the host or hosts that you specified. You will need to use an appropriate
SNMP monitor, such as Solstice Enterprise Agents software, to view the traps
as they are issued.
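If traps do not arrive, one thing to rule out is a bracket mismatch in the /etc/snmp/conf/snmpdx.acl file, as cautioned in the procedure. The following sketch counts opening and closing braces on lines that are not commented out. The function name check_braces is hypothetical, and the sketch assumes an awk that provides gsub (on Solaris this might mean using nawk).

```shell
# Count "{" and "}" on non-comment lines of the file named in $1.
# Prints both counts and returns 0 if they match, 1 otherwise.
check_braces() {
    awk '
        /^[ \t]*#/ { next }    # ignore commented-out lines
        {
            line = $0
            opens  += gsub(/[{]/, "", line)
            closes += gsub(/[}]/, "", line)
        }
        END {
            printf "open=%d close=%d\n", opens, closes
            exit (opens == closes ? 0 : 1)
        }' "$1"
}

# Usage on a live system (not executed here):
#   check_braces /etc/snmp/conf/snmpdx.acl
```

Matching counts do not prove that the nesting is correct, but a mismatch is a quick sign that a line was left commented out or uncommented by mistake.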
Note –
Set the mdmonitord command to probe your system
regularly to help ensure that you receive traps if problems arise. See Setting the mdmonitord Command for Periodic Error
Checking. Also, refer to Monitoring Solaris Volume Manager with a cron Job for
additional error-checking options.
Solaris Volume Manager SNMP Agent Limitations
The Solaris Volume Manager SNMP agent has certain limitations, and will not
issue traps for all Solaris Volume Manager problems that system administrators will
likely need to know about. Specifically, the agent issues traps only in the following instances:
-
A RAID 1 or RAID 5 subcomponent goes into “needs maintenance”
state
-
A hot spare is swapped into service
-
A hot spare starts to resynchronize
-
A hot spare completes resynchronization
-
A mirror is taken offline
-
A disk set is taken by another host and the current host panics
Many problematic situations, such as an unavailable disk with RAID 0
volumes or soft partitions on it, do not result in SNMP traps, even when reads
and writes to the device are attempted. SCSI or IDE errors are generally reported
in these cases, but other SNMP agents must issue traps for those errors to
be reported to a monitoring console.
Monitoring Solaris Volume Manager with a cron Job
How to Automate Checking for Errors in Volumes
To automatically check your Solaris Volume Manager configuration for errors,
create a script that the cron utility can run periodically.
The following example shows a script that you can adapt and modify for
your needs.
Note –
This script serves as a starting point for automating Solaris Volume Manager
error checking. You will probably need to modify this script for your own
configuration.
#!/bin/ksh
# (for debugging, use "#!/bin/ksh -x" or "#!/bin/ksh -v" as the first line)
#
#ident "@(#)metacheck.sh 1.3 96/06/21 SMI"
# ident='%Z%%M% %I% %E% SMI'
#
# Copyright (c) 1999 by Sun Microsystems, Inc.
#
# metacheck
#
# Check on the status of the metadevice configuration. If there is a problem
# return a non zero exit code. Depending on options, send email notification.
#
# -h
# help
# -s setname
# Specify the set to check. By default, the 'local' set will be checked.
# -m recipient [recipient...]
# Send email notification to the specified recipients. This
# must be the last argument. The notification shows up as a short
# email message with a subject of
# "Solaris Volume Manager Problem: metacheck.who.nodename.setname"
# which summarizes the problem(s) and tells how to obtain detailed
# information. The "setname" is from the -s option, "who" is from
# the -w option, and "nodename" is reported by uname(1).
# Email notification is further affected by the following options:
# -f to suppress additional messages after a problem
# has been found.
# -d to control the suppression.
# -w to identify who generated the email.
# -t to force email even when there is no problem.
# -w who
# indicate who is running the command. By default, this is the
# user-name as reported by id(1M). This is used when sending
# email notification (-m).
# -f
# Enable filtering. Filtering applies to email notification (-m).
# Filtering requires root permission. When sending email notification
# the file /etc/lvm/metacheck.setname.pending is used to
# control the filter. The following matrix specifies the behavior
# of the filter:
#
# problem_found file_exists
# yes no Create file, send notification
# yes yes Resend notification if the current date
# (as specified by -d datefmt) is
# different than the file date.
# no yes Delete file, send notification
# that the problem is resolved.
# no no Send notification if -t specified.
#
# -d datefmt
# Specify the format of the date for filtering (-f). This option
# controls how often re-notification via email occurs. If the
# current date according to the specified format (strftime(3C)) is
# identical to the date contained in the
# /etc/lvm/metacheck.setname.pending file then the message is
# suppressed. The default date format is "%D", which will send one
# re-notification per day.
# -t
# Test mode. Enable email generation even when there is no problem.
# Used for end-to-end verification of the mechanism and email addresses.
#
#
# These options are designed to allow integration of metacheck
# into crontab. For example, a root crontab entry of:
#
# 0,15,30,45 * * * * /usr/sbin/metacheck -f -w SVMcron \
# -d '\%D \%h' -m notice@example.com 2148357243.8333033@pager.example.com
#
# would check for problems every 15 minutes, and generate an email to
# notice@example.com (and send to an email pager service) every hour when
# there is a problem. Note the \ prior to the '%' characters for a
# crontab entry. Bounced email would come back to root@nodename.
# The subject line for email generated by the above line would be
# Solaris Volume Manager Problem: metacheck.SVMcron.nodename.local
#
# display a debug line to controlling terminal (works in pipes)
decho()
{
if [ "$debug" = "yes" ] ; then
echo "DEBUG: $*" < /dev/null > /dev/tty 2>&1
fi
}
# if string $1 is in $2-* then return $1, else return ""
strstr()
{
typeset look="$1"
typeset ret=""
shift
# decho "strstr LOOK .$look. FIRST .$1."
while [ $# -ne 0 ] ; do
if [ "$look" = "$1" ] ; then
ret="$look"
fi
shift
done
echo "$ret"
}
# if string $1 is in $2-* then delete it. return result
strdstr()
{
typeset look="$1"
typeset ret=""
shift
# decho "strdstr LOOK .$look. FIRST .$1."
while [ $# -ne 0 ] ; do
if [ "$look" != "$1" ] ; then
ret="$ret $1"
fi
shift
done
echo "$ret"
}
merge_continued_lines()
{
awk -e '\
BEGIN { line = "";} \
$NF == "\\" { \
$NF = ""; \
line = line $0; \
next; \
} \
$NF != "\\" { \
if ( line != "" ) { \
print line $0; \
line = ""; \
} else { \
print $0; \
} \
}'
}
# trim out stuff not associated with metadevices
find_meta_devices()
{
typeset devices=""
# decho "find_meta_devices .$*."
while [ $# -ne 0 ] ; do
case $1 in
d+([0-9]) ) # metadevice name
devices="$devices $1"
;;
esac
shift
done
echo "$devices"
}
# return the list of top level metadevices
toplevel()
{
typeset comp_meta_devices=""
typeset top_meta_devices=""
typeset devices=""
typeset device=""
typeset comp=""
metastat$setarg -p | merge_continued_lines | while read line ; do
echo "$line"
devices=`find_meta_devices $line`
set -- $devices
if [ $# -ne 0 ] ; then
device=$1
shift
# check to see if device already referred to as component
comp=`strstr $device $comp_meta_devices`
if [ -z "$comp" ] ; then
top_meta_devices="$top_meta_devices $device"
fi
# add components to component list, remove from top list
while [ $# -ne 0 ] ; do
comp=$1
comp_meta_devices="$comp_meta_devices $comp"
top_meta_devices=`strdstr $comp $top_meta_devices`
shift
done
fi
done > /dev/null 2>&1
echo $top_meta_devices
}
#
# - MAIN
#
METAPATH=/usr/sbin
PATH=/usr/bin:$METAPATH
USAGE="usage: metacheck [-s setname] [-h] [[-t] [-f [-d datefmt]] \
[-w who] -m recipient [recipient...]]"
datefmt="%D"
debug="no"
filter="no"
mflag="no"
set="local"
setarg=""
testarg="no"
who=`id | sed -e 's/^uid=[0-9][0-9]*(//' -e 's/).*//'`
while getopts d:Dfms:tw: flag
do
case $flag in
d) datefmt=$OPTARG;
;;
D) debug="yes"
;;
f) filter="yes"
;;
m) mflag="yes"
;;
s) set=$OPTARG;
if [ "$set" != "local" ] ; then
setarg=" -s $set";
fi
;;
t) testarg="yes";
;;
w) who=$OPTARG;
;;
\?) echo $USAGE
exit 1
;;
esac
done
# if mflag specified then everything else part of recipient
shift `expr $OPTIND - 1`
if [ $mflag = "no" ] ; then
if [ $# -ne 0 ] ; then
echo $USAGE
exit 1
fi
else
if [ $# -eq 0 ] ; then
echo $USAGE
exit 1
fi
fi
recipients="$*"
curdate_filter=`date +$datefmt`
curdate=`date`
node=`uname -n`
# establish files
msg_f=/tmp/metacheck.msg.$$
msgs_f=/tmp/metacheck.msgs.$$
metastat_f=/tmp/metacheck.metastat.$$
metadb_f=/tmp/metacheck.metadb.$$
metahs_f=/tmp/metacheck.metahs.$$
pending_f=/etc/lvm/metacheck.$set.pending
files="$metastat_f $metadb_f $metahs_f $msg_f $msgs_f"
rm -f $files > /dev/null 2>&1
trap "rm -f $files > /dev/null 2>&1; exit 1" 1 2 3 15
# Check to see if metadb is capable of running
have_metadb="yes"
metadb$setarg > $metadb_f 2>&1
if [ $? -ne 0 ] ; then
have_metadb="no"
fi
grep "there are no existing databases" < $metadb_f > /dev/null 2>&1
if [ $? -eq 0 ] ; then
have_metadb="no"
fi
grep "/dev/md/admin" < $metadb_f > /dev/null 2>&1
if [ $? -eq 0 ] ; then
have_metadb="no"
fi
# check for problems accessing metadbs
retval=0
if [ "$have_metadb" = "no" ] ; then
retval=1
echo "metacheck: metadb problem, can't run '$METAPATH/metadb$setarg'" \
>> $msgs_f
else
# snapshot the state
metadb$setarg 2>&1 | sed -e '1d' | merge_continued_lines > $metadb_f
metastat$setarg 2>&1 | merge_continued_lines > $metastat_f
metahs$setarg -i 2>&1 | merge_continued_lines > $metahs_f
#
# Check replicas for problems, capital letters in the flags
# indicate an error, fields are separated by tabs.
#
problem=`awk < $metadb_f -F\t '{if ($1 ~ /[A-Z]/) print $1;}'`
if [ -n "$problem" ] ; then
retval=`expr $retval + 64`
echo "\
metacheck: metadb problem, for more detail run:\n\t$METAPATH/metadb$setarg -i" \
>> $msgs_f
fi
#
# Check the metadevice state
#
problem=`awk < $metastat_f -e \
'/State:/ {if ($2 != "Okay" && $2 != "Resyncing") print $0;}'`
if [ -n "$problem" ] ; then
retval=`expr $retval + 128`
echo "\
metacheck: metadevice problem, for more detail run:" \
>> $msgs_f
# refine the message to toplevel metadevices that have a problem
top=`toplevel`
set -- $top
while [ $# -ne 0 ] ; do
device=$1
problem=`metastat$setarg $device | awk -e \
'/State:/ {if ($2 != "Okay" && $2 != "Resyncing") print $0;}'`
if [ -n "$problem" ] ; then
echo "\t$METAPATH/metastat$setarg $device" >> $msgs_f
# find out what is mounted on the device
mp=`mount|awk -e '/\/dev\/md\/dsk\/'$device'[ \t]/{print $1;}'`
if [ -n "$mp" ] ; then
echo "\t\t$mp mounted on $device" >> $msgs_f
fi
fi
shift
done
fi
#
# Check the hotspares to see if any have been used.
#
problem=""
grep "no hotspare pools found" < $metahs_f > /dev/null 2>&1
if [ $? -ne 0 ] ; then
problem=`awk < $metahs_f -e \
'/blocks/ { if ( $2 != "Available" ) print $0;}'`
fi
if [ -n "$problem" ] ; then
retval=`expr $retval + 256`
echo "\
metacheck: hot spare in use, for more detail run:\n\t$METAPATH/metahs$setarg -i" \
>> $msgs_f
fi
fi
# If any errors occurred, then mail the report
if [ $retval -ne 0 ] ; then
if [ -n "$recipients" ] ; then
re=""
if [ -f $pending_f ] && [ "$filter" = "yes" ] ; then
re="Re: "
# we have a pending notification, check date to see if we resend
penddate_filter=`cat $pending_f | head -1`
if [ "$curdate_filter" != "$penddate_filter" ] ; then
rm -f $pending_f > /dev/null 2>&1
else
if [ "$debug" = "yes" ] ; then
echo "metacheck: email problem notification still pending"
cat $pending_f
fi
fi
fi
if [ ! -f $pending_f ] ; then
if [ "$filter" = "yes" ] ; then
echo "$curdate_filter\n\tDate:$curdate\n\tTo:$recipients" \
> $pending_f
fi
echo "\
Solaris Volume Manager: $node: metacheck$setarg: Report: $curdate" >> $msg_f
echo "\
--------------------------------------------------------------" >> $msg_f
cat $msg_f $msgs_f | mailx -s \
"${re}Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients
fi
else
cat $msgs_f
fi
else
# no problems detected,
if [ -n "$recipients" ] ; then
# default is to not send any mail, or print anything.
echo "\
Solaris Volume Manager: $node: metacheck$setarg: Report: $curdate" >> $msg_f
echo "\
--------------------------------------------------------------" >> $msg_f
if [ -f $pending_f ] && [ "$filter" = "yes" ] ; then
# pending filter exists, remove it and send OK
rm -f $pending_f > /dev/null 2>&1
echo "Problem resolved" >> $msg_f
cat $msg_f | mailx -s \
"Re: Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients
elif [ "$testarg" = "yes" ] ; then
# for testing, send mail every time even though there is no problem
echo "Messaging test, no problems detected" >> $msg_f
cat $msg_f | mailx -s \
"Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients
fi
else
echo "metacheck: Okay"
fi
fi
rm -f $files > /dev/null 2>&1
exit $retval
For information on invoking scripts by using the cron
utility, see the cron(1M) man page.
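When the script is called from another script rather than relying on email, its exit status can be interpreted. The script adds 1 when the metadb command cannot be run, 64 for a state database replica problem, and 128 for a metadevice problem. The following sketch decodes those values; the function name decode_metacheck_status is hypothetical. The script also adds 256 when a hot spare is in use, but because a process exit status is normally limited to the range 0-255, that value might not be visible to the caller, so this sketch does not attempt to decode it.

```shell
# Print one line for each problem encoded in a metacheck exit status.
# $1 is the exit status as seen by the caller.
decode_metacheck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "okay"
        return 0
    fi
    if [ $((status & 1)) -ne 0 ]; then
        echo "metadb could not be run"
    fi
    if [ $((status & 64)) -ne 0 ]; then
        echo "state database replica problem"
    fi
    if [ $((status & 128)) -ne 0 ]; then
        echo "metadevice problem"
    fi
}

# Usage on a live system (not executed here), assuming the script is
# installed as /usr/sbin/metacheck as in the crontab example:
#   /usr/sbin/metacheck > /dev/null 2>&1
#   decode_metacheck_status $?
```

For example, a status of 192 would be reported as both a replica problem and a metadevice problem.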