Chapter 26 Troubleshooting Software Problems (Overview)
This chapter provides a general overview of troubleshooting
software problems, including information on troubleshooting system crashes
and viewing system messages.
What's New in Troubleshooting Software Problems?
This section describes features that are new in the Solaris 9 release.
New System Log Rotation
In the Solaris 9 release, system log files are rotated by the logadm command, which runs from an entry in the root crontab
file. The /usr/lib/newsyslog script is no longer used.
The new system log rotation is defined in the /etc/logadm.conf file. This file includes log rotation entries for processes such
as syslogd. For example, one entry in the /etc/logadm.conf file specifies that the /var/log/syslog file
is rotated weekly unless the file is empty. The most recent syslog file becomes syslog.0, the next most recent
becomes syslog.1, and so on. Eight previous syslog log files are kept.
The /etc/logadm.conf file also contains time stamps
of when the last log rotation occurred.
You can use the logadm command to customize system
logging and to add additional logging in the /etc/logadm.conf
file as needed.
For example, to rotate the Apache access and error logs, use the following
commands:
# logadm -w /var/apache/logs/access_log -s 100m
# logadm -w /var/apache/logs/error_log -s 10m
In this example, the Apache access_log file is
rotated when it reaches 100 Mbytes in size. Rotated copies receive a .0, .1, (and so on) suffix, and 10 copies of the old access_log file are kept. The error_log file is rotated when it reaches 10 Mbytes
in size, with the same suffixes and number of copies as the access_log file.
The /etc/logadm.conf entries for the preceding
Apache log rotation examples look similar to the following:
# cat /etc/logadm.conf
.
.
.
/var/apache/logs/error_log -s 10m
/var/apache/logs/access_log -s 100m
For more information, see logadm(1M).
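The numbered-suffix scheme described above can be pictured with a short shell sketch. This is an illustration of the naming only, not how logadm is implemented. The function name rotate_numbered is hypothetical, and a POSIX-compatible shell such as ksh is assumed. On a real system, use logadm itself.

```shell
#!/bin/sh
# Illustration only: mimics the .0, .1, ... naming that logadm produces.
# rotate_numbered FILE KEEP
#   Shifts FILE.n to FILE.(n+1), moves FILE to FILE.0, and keeps at
#   most KEEP old copies.
rotate_numbered() {
    file=$1
    keep=$2
    [ -f "$file" ] || return 1

    # Drop the oldest copy so that at most KEEP old copies remain.
    rm -f "$file.$((keep - 1))"

    # Shift the remaining copies up by one, oldest first.
    n=$((keep - 2))
    while [ "$n" -ge 0 ]; do
        [ -f "$file.$n" ] && mv "$file.$n" "$file.$((n + 1))"
        n=$((n - 1))
    done

    # The live log becomes .0, and an empty log takes its place.
    mv "$file" "$file.0"
    : > "$file"
}
```

For example, calling rotate_numbered on a scratch copy of a log file with a KEEP value of 8 leaves at most eight old copies, numbered .0 through .7, which matches the retention described for the syslog file.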
You can use the logadm command as superuser or by
assuming an equivalent role (with Log Management rights). With role-based
access control (RBAC), you can grant non-root users the privilege of maintaining
log files by providing access to the logadm command.
For example, add the following entry to the /etc/user_attr file to grant user andy the ability to use
the logadm command:
andy::::profiles=Log Management
Or, you can set up a role for log management by using the Solaris Management
Console. For more information about setting up a role, see “Role-Based Access Control (Overview)” in System Administration Guide: Security Services.
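To check which rights profiles an entry such as the one for andy grants, a small helper can read the attribute field of the file. The sketch below assumes the colon-separated layout shown in the example entry, with semicolon-separated key=value attributes in the fifth field. The function name profiles_of is hypothetical.

```shell
#!/bin/sh
# profiles_of USER [FILE]: print one rights profile per line for USER.
# FILE defaults to /etc/user_attr; pass another path to inspect a copy.
profiles_of() {
    user=$1
    file=${2:-/etc/user_attr}
    # Field 5 holds the attributes; keep only the profiles= value and
    # split its comma-separated list into lines.
    grep "^$user:" "$file" | cut -d: -f5 | tr ';' '\n' |
        sed -n 's/^profiles=//p' | tr ',' '\n'
}
```

With the entry shown above in place, profiles_of andy would print Log Management.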
New Fall Back Shell for root Account
If you changed root's shell to a non-existent shell in previous Solaris
releases, you were forced to boot the system from a local CD or from the network
and correct the root shell entry in the /etc/passwd
file.
If you mistakenly provide a non-existent shell for root in the Solaris
9 release, root's shell will automatically fall back to /sbin/sh when one of the following occurs:
For more information, see su(1M).
Where to Find Software Troubleshooting Tasks
Troubleshooting a System Crash
If a system running the Solaris operating environment crashes, provide
your service provider with as much information as possible, including crash
dump files.
What to Do if the System Crashes
The most important things to remember are:
-
Write down the system console messages.
If a system crashes, making it run again might seem like your most pressing
concern. However, before you reboot the system, examine the console screen
for messages. These messages can provide some insight about what caused the
crash. Even if the system reboots automatically and the console messages have
disappeared from the screen, you might be able to check these messages by
viewing the system error log, the /var/adm/messages file.
For more information about viewing system error log files, see How to View System Messages.
If you have frequent crashes and can't determine their cause, gather
all the information you can from the system console or the /var/adm/messages files, and have it ready for a customer service representative
to examine. For a complete list of troubleshooting information to gather for
your service provider, see Troubleshooting a System Crash.
-
Synchronize the disks and reboot.
If the system fails to reboot successfully after a system crash, see Chapter 29, Troubleshooting Miscellaneous Software Problems (Tasks).
-
Check to see if a system crash dump was generated after the system crash.
System crash dumps are saved by default. For information about crash dumps,
see Chapter 28, Managing System Crash Information (Tasks).
Gathering Troubleshooting Data
Answer the following questions to help isolate the system problem. Use the Troubleshooting a System Crash Checklist to gather troubleshooting data for a crashed
system.
Table 26–1 Identifying System Crash Data
| Question | Description |
| Can you reproduce the problem? | This is important because a reproducible test case is often essential for debugging really hard problems. By reproducing the problem, the service provider can build kernels with special instrumentation to trigger, diagnose, and fix the bug. |
| Are you using any third-party drivers? | Drivers run in the same address space as the kernel, with all the same privileges, so they can cause system crashes if they have bugs. |
| What was the system doing just before it crashed? | If the system was doing anything unusual like running a new stress test or experiencing higher-than-usual load, that might have led to the crash. |
| Were there any unusual console messages right before the crash? | Sometimes the system will show signs of distress before it actually crashes; this information is often useful. |
| Did you add any tuning parameters to the /etc/system file? | Sometimes tuning parameters, such as increasing shared memory segments so that the system tries to allocate more than it has, can cause the system to crash. |
| Did the problem start recently? | If so, did the onset of problems coincide with any changes to the system, for example, new drivers, new software, different workload, CPU upgrade, or a memory upgrade? |
Troubleshooting a System Crash Checklist
Use this checklist when gathering system data for a crashed system.
| Item | Your Data |
| Is a system crash dump available? | |
| Identify the operating system release and appropriate software application release levels. | |
| Identify system hardware. Include prtdiag output for sun4u systems. Include Explorer output for other systems. | |
| Are patches installed? If so, include showrev -p output. | |
| Is the problem reproducible? | |
| Does the system have any third-party drivers? | |
| What was the system doing before it crashed? | |
| Were there any unusual console messages right before the system crashed? | |
| Did you add any parameters to the /etc/system file? | |
| Did the problem start recently? | |
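Several checklist items are command output that can be collected in one pass. The following is a minimal sketch, assuming a POSIX-compatible shell. The function names report and gather_crash_data are hypothetical, and the prtdiag location under /usr/platform is an assumption about where the command is usually installed. Commands that are not present are noted rather than treated as errors.

```shell
#!/bin/sh
# report CMD [ARGS...]: print a header, then the command output, or a
# note when the command is not installed on this system.
report() {
    echo "== $* =="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(not available on this system)"
    fi
}

# gather_crash_data: collect the command output named in the checklist.
gather_crash_data() {
    # Assumed location of prtdiag; adjust for your platform.
    platform=$(uname -i 2>/dev/null) || platform=unknown

    report uname -a                                # operating system release
    report showrev -p                              # installed patches
    report "/usr/platform/$platform/sbin/prtdiag"  # hardware (sun4u systems)
}
```

Redirect the output of gather_crash_data to a file of your choice and include that file with the other checklist answers.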
Viewing System Messages
System messages display on the console device. The text of most system
messages looks like this:
[ID msgid facility.priority]
For example:
[ID 672855 kern.notice] syncing file systems...
If the message originated in the kernel, the kernel module name is displayed.
For example:
Oct 1 14:07:24 mars ufs: [ID 845546 kern.notice] alloc: /: file system full
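The fields in a message line can be pulled apart with sed when you need to search or sort messages. This sketch assumes the line layout shown in the example above, with a module name followed by the bracketed ID field. The function name parse_msg is hypothetical. Lines without a module name produce no output.

```shell
#!/bin/sh
# parse_msg LINE: print the module, message ID, facility, and priority
# found in one system message line.
parse_msg() {
    printf '%s\n' "$1" | sed -n \
        's/^.* \([^ ]*\): \[ID \([0-9]*\) \([a-z0-9]*\)\.\([a-z]*\)\].*$/module=\1 id=\2 facility=\3 priority=\4/p'
}

# Example:
#   parse_msg 'Oct 1 14:07:24 mars ufs: [ID 845546 kern.notice] alloc: /: file system full'
# prints:
#   module=ufs id=845546 facility=kern priority=notice
```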
When a system crashes, it might display a message on the system console
like this:
Less frequently, this message might be displayed instead of the panic
message:
The error logging daemon, syslogd, automatically
records various system warnings and errors in message files. By default, many
of these system messages are displayed on the system console and are stored
in the /var/adm directory. You can direct where these
messages are stored by setting up system message logging. For more information,
see How to Customize System Message Logging. These messages can alert you to system
problems, such as a device that is about to fail.
The /var/adm directory contains several message
files. The most recent messages are in the /var/adm/messages
file (and in messages.*), and the oldest are in the messages.3 file. After a period of time (usually every ten days),
a new messages file is created. The messages.0 file is renamed messages.1, messages.1 is renamed messages.2, and messages.2 is renamed messages.3. The current /var/adm/messages.3 file is deleted.
Because the /var/adm directory stores large files
containing messages, crash dumps, and other data, this directory can consume
lots of disk space. To keep the /var/adm directory from
growing too large, and to ensure that future crash dumps can be saved, you
should remove unneeded files periodically. You can automate this task by using
the crontab file. For more information on automating this
task, see How to Delete Crash Dump Files and Chapter 18, Scheduling System Tasks (Tasks).
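Before automating removal, it can help to list which files would qualify. The following sketch only prints candidates and deletes nothing. The function name list_old_files and the 30-day threshold in the usage comment are assumptions for illustration.

```shell
#!/bin/sh
# list_old_files DIR DAYS: print regular files under DIR that were last
# modified more than DAYS days ago. Nothing is removed.
list_old_files() {
    find "$1" -type f -mtime +"$2" -print
}

# Example: review files in /var/adm that are older than 30 days.
#   list_old_files /var/adm 30
```

Reviewing this list first reduces the risk of a scheduled job removing a file that is still needed.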
How to View System Messages
Display recent messages generated by a system crash or reboot by using
the dmesg command.
Or, use the more command to display one screen of
messages at a time.
For more information, see dmesg(1M).
Example—Viewing System Messages
The following example shows output from the dmesg
command.
$ dmesg
Jan 3 08:44:41 starbug genunix: [ID 540533 kern.notice] SunOS Release 5.9 ...
Jan 3 08:44:41 starbug genunix: [ID 913631 kern.notice] Copyright 1983-2003 ...
Jan 3 08:44:41 starbug genunix: [ID 678236 kern.info] Ethernet address ...
Jan 3 08:44:41 starbug unix: [ID 389951 kern.info] mem = 131072K (0x8000000)
Jan 3 08:44:41 starbug unix: [ID 930857 kern.info] avail mem = 121888768
Jan 3 08:44:41 starbug rootnex: [ID 466748 kern.info] root nexus = Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 333MHz)
Jan 3 08:44:41 starbug rootnex: [ID 349649 kern.info] pcipsy0 at root: UPA 0x1f0x0
Jan 3 08:44:41 starbug genunix: [ID 936769 kern.info] pcipsy0 is /pci@1f,0
Jan 3 08:44:41 starbug pcipsy: [ID 370704 kern.info] PCI-device: pci@1,1, simba0
Jan 3 08:44:41 starbug genunix: [ID 936769 kern.info] simba0 is /pci@1f,0/pci@1,1
Jan 3 08:44:41 starbug pcipsy: [ID 370704 kern.info] PCI-device: pci@1, simba1
Jan 3 08:44:41 starbug genunix: [ID 936769 kern.info] simba1 is /pci@1f,0/pci@1
Jan 3 08:44:57 starbug simba: [ID 370704 kern.info] PCI-device: ide@3, uata0
Jan 3 08:44:57 starbug genunix: [ID 936769 kern.info] uata0 is /pci@1f,0/pci@1,1/ide@3
Jan 3 08:44:57 starbug uata: [ID 114370 kern.info] dad0 at pci1095,6460
.
.
.
Customizing System Message Logging
You can capture additional error messages that are generated by various
system processes by modifying the /etc/syslog.conf file.
By default, the /etc/syslog.conf file directs many system
process messages to the /var/adm/messages files. Crash
and boot messages are stored here as well. To view the messages in the /var/adm
directory, see How to View System Messages.
The /etc/syslog.conf file has two columns separated
by tabs:
facility.level ... action
| facility.level | A facility, or system source of the message or condition. Can be a comma-separated list of facilities. Facility values are listed in Table 26–2. A level indicates the severity or priority of the condition being logged. Priority levels are listed in Table 26–3. |
| action | The action field indicates where the messages are forwarded. |
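Because the two columns must be separated by tabs, an entry typed with spaces is easy to miss by eye. The sketch below flags entries that have no tab between the first column and the second. The function name check_tabs is hypothetical, and the check is deliberately simple: it does not understand m4 macro lines, so those might be reported as well.

```shell
#!/bin/sh
# check_tabs FILE: print "LINE_NUMBER: text" for each entry whose first
# column is not followed by at least one tab and then an action.
check_tabs() {
    awk '
        /^[ \t]*#/ { next }    # skip comment lines
        /^[ \t]*$/ { next }    # skip blank lines
        !/^[^ \t]+\t+[^\t]/ { printf "%d: %s\n", NR, $0 }
    ' "$1"
}
```

Run the function against a copy of the file after editing. No output means every entry uses a tab separator.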
The following example shows sample lines from a default /etc/syslog.conf file.
user.err /dev/sysmsg
user.err /var/adm/messages
user.alert `root, operator'
user.emerg *
This means the following user messages are automatically logged:
-
User errors are printed to the console and also are logged
to the /var/adm/messages file.
-
User messages requiring immediate action (alert)
are sent to the root and operator users.
-
User emergency messages are sent to individual users.
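To see where messages for one selector are sent, the matching entries can be listed with awk. This is a minimal sketch that matches the first column exactly, so combined selectors and wildcards are not expanded. The function name actions_for is hypothetical, and on Solaris a newer awk such as nawk might be needed for the -v option and the sub() function.

```shell
#!/bin/sh
# actions_for SELECTOR FILE: print the action column of every entry in
# FILE whose first column is exactly SELECTOR.
actions_for() {
    awk -v sel="$1" '
        /^[ \t]*#/ { next }             # skip comment lines
        $1 == sel {
            sub(/^[^ \t]+[ \t]+/, "")   # drop the selector column
            print
        }
    ' "$2"
}

# Example, using the sample lines above saved in a scratch file:
#   actions_for user.err sample.conf
# prints:
#   /dev/sysmsg
#   /var/adm/messages
```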
The most common error condition sources are shown in the following table.
The most common priorities are shown in Table 26–3
in order of severity.
Table 26–2 Source Facilities for syslog.conf Messages
| Source | Description |
| kern | The kernel |
| auth | Authentication |
| daemon | All daemons |
| mail | Mail system |
| lp | Spooling system |
| user | User processes |
Note –
The number of syslog facilities that can be
activated in the /etc/syslog.conf file is unlimited.
Table 26–3 Priority Levels for syslog.conf Messages
| Priority | Description |
| emerg | System emergencies |
| alert | Errors requiring immediate correction |
| crit | Critical errors |
| err | Other errors |
| info | Informational messages |
| debug | Output used for debugging |
| none | This setting doesn't log output |
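An entry written for one priority also selects messages of higher severity. The sketch below encodes the standard syslog severity order, which also includes the warning and notice levels that the table does not list, and checks whether a message level would be selected. The function names level_rank and selected_by are hypothetical.

```shell
#!/bin/sh
# level_rank NAME: print the numeric severity (0 is the most severe).
level_rank() {
    case $1 in
        emerg)   echo 0 ;;
        alert)   echo 1 ;;
        crit)    echo 2 ;;
        err)     echo 3 ;;
        warning) echo 4 ;;
        notice)  echo 5 ;;
        info)    echo 6 ;;
        debug)   echo 7 ;;
        *)       return 1 ;;
    esac
}

# selected_by MSG_LEVEL ENTRY_LEVEL: succeed when a message at MSG_LEVEL
# is at least as severe as ENTRY_LEVEL. The none level selects nothing.
selected_by() {
    [ "$2" = none ] && return 1
    msg=$(level_rank "$1") || return 2
    sel=$(level_rank "$2") || return 2
    [ "$msg" -le "$sel" ]
}
```

For example, selected_by alert err succeeds because alert is more severe than err, while selected_by info err fails.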
How to Customize System Message Logging
-
Become superuser.
-
Edit the /etc/syslog.conf file, adding or changing
message sources, priorities, and message locations according to the syntax
described in syslog.conf(4).
-
Exit the file, saving the changes.
Example—Customizing System Message Logging
This sample /etc/syslog.conf user.emerg facility sends user emergency messages to root and
individual users.
Enabling Remote Console Messaging
The following new console features improve your ability to troubleshoot
remote systems:
-
The consadm command enables you to select
a serial device as an auxiliary (or remote) console. Using the consadm command, a system administrator can configure one or more
serial ports to display redirected console messages and to host sulogin sessions when the system transitions between run levels.
This feature enables you to dial in to a serial port with a modem to monitor
console messages and participate in init state transitions.
(For more information, see sulogin(1M) and the step-by-step procedures that
follow.)
While you can log in to a system using a port configured as an auxiliary
console, it is primarily an output device displaying information that is also
displayed on the default console. If boot scripts or other applications read
and write to and from the default console, the write output displays on all
the auxiliary consoles, but the input is only read from the default console.
(For more information on using the consadm command during
an interactive login session, see Using the consadm Command During an Interactive
Login Session.)
-
Console output now consists of kernel and syslog messages written to a new pseudo device, /dev/sysmsg. In addition, rc script startup messages are
written to /dev/msglog. Previously, all of these messages
were written to /dev/console.
Scripts that direct console output to /dev/console
need to be changed to /dev/msglog if you want to see
script messages displayed on the auxiliary consoles. Programs referencing /dev/console should be explicitly modified to use syslog() or strlog() if you want messages
to be redirected to an auxiliary device.
-
The consadm command runs a daemon to monitor
auxiliary console devices. Any display device designated as an auxiliary console
that disconnects, hangs up or loses carrier, is removed from the auxiliary
console device list and is no longer active. Enabling one or more auxiliary
consoles does not disable message display on the default console; messages
continue to display on /dev/console.
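A script that currently writes to /dev/console can route its messages through one small function, so that the target device is changed in a single place. This is a sketch under stated assumptions: the function name console_msg and the MSGLOG_DEV variable are hypothetical, and the fallback to standard error is a design choice for systems where /dev/msglog is not writable.

```shell
#!/bin/sh
# Device that receives script messages; override MSGLOG_DEV to test.
MSGLOG_DEV=${MSGLOG_DEV:-/dev/msglog}

# console_msg TEXT...: write TEXT to the message log device when it is
# writable, otherwise fall back to standard error.
console_msg() {
    if [ -w "$MSGLOG_DEV" ]; then
        echo "$*" >> "$MSGLOG_DEV"
    else
        echo "$*" >&2
    fi
}

# Example:
#   console_msg "starting local service"
```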
Using Auxiliary Console Messaging During Run Level Transitions
Keep the following in mind when using auxiliary console messaging during
run level transitions:
-
Input cannot come from an auxiliary console if user input
is expected for an rc script that is run when a system
is booting. The input must come from the default console.
-
The sulogin program, invoked by init to prompt for the superuser password when transitioning between
run levels, has been modified to send the superuser password prompt to each
auxiliary device in addition to the default console device.
-
When the system is in single-user mode and one or more auxiliary
consoles are enabled using the consadm command, a console
login session runs on the first device to supply the correct superuser password
to the sulogin prompt. When the correct password is received
from a console device, sulogin disables input from all
other console devices.
-
A message is displayed on the default console and the other
auxiliary consoles when one of the consoles assumes single-user privileges.
This message indicates which device has become the console by accepting a
correct superuser password. If there is a loss of carrier on the auxiliary
console running the single-user shell, one of two actions might occur:
-
If the auxiliary console represents a system at run level
1, the system proceeds to the default run level.
-
If the auxiliary console represents a system at run level
S, the system displays the ENTER RUN LEVEL (0-6, s or S):
message on the device where the init s or shutdown command had been entered from the shell. If there isn't any carrier
on that device either, you will have to reestablish carrier and enter the
correct run level. The init or shutdown
command will not redisplay the run-level prompt.
-
If you are logged in to a system using a serial port, and
an init or shutdown command is issued
to transition to another run level, the login session is lost whether this
device is the auxiliary console or not. This situation is identical to Solaris
releases without auxiliary console capabilities.
-
Once a device is selected as an auxiliary console using the consadm command, it remains the auxiliary console until the system
is rebooted or the auxiliary console is unselected. However, the consadm command includes an option to set a device as the auxiliary
console across system reboots. (See the following procedure for step-by-step
instructions.)
Using the consadm Command During an Interactive
Login Session
If you want to run an interactive login session by logging in to a system
using a terminal that is connected to a serial port, and then using the consadm command to see the console messages from the terminal, note
the following behavior.
-
If you use the terminal for an interactive login session while
the auxiliary console is active, the console messages are sent to the /dev/sysmsg or /dev/msglog devices.
-
While you issue commands on the terminal, input goes to your
interactive session and not to the default console (/dev/console).
-
If you run the init command to change run
levels, the remote console software kills your interactive session and runs
the sulogin program. At this point, input is accepted only
from the terminal and is treated like it's coming from a console device. This
allows you to enter your password to the sulogin program
as described in Using Auxiliary Console Messaging During Run Level Transitions.
Then, if you enter the correct password on the (auxiliary) terminal,
the auxiliary console runs an interactive sulogin session
and locks out the default console and any competing auxiliary console. This means
the terminal essentially functions as the system console.
-
From here you can change to run level 3 or go to another run
level. If you change run levels, sulogin runs again on
all console devices. If you exit or specify that the system should come up
to run level 3, then all auxiliary consoles lose their ability to provide
input. They revert to being display devices for console messages.
As the system is coming up, you must provide information to rc scripts on the default console device. After the system comes
back up, the login program runs on the serial ports and
you can log back into another interactive session. If you've designated the
device to be an auxiliary console, you will continue to get console messages
on your terminal, but all input from the terminal goes to your interactive
session.
How to Enable an Auxiliary (Remote) Console
The consadm daemon does not start monitoring the
port until after you add the auxiliary console with the consadm
command. As a security feature, console messages are only redirected until
carrier drops, or the auxiliary console device is unselected. This means carrier
must be established on the port before you can successfully use the consadm command.
For more information on enabling an auxiliary console, see consadm(1M).
-
Log in to the system as superuser.
-
Enable the auxiliary console.
-
Verify that the current connection is the auxiliary console.
Example—Enabling an Auxiliary (Remote) Console
# consadm -a /dev/term/a
# consadm
/dev/term/a
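The verification step can also be scripted by checking the list that consadm prints. The sketch below reads a device list on standard input, one path per line as in the example output, so it can be tried without changing any console settings. The function name is_listed is hypothetical, and a POSIX grep that supports the -F and -x options is assumed.

```shell
#!/bin/sh
# is_listed DEVICE: read a device list on standard input (one path per
# line) and succeed only if DEVICE appears as a whole line.
is_listed() {
    grep -Fx -- "$1" >/dev/null
}

# Example:
#   consadm | is_listed /dev/term/a && echo "auxiliary console is enabled"
```

The same check fails after the device is disabled, because consadm then prints an empty list.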
How to Display a List of Auxiliary Consoles
-
Log in to the system as superuser.
-
Select one of the following steps:
-
Display the list of auxiliary consoles.
-
Display the list of persistent auxiliary consoles.
How to Enable an Auxiliary (Remote) Console Across System Reboots
-
Log in to the system as superuser.
-
Enable the auxiliary console across system reboots.
# consadm -a -p devicename
This adds the device to the list of persistent auxiliary consoles.
-
Verify that the device has been added to the list of persistent auxiliary
consoles.
Example—Enabling an Auxiliary (Remote) Console Across System
Reboots
# consadm -a -p /dev/term/a
# consadm
/dev/term/a
How to Disable an Auxiliary (Remote) Console
-
Log in to the system as superuser.
-
Select one of the following steps:
-
Disable the auxiliary console.
or
-
Disable the auxiliary console and remove it from the list of persistent
auxiliary consoles.
# consadm -p -d devicename
-
Verify that the auxiliary console has been disabled.
Example—Disabling an Auxiliary (Remote) Console
# consadm -d /dev/term/a
# consadm