Common Administration Tasks

Enabling and Using Crash Dumps

8

This chapter contains these sections:
  • What Happens When a System Crashes
  • What Is a Crash Dump?
  • Enabling and Disabling Crash Dumps
  • Recovering From a Crash
  • Using a Crash Dump
  • Additional Diagnostic Techniques
A system may crash or hang so that it no longer responds to commands. You can set up a system so that it automatically saves an image of the kernel when the system crashes. This image is called a crash dump. You can use the information in the crash dump files to diagnose and troubleshoot the cause of the failure.
This chapter describes how to enable crash dumps, and how to use the files and messages to help determine the cause of the failure.

What Happens When a System Crashes

When a system crashes, it:
  • Aborts all running processes
  • Tries to save recent data changes
  • Displays an error message telling why it crashed
  • Tries to write out a crash dump to the disk
  • Tries to reboot the system
During operation, the system stores data in memory buffers, writing data to the disk only when necessary. If a system crashes, data stored in the buffers can be lost. To keep the data on disk up to date, the operating system synchronizes the file system every 30 seconds. It runs the sync command, which updates the superblock and writes out any new information to the disk.
When a system crashes, data stored in memory may not have been completely written to disk. This can cause inconsistencies in file systems that must be repaired using fsck. See File System Administration and the fsck(1M) manual page for more information on fsck.

Error Messages Created by a Crash

When a system crashes, it displays a message like this:

  panic: error message  

where error message is one of the panic error messages described in the crash(1M) manual page.
Less frequently, this message may be displayed instead of the panic message:

  Watchdog reset !  

Crash messages are automatically stored in the /var/adm/messages file throughout the session. These messages are saved whether or not crash dumps are enabled for a system.
· How to Display Messages in /var/adm/messages
* Type dmesg and press Return. The contents of /var/adm/messages are displayed on the screen.
or
* Type more /var/adm/messages and press Return. The contents of /var/adm/messages are displayed on the screen.
See Administration Supplement for Solaris Platforms for examples of the /var/adm/messages file and a description of the booting messages.
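To pick out only the crash-related lines from a long messages file, a small filter can help. The following is a minimal sketch, not part of the system: the function name show_crash_messages is hypothetical, it assumes a POSIX grep that accepts -E, and it assumes crash lines contain the strings "panic" or "Watchdog reset" as in the examples above.

```shell
# show_crash_messages [FILE]
# Print the lines in FILE that mention a panic or a watchdog reset.
# Hypothetical helper; FILE defaults to /var/adm/messages.
show_crash_messages() {
    file=${1:-/var/adm/messages}
    grep -E 'panic|Watchdog reset' "$file"
}
```

Called with no argument, the function reads /var/adm/messages; pass another path to search an older file such as messages.0.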

What Is a Crash Dump?

A crash dump is an image of the kernel state that was in physical memory when the system failed. This memory image (or core file) is a snapshot of the kernel containing all of the program text, data, and control structures that are part of the operating system. When a system crashes, the contents of physical memory are written to the end of the swap slice of the disk. Although the system writes a core file whenever it crashes, it does not save the crash dump file unless you configure the system to do so.

How Crash Dumps Are Created

With crash dumps enabled, when you reboot a system after a crash, the savecore program runs. savecore preserves a copy of the crash dump by writing it from the end of the swap slice into the directory /var/crash/systemname, where systemname is the name of the system. savecore saves the core image in the file vmcore.n and the namelist for the kernel in the file unix.n. The n suffix is incremented each time savecore is run. As a result, the /var/crash/systemname directory can grow quite large on a system that crashes repeatedly.
Before savecore writes out a core image, it tries to determine the amount of available space left in the file system by reading the minfree file in the /var/crash/systemname directory. The minfree file contains a single ASCII number that represents the number of kilobytes of free space that must remain available in the file system. If saving the core file would reduce the free space to below the number in the /var/crash/systemname/minfree file, then savecore does not write out the crash dump. If the minfree file does not exist, savecore always writes out the core file, if one was created.
One way you can control the size of the /var/crash/systemname directory is to edit the minfree file and set the number large enough to prevent savecore from writing out the core file.
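The minfree decision described above comes down to a single comparison. The sketch below illustrates that arithmetic only; it is not how savecore is implemented. The function name can_save_dump and its three arguments are hypothetical, and a POSIX shell with $(( )) arithmetic is assumed.

```shell
# can_save_dump FREE_KB DUMP_KB MINFREE_KB
# Succeed (exit status 0) if writing a dump of DUMP_KB kilobytes would
# still leave at least MINFREE_KB kilobytes free in the file system.
# Hypothetical helper that only illustrates the minfree rule.
can_save_dump() {
    free_kb=$1
    dump_kb=$2
    minfree_kb=$3
    remaining=$((free_kb - dump_kb))
    [ "$remaining" -ge "$minfree_kb" ]
}
```

For example, with 42000 Kbytes free, a 40000-Kbyte dump, and a minfree value of 5000, only 2000 Kbytes would remain, so the check fails and the dump would not be saved.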
You can save a crash dump manually on a system with crash dumps disabled by running savecore as soon as the system has completed booting. If you do not run savecore immediately, the swap space containing the crash dump will be overwritten by programs. See the savecore(1M) manual page for more information.

Enabling and Disabling Crash Dumps

Crash dumps are not enabled by default. Using the crash dump output requires detailed knowledge of the kernel and how it works. You should enable crash dumps only on individual systems that are experiencing frequent system crashes. Once the problem is diagnosed and fixed, disable crash dumps for that system. In other words, do not enable crash dumps unless you plan to use them.
To enable crash dumps, you must modify the /etc/init.d/sysetup file for the system.
· How to Create a Directory for Saving the core File:
  1. Become superuser.

  2. Type cd /var and press Return.

  3. Type mkdir crash and press Return.

    The /var/crash directory is created.

  4. Type cd crash and press Return.

  5. Type mkdir system-name and press Return.

    A directory with the name of the system is created.
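The steps above can be combined into one command by using mkdir -p with the output of uname -n, as the sysetup file itself does. This is a minimal sketch; the function name make_crash_dir and its optional base-directory argument are hypothetical and exist only so that the example can be tried outside /var.

```shell
# make_crash_dir [BASE]
# Create BASE/<system name> and print the resulting path.
# BASE defaults to /var/crash; run as superuser for the default.
make_crash_dir() {
    base=${1:-/var/crash}
    dir=$base/`uname -n`
    mkdir -p "$dir" && echo "$dir"
}
```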

· How to Enable a Crash Dump
  1. Type vi /etc/init.d/sysetup and press Return.

  2. Uncomment the lines that enable the crash dumps by deleting the comment marks (#) from the beginning of those lines.

  3. Save the changes.

This example shows the appropriate section of the /etc/init.d/sysetup file edited to enable crash dumps:

  ##
  ## Default is to not do a savecore
  ##
  if [ ! -d /var/crash/`uname -n` ]
  then mkdir -p /var/crash/`uname -n`
  fi
       echo 'checking for crash dump...\c '
  savecore /var/crash/`uname -n`
       echo ''
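After editing, you may want to confirm that the savecore line is really active. The following sketch checks for an uncommented savecore line; the function name savecore_enabled is hypothetical, and a grep that understands the POSIX [[:space:]] class is assumed.

```shell
# savecore_enabled [FILE]
# Succeed if FILE contains a savecore line that is not commented out.
# Hypothetical helper; FILE defaults to /etc/init.d/sysetup.
savecore_enabled() {
    file=${1:-/etc/init.d/sysetup}
    grep '^[[:space:]]*savecore' "$file" >/dev/null
}
```

A typical use would be: savecore_enabled && echo "crash dumps enabled".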

· How to Reserve File System Space
  1. Type cd /var/crash/system-name and press Return.

  2. Create a file named minfree and specify the minimum free space (in kilobytes) that must remain available in the file system. For example, to reserve 5000 Kbytes of free space, create a minfree file that looks like this:


  saturn% more /var/crash/saturn/minfree  
  5000  
  saturn%  
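One way to create the file is to redirect echo into it. This is a minimal sketch; the function name write_minfree is hypothetical, and the directory and value are passed as arguments so that nothing is hard-coded.

```shell
# write_minfree DIR KBYTES
# Write KBYTES as the single number in DIR/minfree.
# Hypothetical helper; DIR is the crash directory for the system.
write_minfree() {
    dir=$1
    kbytes=$2
    echo "$kbytes" > "$dir/minfree"
}
```

For the example above, the call would be write_minfree /var/crash/saturn 5000.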

· How to Disable Crash Dumps
  1. Become superuser.

  2. Edit the /etc/init.d/sysetup file.

  3. Insert a hash mark (#) at the beginning of each of the lines shown below:


  #if [ ! -d /var/crash/`uname -n` ]
  #then mkdir -p /var/crash/`uname -n`
  #fi
  #                echo 'checking for crash dump...\c '
  #savecore /var/crash/`uname -n`
  #                echo ''

  4. Save the changes.

  5. Type rm -rf /var/crash/system-name and press Return.
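If you prefer not to insert the hash marks by hand, sed can do it. The sketch below prints a commented-out version of the file to standard output and leaves the file itself untouched, so you can review the result first. The function name comment_out_savecore is hypothetical, and the sed patterns assume the block starts with the "if [ ! -d /var/crash/" line and ends with the "echo ''" line, as in the listing above.

```shell
# comment_out_savecore [FILE]
# Print FILE with the savecore block commented out. FILE is not changed.
# Hypothetical helper; FILE defaults to /etc/init.d/sysetup.
# If the closing "echo ''" line is missing, sed comments out every
# line from the "if" line to the end of the file.
comment_out_savecore() {
    file=${1:-/etc/init.d/sysetup}
    sed "/^if \[ ! -d \/var\/crash\//,/^[[:space:]]*echo ''/s/^/#/" "$file"
}
```

Redirect the output to a temporary file, compare it with the original, and copy it into place only when it looks right.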

Recovering From a Crash

This section describes how to recover from a crash, what to do if rebooting fails, and how to force a crash dump.
When a system crashes, you need to bring it back up before you can look at the crash dump files. After a crash, the system may reboot automatically.

What to Do if a System Hangs

If a system hangs, use this checklist:
  • Make sure the pointer is in the window where you are typing the commands.
  • Press Control-q in case the user accidentally pressed Control-s, which freezes the screen. Note that, in a windowing environment, Control-s freezes only the window, not the entire screen. If a window is frozen, try using another window.
  • Press Control-\ to force a "quit" in the running program and (probably) write out a core file.
  • Press Control-c to interrupt the program that may be running.
  • If possible, log in to the system from another terminal, or log in remotely from another system on the network. Type ps -ef and look for the hung process. If it looks like the window system is hung, find the process and kill it.
  • Try becoming superuser and rebooting the system.
  • If the system still does not respond, force a crash dump and reboot. See Administration Supplement for Solaris Platforms for information on forcing a crash dump and booting.
  • If the system still does not respond, turn the power off, wait a minute or so, then turn the power back on. This procedure is frequently called power cycling.
  • If you cannot get the system to respond at all, contact your local service provider for help.
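When you look through ps -ef output for a hung process, as the checklist suggests, a filter that keeps the header line makes the columns easier to read. This is a minimal sketch: the function name filter_ps is hypothetical, it reads ps output on standard input, and it assumes an awk that supports -v (nawk on older systems).

```shell
# filter_ps PATTERN
# Read ps -ef output on standard input and print the header line
# plus every line that matches PATTERN.
# Hypothetical helper; typical use: ps -ef | filter_ps olwm
filter_ps() {
    awk -v pat="$1" 'NR == 1 || $0 ~ pat'
}
```

The second column of each matching line is the process ID to pass to kill.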

What to Do if Rebooting Fails

After a crash, the system may reboot automatically. If the automatic reboot fails with a message such as:

  reboot failed: help  

then run fsck in single-user mode.
If the system does not reboot, or if it reboots and then crashes again, there may be a hardware problem with a disk or one of the boards.
Check your hardware connections:
  • Make sure the equipment is plugged in.
  • Make sure all the switches are in the proper settings and pushed all the way in.
  • Look at all the connectors and cables, including the Ethernet cables.
  • If all this fails, turn off the power to the system, wait 10 to 20 seconds, and then turn on the power again.
If you cannot find any obvious fault with the connections, and the system still refuses to respond, contact your local service provider.

Before You Call for Help

Before calling for help, make sure you have accurately copied down crash messages from the console or taken them from the /var/adm/messages files.
If you are having frequent crashes, gather all the information you can about them and have it ready when you call for help.

Using a Crash Dump

Use the crash kernel debugger to examine the memory images of a live or crashed system kernel. You can examine the control structures, active tables, and other information about the operation of the kernel. The syntax of the command is:
/usr/sbin/crash [ -d dump-file ] [ -n name-list ] [ -w output-file ]

Only a few aspects of crash are useful to a system administrator. Completely describing the crash debugger is beyond the scope of this book. To use crash to its full potential requires a detailed knowledge of the kernel. Saved crash dumps can, however, be useful to send to a customer service representative for analysis. For details on the operation of the crash utility, see the crash(1M) manual page.
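To point crash at a saved dump, the -d and -n options take the vmcore.n and unix.n files that savecore wrote. The sketch below only builds the command line so that you can inspect it before running it; the function name crash_command is hypothetical, and the directory and dump number are arguments you supply.

```shell
# crash_command DIR N
# Print the crash command line for saved dump number N in DIR.
# Hypothetical helper; it prints the command and does not run it.
crash_command() {
    dir=$1
    n=$2
    echo "/usr/sbin/crash -d $dir/vmcore.$n -n $dir/unix.$n"
}
```

For the first dump saved on a system named saturn, crash_command /var/crash/saturn 0 prints a command that names vmcore.0 and unix.0.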

Additional Diagnostic Techniques

Log files and system messages provide information that can help determine what is wrong with a system that hangs, crashes, or does not reboot. This section describes how to read and use these messages.

Looking at Messages Generated During Booting

The /usr/sbin/dmesg command displays the error messages generated during booting. You can view these messages or redirect them to a file.
See Administration Supplement for Solaris Platforms for alternate methods of displaying boot messages.

Using System Error Logging (syslogd)

Many system facilities use the error logging daemon, syslogd, to record messages whenever an unusual event occurs. Typically, these messages are written to /var/adm/messages or to the system console. These messages can help you determine the cause of problems with a system. For example, an increasing number of error messages coming from a device may be an indication that the device is about to fail.
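One way to spot a rising number of messages from one source is to count lines per source in the messages file. The following is a rough sketch under stated assumptions: the function name count_by_source is hypothetical, and it assumes each line has the form "month day time host source: text", so that the fifth field names the source. Lines in other formats are counted under whatever their fifth field happens to be.

```shell
# count_by_source [FILE]
# Print "count source" pairs for FILE, busiest source first.
# Hypothetical helper; assumes the fifth field is "source:".
count_by_source() {
    file=${1:-/var/adm/messages}
    awk '{ sub(/:$/, "", $5); count[$5]++ }
         END { for (s in count) print count[s], s }' "$file" | sort -rn
}
```

Running it on successive days and comparing the counts gives a simple trend for each source.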

Setting Up System Logging

To set up system logging, you must have an /etc/syslog.conf file. This file has two columns: the first column specifies the source of the error condition and its priority; the second specifies the place where the errors are logged.
The message sources are specified by two parts separated by a dot (.). The first part is the source or facility, which describes the part of the system generating the message. The second part is the priority of the message. The most common sources are shown in Table 8-1. The most common priorities are shown in Table 8-2.
Table 8-1 syslog.conf
Source    Meaning
kern      The kernel
auth      Authentication
daemon    All daemons
mail      Mail system
lp        Spooling system
user      User processes

Note - There is a maximum of 24 syslog sources (or facilities) that can be activated in the /etc/syslog.conf file.

Table 8-2 syslog.conf
Priority  Meaning
err       All error output
debug     Debugging output
notice    Routine output
crit      Critical errors
emerg     System emergencies
none      Don't log output
For example, the entries:

  user.err                       /dev/console  
  user.err                       /var/adm/messages  
  mail.debug                     /var/log/syslog  

show that user errors are printed to the console and are also logged to the file /var/adm/messages. Mail debugging output is logged to the file /var/log/syslog.
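The two-part form of the first column can be taken apart mechanically. This sketch splits a single selector such as user.err at the dot; the function name split_selector is hypothetical, and it handles only one source.priority pair, not the semicolon-separated or comma-separated lists that appear in the default file.

```shell
# split_selector SELECTOR
# Split one "source.priority" selector at the dot and print both parts.
# Hypothetical helper; does not handle lists such as "*.err;kern.notice".
split_selector() {
    facility=`echo "$1" | cut -d. -f1`
    priority=`echo "$1" | cut -d. -f2`
    echo "facility=$facility priority=$priority"
}
```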
The following example shows the default /etc/syslog.conf file:

  #ident  "%Z%%M% %I%     %E% SMI"        /* SunOS 5.0 */  
  #  
  # Copyright (c) 1991-1993, by Sun Microsystems, Inc.  
  #  
  # syslog configuration file.  
  #  
  # This file is processed by m4 so be careful to quote (`') names
  # that match m4 reserved words.  Also, within ifdef's, arguments  
  # containing commas must be quoted.  
  #  
  # Note: Have to exclude user from most lines so that user.alert  
  #       and user.emerg are not included, because old sendmails  
  #       will generate them for debugging information.  If you  
  #       have no 4.2BSD based systems doing network logging, you  
  #       can remove all the special cases for "user" logging.  
  #  


  *.err;kern.notice;auth.notice;user.none          /dev/console  
  *.err;kern.debug;daemon,auth.notice;mail.crit;user.none /var/adm/messages  
  
  *.alert;kern.err;daemon.err;user.none           operator  
  *.alert;user.none                               root  
  
  *.emerg;user.none                               *  
  
  # if a non-loghost machine chooses to have authentication messages  
  # sent to the loghost machine, un-comment out the following line:  
  #auth.notice                    ifdef(`LOGHOST', /var/log/authlog, @loghost)

  mail.debug                      ifdef(`LOGHOST', /var/log/syslog, @loghost)
  
  #  
  # non-loghost machines will use the following lines to cause "user"  
  # log messages to be logged locally.  
  #  
  ifdef(`LOGHOST', ,
  user.err                                        /dev/console
  user.err                                        /var/adm/messages
  user.alert                                      `root, operator'
  user.emerg                                      *  
  )  

The /var/adm directory contains several message files. The most recent messages are in /var/adm/messages (and in messages.0), and the oldest are in messages.3. After a period of time (usually every ten days), a new messages file is created. At that point the existing messages.3 is deleted, messages.2 is renamed messages.3, messages.1 is renamed messages.2, and messages.0 is renamed messages.1.
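The rotation described above can be sketched as a short series of renames, oldest file first. This is an illustration only, not the script the system runs: the function name rotate_messages is hypothetical, and the step that renames messages to messages.0 before an empty messages file is created is an assumption based on the file names.

```shell
# rotate_messages DIR
# Rotate DIR/messages through messages.0 .. messages.3, oldest first.
# Hypothetical helper; the messages -> messages.0 step is an assumption.
rotate_messages() {
    log=$1/messages
    rm -f "$log.3"
    if [ -f "$log.2" ]; then mv "$log.2" "$log.3"; fi
    if [ -f "$log.1" ]; then mv "$log.1" "$log.2"; fi
    if [ -f "$log.0" ]; then mv "$log.0" "$log.1"; fi
    if [ -f "$log" ]; then mv "$log" "$log.0"; fi
    : > "$log"    # start a new, empty messages file
}
```

Working from the oldest file to the newest means no rename overwrites a file that has not yet been moved.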