Contained WithinFind More DocumentationFeatured Support Resources | Download this book in PDF (1152 KB)
Chapter 9 Debugging With the Kernel Memory AllocatorThe Solaris kernel memory (kmem) allocator provides a powerful set of debugging features that can facilitate analysis of a kernel crash dump. This chapter discusses these debugging features, and the MDB dcmds and walkers designed specifically for the allocator. Bonwick (see Related Books and Papers) provides an overview of the principles of the allocator itself. Refer to the header file <sys/kmem_impl.h> for the definitions of allocator data structures. The kmem debugging features can be enabled on a production system to enhance problem analysis, or on development systems to aid in debugging kernel software and device drivers. Note – MDB exposes kernel implementation details that are subject to change at any time. This guide reflects the Solaris kernel implementation as of the date of publication of this guide. Information provided in this guide about the kernel memory allocator might not be correct or applicable to past or future Solaris releases. Getting Started: Creating a Sample Crash DumpThis section shows you how to obtain a sample crash dump, and how to invoke MDB in order to examine it. Setting kmem_flagsThe kernel memory allocator contains many advanced debugging features, but these are not enabled by default because they can cause performance degradation. In order to follow the examples in this guide, you should turn on these features. You should enable these features only on a test system, as they can cause performance degradation or expose latent problems. The allocator's debugging functionality is controlled by the kmem_flags tunable. To get started, make sure kmem_flags is set properly: # mdb -k > kmem_flags/X kmem_flags: kmem_flags: f If kmem_flags is not set to 'f', you should add the line: set kmem_flags=0xf to /etc/system and reboot the system. When the system reboots, confirm that kmem_flags is set to 'f'. Remember to remove your /etc/system modifications before returning this system to production use. Forcing a Crash DumpThe next step is to make sure crash dumps are properly configured. First, confirm that dumpadm is configured to save kernel crash dumps and that savecore is enabled. See dumpadm(1M) for more information on crash dump parameters. # dumpadm Dump content: kernel pages Dump device: /dev/dsk/c0t0d0s1 (swap) Savecore directory: /var/crash/testsystem Savecore enabled: yes Next, reboot the system using the '-d' flag to reboot(1M), which forces the kernel to panic and save a crash dump. # reboot -d
Sep 28 17:51:18 testsystem reboot: rebooted by root
panic[cpu0]/thread=70aacde0: forced crash dump initiated at user request
401fbb10 genunix:uadmin+55c (1, 1, 0, 6d700000, 5, 0)
%l0-7: 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
...
When the system reboots, make sure the crash dump succeeded: $ cd /var/crash/testsystem $ ls bounds unix.0 unix.1 vmcore.0 vmcore.1 If the dump is missing from your dump directory, it could be that the partition is out of space. You can free up space and run savecore(1M) manually as root to subsequently save the dump. If your dump directory contains multiple crash dumps, the one you just created will be the unix.[n] and vmcore.[n] pair with the most recent modification time. Starting MDBNow, run mdb on the crash dump you created, and check its status: $ mdb unix.1 vmcore.1 Loading modules: [ unix krtld genunix ip nfs ipc ] > ::status debugging crash dump vmcore.1 (32-bit) from testsystem operating system: 5.10 Generic (sun4u) panic message: forced crash dump initiated at user request In the examples presented in this guide, a crash dump from a 32-bit kernel is used. All of the techniques presented here are applicable to a 64-bit kernel, and care has been taken to distinguish pointers (sized differently on 32- and 64-bit systems) from fixed-sized quantities, which are invariant with respect to the kernel data model. An UltraSPARC workstation was used to generate the example presented. Your results can vary depending on the architecture and model of system you use. Allocator BasicsThe kernel memory allocator's job is to parcel out regions of virtual memory to other kernel subsystems (these are commonly called clients). This section explains the basics of the allocator's operation and introduces some terms used later in this guide. Buffer StatesThe functional domain of the kernel memory allocator is the set of buffers of virtual memory that make up the kernel heap. These buffers are grouped together into sets of uniform size and purpose, known as caches. Each cache contains a set of buffers. Some of these buffers are currently free, which means that they have not yet been allocated to any client of the allocator. The remaining buffers are allocated, which means that a pointer to that buffer has been provided to a client of the allocator. If no client of the allocator holds a pointer to an allocated buffer, this buffer is said to be leaked, because it cannot be freed. Leaked buffers indicate incorrect code that is wasting kernel resources. TransactionsA kmem transaction is a transition on a buffer between the allocated and free states. The allocator can verify that the state of a buffer is valid as part of each transaction. Additionally, the allocator has facilities for logging transactions for post-mortem examination. Sleeping and Non-Sleeping AllocationsUnlike the Standard C Library's malloc(3C) function, the kernel memory allocator can block (or sleep), waiting until enough virtual memory is available to satisfy the client's request. This is controlled by the 'flag' parameter to kmem_alloc(9F). A call to kmem_alloc(9F) which has the KM_SLEEP flag set can never fail; it will block forever waiting for resources to become available. Kernel Memory CachesThe kernel memory allocator divides the memory it manages into a set of caches. All allocations are supplied from these caches, which are represented by the kmem_cache_t data structure. Each cache has a fixed buffer size, which represents the maximum allocation size satisfied by that cache. Each cache has a string name indicating the type of data it manages.
Some kernel memory caches are special purpose and are initialized
to allocate only a particular kind of data structure. An example of this
is the “thread_cache,” which allocates only structures of type Note – kmem_cache_alloc() and kmem_cache_free() are not public DDI interfaces. Do NOT write code that relies on them, because they are subject to change or removal in future releases of Solaris. Caches whose name begins with “kmem_alloc_” implement the kernel's general memory allocation scheme. These caches provide memory to clients of kmem_alloc(9F) and kmem_zalloc(9F). Each of these caches satisfies requests whose size is between the buffer size of that cache and the buffer size of the next smallest cache. For example, the kernel has kmem_alloc_8 and kmem_alloc_16 caches. In this case, the kmem_alloc_16 cache handles all client requests for 9-16 bytes of memory. Remember that the size of each buffer in the kmem_alloc_16 cache is 16 bytes, regardless of the size of the client request. In a 14 byte request, two bytes of the resulting buffer are unused, since the request is satisfied from the kmem_alloc_16 cache. The last set of caches are those used internally by the kernel memory allocator for its own bookkeeping. These include those caches whose names start with “kmem_magazine_” or “kmem_va_”, the kmem_slab_cache, the kmem_bufctl_cache and others. Kernel Memory CachesThis section explains how to find and examine kernel memory caches. You can learn about the various kmem caches on the system by issuing the ::kmastat command. > ::kmastat cache buf buf buf memory alloc alloc name size in use total in use succeed fail ------------------------- ------ ------ ------ --------- --------- ----- kmem_magazine_1 8 24 1020 8192 24 0 kmem_magazine_3 16 141 510 8192 141 0 kmem_magazine_7 32 96 255 8192 96 0 ... kmem_alloc_8 8 3614 3751 90112 9834113 0 kmem_alloc_16 16 2781 3072 98304 8278603 0 kmem_alloc_24 24 517 612 24576 680537 0 kmem_alloc_32 32 398 510 24576 903214 0 kmem_alloc_40 40 482 584 32768 672089 0 ... thread_cache 368 107 126 49152 669881 0 lwp_cache 576 107 117 73728 182 0 turnstile_cache 36 149 292 16384 670506 0 cred_cache 96 6 73 8192 2677787 0 ... If you run ::kmastat you get a feel for what a “normal” system looks like. This will help you to spot excessively large caches on systems that are leaking memory. The results of ::kmastat will vary depending on the system you are running on, how many processes are running, and so forth. Another way to list the various kmem caches is with the ::kmem_cache command: > ::kmem_cache ADDR NAME FLAG CFLAG BUFSIZE BUFTOTL 70036028 kmem_magazine_1 0020 0e0000 8 1020 700362a8 kmem_magazine_3 0020 0e0000 16 510 70036528 kmem_magazine_7 0020 0e0000 32 255 ... 70039428 kmem_alloc_8 020f 000000 8 3751 700396a8 kmem_alloc_16 020f 000000 16 3072 70039928 kmem_alloc_24 020f 000000 24 612 70039ba8 kmem_alloc_32 020f 000000 32 510 7003a028 kmem_alloc_40 020f 000000 40 584 ... This command is useful because it maps cache names to addresses, and provides the debugging flags for each cache in the FLAG column. It is important to understand that the allocator's selection of debugging features is derived on a per-cache basis from this set of flags. These are set in conjunction with the global kmem_flags variable at cache creation time. Setting kmem_flags while the system is running has no effect on the debugging behavior, except for subsequently created caches (which is rare after boot-up). Next, walk the list of kmem caches directly using MDB's kmem_cache walker: > ::walk kmem_cache 70036028 700362a8 70036528 700367a8 ... This produces a list of pointers that correspond to each kmem cache in the kernel. To find out about a specific cache, apply the kmem_cache macro: > 0x70039928$<kmem_cache
0x70039928: lock
0x70039928: owner/waiters
0
0x70039930: flags freelist offset
20f 707c86a0 24
0x7003993c: global_alloc global_free alloc_fail
523 0 0
0x70039948: hash_shift hash_mask hash_table
5 1ff 70444858
0x70039954: nullslab
0x70039954: cache base next
70039928 0 702d5de0
0x70039960: prev head tail
707c86a0 0 0
0x7003996c: refcnt chunks
-1 0
0x70039974: constructor destructor reclaim
0 0 0
0x70039980: private arena cflags
0 104444f8 0
0x70039994: bufsize align chunksize
24 8 40
0x700399a0: slabsize color maxcolor
8192 24 32
0x700399ac: slab_create slab_destroy buftotal
3 0 612
0x700399b8: bufmax rescale lookup_depth
612 1 0
0x700399c4: kstat next prev
702c8608 70039ba8 700396a8
0x700399d0: name kmem_alloc_24
0x700399f0: bufctl_cache magazine_cache magazine_size
70037ba8 700367a8 15
...
Important fields for debugging include 'bufsize', 'flags' and 'name'. The name of the kmem_cache (in this case “kmem_alloc_24”) indicates its purpose in the system. Bufsize indicates the size of each buffer in this cache; in this case, the cache is used for allocations of size 24 and smaller. 'flags' indicates what debugging features are turned on for this cache. You can find the debugging flags listed in <sys/kmem_impl.h>. In this case 'flags' is 0x20f, which is KMF_AUDIT | KMF_DEADBEEF | KMF_REDZONE | KMF_CONTENTS | KMF_HASH. This document explains each of the debugging features in subsequent sections. When you are interested in looking at buffers in a particular cache, you can walk the allocated and freed buffers in that cache directly: > 0x70039928::walk kmem 704ba010 702ba008 704ba038 702ba030 ... > 0x70039928::walk freemem 70a9ae50 70a9ae28 704bb730 704bb2f8 ... MDB provides a shortcut to supplying the cache address to the kmem walker: a specific walker is provided for each kmem cache, and its name is the same as the name of the cache. For example: > ::walk kmem_alloc_24 704ba010 702ba008 704ba038 702ba030 ... > ::walk thread_cache 70b38080 70aac060 705c4020 70aac1e0 ... Now you know how to iterate over the kernel memory allocator's internal data structures and examine the most important members of the kmem_cache data structure. Detecting Memory CorruptionOne of the primary debugging facilities of the allocator is that it includes algorithms to recognize data corruption quickly. When corruption is detected, the allocator immediately panics the system. This section describes how the allocator recognizes data corruption; you must understand this to be able to debug these problems. Memory abuse typically falls into one of the following categories:
Keep these problems in mind as you read the next three sections. They will help you to understand the allocator's design, and enable you to diagnose problems more efficiently. Freed Buffer Checking: 0xdeadbeefWhen the KMF_DEADBEEF (0x2) bit is set in the flags field of a kmem_cache, the allocator tries to make memory corruption easy to detect by writing a special pattern into all freed buffers. This pattern is 0xdeadbeef. Since a typical region of memory contains both allocated and freed memory, sections of each kind of block will be interspersed; here is an example from the “kmem_alloc_24” cache: 0x70a9add8: deadbeef deadbeef 0x70a9ade0: deadbeef deadbeef 0x70a9ade8: deadbeef deadbeef 0x70a9adf0: feedface feedface 0x70a9adf8: 70ae3260 8440c68e 0x70a9ae00: 5 4ef83 0x70a9ae08: 0 0 0x70a9ae10: 1 bbddcafe 0x70a9ae18: feedface 139d 0x70a9ae20: 70ae3200 d1befaed 0x70a9ae28: deadbeef deadbeef 0x70a9ae30: deadbeef deadbeef 0x70a9ae38: deadbeef deadbeef 0x70a9ae40: feedface feedface 0x70a9ae48: 70ae31a0 8440c54e The buffer beginning at 0x70a9add8 is filled with the 0xdeadbeef pattern, which is an immediate indication that the buffer is currently free. At 0x70a9ae28 another free buffer begins; at 0x70a9ae00 an allocated buffer is located between them. Note – You might have observed that there are some holes on this picture, and that 3 24–byte regions should occupy only 72 bytes of memory, instead of the 120 bytes shown here. This discrepancy is explained in the next section Redzone: 0xfeedface. Redzone: 0xfeedfaceThe pattern 0xfeedface appears frequently in the buffer above. This pattern is known as the “redzone” indicator. It enables the allocator (and a programmer debugging a problem) to determine if the boundaries of a buffer have been violated by “buggy” code. Following the redzone is some additional information. The contents of that data depends upon other factors (see Memory Allocation Logging). The redzone and its suffix are collectively called the buftag region. Figure 9–1 summarizes this information. Figure 9–1 The Redzone
The buftag is appended to each buffer in a cache when any of the KMF_AUDIT, KMF_DEADBEEF, or KMF_REDZONE flags are set in that buffer's cache. The contents of the buftag depend on whether KMF_AUDIT is set. Decomposing the memory region presented above into distinct buffers is now simple: 0x70a9add8: deadbeef deadbeef \ 0x70a9ade0: deadbeef deadbeef +- User Data (free) 0x70a9ade8: deadbeef deadbeef / 0x70a9adf0: feedface feedface -- REDZONE 0x70a9adf8: 70ae3260 8440c68e -- Debugging Data 0x70a9ae00: 5 4ef83 \ 0x70a9ae08: 0 0 +- User Data (allocated) 0x70a9ae10: 1 bbddcafe / 0x70a9ae18: feedface 139d -- REDZONE 0x70a9ae20: 70ae3200 d1befaed -- Debugging Data 0x70a9ae28: deadbeef deadbeef \ 0x70a9ae30: deadbeef deadbeef +- User Data (free) 0x70a9ae38: deadbeef deadbeef / 0x70a9ae40: feedface feedface -- REDZONE 0x70a9ae48: 70ae31a0 8440c54e -- Debugging Data In the free buffers at 0x70a9add8 and 0x70a9ae28, the redzone is filled with 0xfeedfacefeedface. This a convenient way of determining that a buffer is free. In the allocated buffer beginning at 0x70a9ae00, the situation is different. Recall from Allocator Basics that there are two allocation types: 1) The client requested memory using kmem_cache_alloc(), in which case the size of the requested buffer is equal to the bufsize of the cache. 2) The client requested memory using kmem_alloc(9F), in which case the size of the requested buffer is less than or equal to the bufsize of the cache. For example, a request for 20 bytes will be fulfilled from the kmem_alloc_24 cache. The allocator enforces the buffer boundary by placing a marker, the redzone byte, immediately following the client data: 0x70a9ae00: 5 4ef83 \ 0x70a9ae08: 0 0 +- User Data (allocated) 0x70a9ae10: 1 bbddcafe / 0x70a9ae18: feedface 139d -- REDZONE 0x70a9ae20: 70ae3200 d1befaed -- Debugging Data 0xfeedface at 0x70a9ae18 is followed by a 32-bit word containing what seems to be a random value. This number is actually an encoded representation of the size of the buffer. To decode this number and find the size of the allocated buffer, use the formula: size = redzone_value / 251 So, in this example, size = 0x139d / 251 = 20 bytes. This indicates that the buffer requested was of size 20 bytes. The allocator performs this decoding operation and finds that the redzone byte should be at offset 20. The redzone byte is the hex pattern 0xbb, which is present at 0x729084e4 (0x729084d0 + 0t20) as expected. Figure 9–2 Sample kmem_alloc(9F) Buffer
Figure 9–3 shows the general form of this memory layout. Figure 9–3 Redzone Byte
If the allocation size is the same as the bufsize of the cache, the redzone byte overwrites the first byte of the redzone itself, as shown in Figure 9–4. Figure 9–4 Redzone Byte at the Beginning of the Redzone
This overwriting results in the first 32-bit word of the redzone being 0xbbedface, or 0xfeedfabb depending on the endianness of the hardware on which the system is running. Note – Why is the allocation size encoded this way? To encode the size, the allocator uses the formula (251 * size + 1). When the size decode occurs, the integer division discards the remainder of '+1'. However, the addition of 1 is valuable because the allocator can check whether the size is valid by testing whether (size % 251 == 1). In this way, the allocator defends against corruption of the redzone byte index. Uninitialized Data: 0xbaddcafeYou might be wondering what the suspicious 0xbbddcafe at address 0x729084d4 was before the redzone byte got placed over the first byte in the word. It was 0xbaddcafe. When the KMF_DEADBEEF flag is set in the cache, allocated but uninitialized memory is filled with the 0xbaddcafe pattern. When the allocator performs an allocation, it loops across the words of the buffer and verifies that each word contains 0xdeadbeef, then fills that word with 0xbaddcafe. A system can panic with a message such as: panic[cpu1]/thread=e1979420: BAD TRAP: type=e (Page Fault) rp=ef641e88 addr=baddcafe occurred in module "unix" due to an illegal access to a user address In this case, the address that caused the fault was 0xbaddcafe: the panicking thread has accessed some data that was never initialized. Associating Panic Messages With FailuresThe kernel memory allocator emits panic messages corresponding to the failure modes described earlier. For example, a system can panic with a message such as: kernel memory allocator: buffer modified after being freed modification occurred at offset 0x30 The allocator was able to detect this case because it tried to validate that the buffer in question was filled with 0xdeadbeef. At offset 0x30, this condition was not met. Since this condition indicates memory corruption, the allocator panicked the system. Another example failure message is: kernel memory allocator: redzone violation: write past end of buffer The allocator was able to detect this case because it tried to validate that the redzone byte (0xbb) was in the location it determined from the redzone size encoding. It failed to find the signature byte in the correct location. Since this indicates memory corruption, the allocator panicked the system. Other allocator panic messages are discussed later. Memory Allocation LoggingThis section explains the logging features of the kernel memory allocator and how you can employ them to debug system crashes. Buftag Data IntegrityAs explained earlier, the second half of each buftag contains extra information about the corresponding buffer. Some of this data is debugging information, and some is data private to the allocator. While this auxiliary data can take several different forms, it is collectively known as “Buffer Control” or bufctl data. However, the allocator needs to know whether a buffer's bufctl pointer is valid, since this pointer might also have been corrupted by malfunctioning code. The allocator confirms the integrity of its auxiliary pointer by storing the pointer and an encoded version of that pointer, and then cross-checking the two versions. As shown in Figure 9–5, these pointers are the bcp (buffer control pointer) and bxstat (buffer control XOR status). The allocator arranges bcp and bxstat so that the expression bcp XOR bxstat equals a well-known value. Figure 9–5 Extra Debugging Data in the Buftag
In the event that one or both of these pointers becomes corrupted, the allocator can easily detect such corruption and panic the system. When a buffer is allocated, bcp XOR bxstat = 0xa110c8ed (“allocated”). When a buffer is free, bcp XOR bxstat = 0xf4eef4ee (“freefree”). Note – You might find it helpful to re-examine the example provided in Freed Buffer Checking: 0xdeadbeef, in order to confirm that the buftag pointers shown there are consistent. In the event that the allocator finds a corrupt buftag, it panics the system and produces a message similar to the following: kernel memory allocator: boundary tag corrupted
bcp ^ bxstat = 0xffeef4ee, should be f4eef4ee
Remember, if bcp is corrupt, it is still possible to retrieve its value by taking the value of bxstat XOR 0xf4eef4ee or bxstat XOR 0xa110c8ed, depending on whether the buffer is allocated or free. The
|