Chapter 5 Troubleshooting Directory Server Problems

This chapter describes how to troubleshoot general problems with Directory Server. It includes information about the following topics:

Troubleshooting a Crash

This section describes how to begin troubleshooting a crashed Directory Server process. It describes possible causes of a crash, what pieces of information you need to collect to help identify the problem, and how to analyze the information you collect.

Possible Causes of a Crash

A crash could be caused by one or more of the following:
If a Directory Server process crashes, you need to open a service request with the Sun Support Center.

Collecting Data About a Crash

This section describes the data you need to collect when the server crashes. The most critical data to collect is the core file.

Note – If you contact the Sun Support Center about a crashed Directory Server process, you must provide a core file and logs.

Generating a Core File

Core files and crash dumps are generated when a process or application terminates abnormally. You must configure your system to allow Directory Server to generate a core file if the server crashes. The core file contains a snapshot of the Directory Server process at the time of the crash, and can be indispensable in determining what led to the crash. Core files are written to the same directory as the error logs, which by default is instance-path/logs/. Core files can be quite large, as they include the entry cache.

If a core file was not generated automatically, you can configure your operating system to allow core dumping by using the commands described in the following table and then waiting for the next crash to retrieve the data.
For example, on Solaris OS, you enable applications to generate core files using the following command:
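A minimal sketch of such a command, assuming the standard Solaris coreadm options (-g for the global core file pattern, -e to enable an option). The wrapper name enable_core_dumps and its argument are hypothetical and only stand in for path-to-file. Run it as superuser.

```shell
# Sketch only: enable_core_dumps is a hypothetical helper, not part of the product.
# $1 stands in for path-to-file, the directory that should receive core files.
enable_core_dumps() {
    core_dir=$1
    # %f = executable file name, %n = system node name, %p = process ID
    coreadm -g "$core_dir/%f.%n.%p.core" -e global -e process \
        -e global-setid -e proc-setid -e log
}
```

For example, `enable_core_dumps /var/cores` would request cores named like ns-slapd.nodename.pid.core under /var/cores, which is an illustrative directory only.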
The path-to-file specifies the full path to the core file you want to generate. The file will be named using the executable file name (%f), the system node name (%n), and the process ID (%p). If after enabling core file generation your system still does not create a core file, you may need to change the file-size writing limits set by your operating system. Use the ulimit command to change the maximum core file size and maximum stack segment size as follows:
Check that the limits are set correctly using the -a option as follows:
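As a hedged sketch, assuming a shell whose ulimit builtin accepts the -c (core file size) and -s (stack segment size) options, as bash and ksh do. The function names are hypothetical. Raising a limit can fail if the hard limit set by the administrator is lower.

```shell
# Raise the soft limits for the current shell, then print all limits.
raise_core_limits() {
    ulimit -c unlimited   # maximum core file size
    ulimit -s unlimited   # maximum stack segment size
    ulimit -a             # show the resulting limits
}

# Succeeds only if the current core file size limit is non-zero,
# that is, if the shell is allowed to write a core file at all.
core_dumps_allowed() {
    [ "$(ulimit -c)" != "0" ]
}
```

The limits apply to the shell that runs these functions and to processes started from it, so the server must be started from the same shell for them to take effect.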
For information about configuring core file generation on Red Hat Linux and Windows, see Configuring the Operating System to Generate Core Files in Sun Gathering Debug Data for Sun Java System Directory Server 5.

Next, verify that applications can generate core files using the kill -11 process-id command. The cores should be generated in either the specified directory or in the default instance-name/logs directory.
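The kill -11 check can be rehearsed on a disposable process before it is tried on a real server. The sketch below is an illustration under that assumption: it signals a throwaway sleep process, not ns-slapd, and the function name segv_test is hypothetical.

```shell
# Send signal 11 (SIGSEGV) to a throwaway process and print its exit status.
# bash and dash report a signal death as 128 + signal number, so 139 here.
segv_test() {
    sleep 60 &
    pid=$!
    kill -11 "$pid"
    status=0
    wait "$pid" || status=$?
    echo "$status"
}
```

A printed status of 139 only confirms that the signal terminated the process. Whether a core file was actually written still has to be checked in the configured core directory.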
Getting the Core and Shared Libraries

Get all the libraries and binaries associated with the slapd process for core file analysis. Collect the libraries using the pkg_app script. The pkg_app script packages an executable and all of its shared libraries into one compressed tar file. You provide the process ID of the application and, optionally, the name of the core file to be opened. For more information about the pkg_app script, see Using the pkg_app Script on Solaris.

As superuser, run the pkg_app script as follows:
Note – You can also run the pkg_app script without a core file. This reduces the size of the script's output. You need to later set the variable to the correct location of the core file.

Additional Information

To look at the log files created at the time the problem occurred, check the following files:
If the crash is related to the operating system running out of disk or memory, retrieve the system logs. For example, on Solaris OS check the /var/adm/messages file and the /var/log/syslog file for hardware or memory failures.

To get complete version output, use the following commands:
Analyzing Crash Data

Whenever the Directory Server crashes, it generates a core. With this core file and the process stack of the core file you obtained from the ns-slapd binary directory, you can analyze the problem. This section describes how to analyze the core file crash data on a Solaris OS.

Examining a Core File on Solaris

Once you have obtained a core file, run the pstack and pmap Solaris utilities on the file. The pmap utility shows the process map, which includes a list of virtual addresses, where the dynamic libraries are loaded, and where the variables are declared. The pstack utility shows the process stack. For each thread in the process, it describes the exact stack of instructions the thread was executing at the moment when the process died or when the pstack command was executed.

These utilities must be run from the directory that contains the ns-slapd binary, root-dir/bin/slapd/server. Run the utilities as follows:
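A minimal sketch of running both utilities against a core file and keeping their output. The function name examine_core and the output file names are hypothetical. It assumes the current directory is the ns-slapd binary directory.

```shell
# $1 = core file, $2 = directory in which to save the output
examine_core() {
    core_file=$1
    out_dir=$2
    pstack "$core_file" > "$out_dir/pstack.out"   # per-thread process stack
    pmap "$core_file" > "$out_dir/pmap.out"       # process address map
}
```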
If the results of the pstack utility are almost empty, all of the lines in the output look as follows:
If your pstack output looks like this, confirm that you are running the utilities from the ns-slapd binary directory. If you did not run the utility from the ns-slapd binary directory, go to that directory and rerun the utility.

You can also use the mdb command instead of the pstack command to view the stack of the core. Run the mdb command as follows:
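As a sketch, assuming mdb reads its commands from standard input when it is not attached to a terminal. The function name core_stack is hypothetical. The $C command prints the stack backtrace and $q quits the debugger.

```shell
# $1 = path to the executable (for example ./ns-slapd), $2 = path to the core file
core_stack() {
    binary=$1
    core_file=$2
    printf '%s\n' '$C' '$q' | mdb "$binary" "$core_file"
}
```

When working interactively, the same two commands can instead be typed at the mdb prompt after starting mdb with the executable and core file as arguments.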
The output of the mdb and pstack commands provides helpful information about the process stack at the time of the crash. The mdb $C command output identifies the exact thread that caused the crash.

On Solaris 8 and 9, the first thread of the pstack output often contains the thread responsible for the crash. On Solaris 10, use mdb to find the crashing thread or, if using the pstack command, analyze the stack by looking for threads that do not contain lwp_park, poll, or pollsys.

For example, the following core process stack occurs during the call of a plug-in function:
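The Solaris 10 advice above, skipping threads that are parked in lwp_park, poll, or pollsys, can be approximated with a small filter. This is a sketch under assumptions: it expects the usual pstack layout, in which each thread starts with a dashed header line containing "lwp#" and each frame line has the function name in its second field. The function name busy_threads is hypothetical. On Solaris, run it with nawk or a POSIX awk, because the old /usr/bin/awk does not support user-defined functions.

```shell
# Read pstack output on stdin and print only the threads whose stack
# contains no lwp_park, poll, or pollsys frame.
busy_threads() {
    awk '
    function emit() {
        if (header != "" && !idle) printf "%s\n%s", header, body
    }
    # A dashed header line starts a new thread.
    /^-+ +lwp# / { emit(); header = $0; body = ""; idle = 0; next }
    {
        fn = $2
        sub(/^_+/, "", fn)    # treat __lwp_park and _poll like lwp_park and poll
        if (fn == "lwp_park" || fn == "lwp-park" || fn == "poll" || fn == "pollsys")
            idle = 1
        body = body $0 "\n"
    }
    END { emit() }
    '
}
```

A typical use would be `pstack core-file | busy_threads`, which leaves the threads most worth reading first.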
When analyzing process stacks from cores, concentrate on the operations in the middle of the thread. Frames at the bottom are too general and frames at the top are too specific. The function calls in the middle of the thread are specific to the Directory Server and can thus help you identify at which point during processing the operation failed. In the above example, the plugin_call_exop_plugins call indicates a problem calling an extended operation in the custom plug-in.

If the problem is related to the Directory Server, you can use the function call that seems like the most likely cause of the problem to search on SunSolve for known problems associated with this function call. SunSolve is located at http://sunsolve.sun.com/.

If you do locate a problem related to the one you are experiencing, confirm that it applies to the version of Directory Server that you are running. To get information about the version you are running, use the following command:
If after doing a basic analysis of your core files you cannot identify the problem, collect the binaries and libraries using the pkg_app script and contact the Sun Support Center.

Troubleshooting an Unresponsive Process

The first step in troubleshooting a Directory Server that is still running but no longer responding to client application requests is to identify which of the three types of performance issue it corresponds to. The type of performance problem you are experiencing depends on the level of CPU use, as described in the following table.

Table 5–1 CPU Level Associated With Performance Problems
The remainder of this section describes the following troubleshooting procedures:

Symptoms of an Unresponsive Process

If your error log contains errors about not being able to open file descriptors, this is usually a symptom of an unresponsive process. For example, the error log may contain a message such as the following:
Other symptoms of an unresponsive process include LDAP connections that do not answer or that hang, no messages in the error or access logs, or an access log that is never updated.

Collecting Data About an Unresponsive Process

The prstat tool tells you the amount of CPU being used by each thread. If you collect a process stack using the pstack utility at the same time you run the prstat tool, you can then use the pstack output to see what the thread was doing when it had trouble. If you run prstat and pstack simultaneously several times, you can see over time whether the same thread was causing the problem and whether it was encountering the problem during the same function call.

If you are experiencing a performance drop, run the commands simultaneously every 2 seconds. If you are experiencing a passive or active hang, run the commands with a slightly longer delay, for example every 10 seconds or so.

Analyzing Data About an Unresponsive Process: an Example

For example, you try running an ldapsearch on your Directory Server as follows:
This command generates a 40 second search with no results. To analyze why the process is unresponsive, first get the process ID using the following command:
Next, rerun the search and during the search run the prstat and pstack commands simultaneously for the Directory Server process, which in the output above has a process ID of 14993.
We rerun the commands three times, with an interval of two seconds between each consecutive run. The output of the first prstat command appears as follows:
The problem appears to be occurring in thread 51. Next, we look for thread 51 in the output of the first pstack command and it appears as follows:
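Picking out the busiest thread, as was done by eye for thread 51, can also be scripted. The sketch below assumes the per-thread layout printed by prstat -L, in which the last column is PROCESS/LWPID and the column before it is the CPU percentage. The function name busiest_lwp is hypothetical.

```shell
# Read prstat -L output on stdin and print the LWP ID with the highest CPU share.
# Prints nothing if every thread is at 0% CPU.
busiest_lwp() {
    awk '
    $NF ~ /\/[0-9]+$/ && $(NF-1) ~ /%$/ {
        cpu = $(NF-1)
        sub(/%/, "", cpu)
        if (cpu + 0 > max) {
            max = cpu + 0
            split($NF, parts, "/")
            lwp = parts[2]
        }
    }
    END { if (lwp != "") print lwp }
    '
}
```

For instance, `prstat -L -p 14993 0 1 | busiest_lwp` would print the thread number to look for in the matching pstack output.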
Note – The ends of the lines in this example have been wrapped so that they fit on the page.

The output of the second and third pstack commands shows the same results, with thread 51 doing the same types of operation. All three pstack outputs taken at two-second intervals show thread 51 doing the same search operations. The first parameter of the op_shared_search function contains the address of the operation taking place, which is 101cfcb90. The same operation occurs in each of the three stacks, meaning that the same search is taking place during the four seconds that elapsed between the first and the last pstack run. Moreover, the prstat output always shows thread 51 as the thread taking the highest amount of CPU.

If we check the access log for the result of the search operations at the time the hang was observed, we find that it is the result of a search on the unindexed description attribute. Creating a description index avoids this hang.

Troubleshooting Drops in Performance

This section describes how to begin troubleshooting a drop in performance. It describes possible causes of performance drops, the information you need to consult if you experience a performance drop, and how to analyze this information.

Possible Causes of a Drop in Performance

Make certain that you have not mistaken an active or passive hang for a performance drop. If you are experiencing a performance drop, it could be for one of the following reasons:
Collecting Data About a Drop in Performance

Collect information about disk, CPU, memory, and process stack use during the period in which performance is dropping.

Collecting Disk, CPU, and Memory Statistics

If your CPU usage is very low (at or around 10%), try to determine if the problem is network related using the netstat command as follows:
A performance drop may be the result of the network if a client is not receiving information despite the fact that access logs show that results were sent immediately. Running the ping and traceroute commands can help you determine if network latency is responsible for the problem.

Collect swap information to see if you are running out of memory. Memory may be your problem if the amount of free swap space reported by the swap command is small.
On Solaris, use the output of the prstat command to identify if other processes could be impacting the system performance. On Linux and HP-UX, use the top command.

Collecting Consecutive Process Stacks on Solaris

Collect consecutive pstack and prstat output of the Directory Server during the period when the performance drops as described in Analyzing Data About an Unresponsive Process: an Example. For example, you could use the following script on Solaris to gather pstack and prstat information:
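A minimal sketch of such a collection script, not the vendor's own. The function name collect_stacks, its argument order, and the default output directory /tmp are assumptions. It relies on prstat -L -p pid 0 1 printing one per-thread sample and on pstack accepting a process ID.

```shell
# $1 = process ID, $2 = number of samples (default 10),
# $3 = seconds between samples (default 1), $4 = output directory (default /tmp)
collect_stacks() {
    pid=$1
    count=${2:-10}
    interval=${3:-1}
    out_dir=${4:-/tmp}
    i=0
    while [ "$i" -lt "$count" ]; do
        # The sample index keeps file names unique even within one second.
        stamp=$(date '+%y%m%d:%H%M%S').$i
        prstat -L -p "$pid" 0 1 > "$out_dir/prstat.$stamp"
        pstack "$pid" > "$out_dir/pstack.$stamp"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Calling `collect_stacks 14993 3 2` would take three paired samples two seconds apart, matching the example analyzed earlier. Each prstat file shares its timestamp with the pstack file taken at the same moment.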
Using the idsktune Command

The idsktune command provides information about system parameters, patch level, and tuning recommendations. You can use the output of this command to detect problems in thread libraries or patches that are missing. For more information about the idsktune command, see idsktune(1M).

Analyzing Data Collected About a Performance Problem

In general, look through your data for patterns and commonalities in the errors encountered. For example, if all operation problems are associated with searches to static groups, modifies to static groups, and searches on roles, this indicates that Directory Server is not properly tuned to handle these expensive operations. For example, the nsslapd-search-tune attribute might not be configured correctly for static group related searches, or the uniqueMember attribute indexed in a substring might affect the group related updates. If you notice that problems are associated with unrelated operations but all at a particular time, this might indicate a memory access problem or a disk access problem.

You can take information culled from your pstacks to SunSolve and search for them along with the phrase unresponsive events to see if anything similar to your problem has already been encountered and solved. SunSolve is located at http://sunsolve.sun.com/pub-cgi/show.pl?target=tous

The remainder of this section provides additional tips to help you analyze the data you collected in the previous steps.

Analyzing the Access Log Using the logconv Command

You can use the logconv command to analyze the Directory Server access logs. This command extracts usage statistics and counts the occurrences of significant events. For more information about this tool, see logconv(1).

For example, run the logconv command as follows:
Check the output file for the following:
Identifying Capacity Limitations: an Exercise

Often a capacity limitation manifests itself as a performance issue. To differentiate between performance and capacity, performance might be defined as “How fast the system is going” while capacity is “the maximum performance of the system or an individual component.”

If your CPU usage is very low (at or around 10%), try to determine if the disk controllers are fully loaded and if input/output is the cause. To determine if your problem is disk related, use the iostat tool as follows:
For example, a directory is available on the internet. Its customers submit searches from multiple sites, and the Service Level Agreement (SLA) allows no more than 5% of requests to have response times of over 3 seconds. Currently 15% of requests take more than 3 seconds, which puts the business in a penalty situation. The system is a 6800 with 12x900MHz CPUs.

The vmstat output looks as follows:
We look at the right 3 columns, us=user, sy=system and id=idle, which show that over 50% of the CPU is idle and available for the performance problem. One way to detect a memory problem is to look at the sr, or scan rate, column of the vmstat output. If the page scanner ever starts running, or the scan rate gets over 0, then we need to look more closely at the memory system. The odd part of this display is that the blocked queue on the left of the display has 18 or 19 processes in it but there are no processes in the run queue. This suggests that the process is blocking somewhere in Solaris without using all of the available CPU. Next, we look at the I/O subsystem. With Solaris 8, the iostat command has a switch, -C, which will aggregate I/Os at the controller level. We run the iostat command as follows:
On controller 1 we are doing 396 reads per second and on controller 3 we are doing 400 reads per second. On the right side of the data, we see that the output shows the controller is almost 200% busy. So the individual disks are doing almost 200 reads per second and the output shows the disks as 100% busy. That leads us to a rule of thumb that individual disks perform at approximately 150 I/Os per second. This does not apply to LUNs or LDEVs from the big disk arrays. So our examination of the numbers leads us to suggest adding 2 disks to each controller and re-laying out the data.

In this exercise we looked at all the numbers and attempted to locate the precise nature of the problem. Do not assume adding CPUs and memory will fix all performance problems. In this case, the search programs were exceeding the capacity of the disk drives, which manifested itself as a performance problem of transactions with extreme response times. All those CPUs were waiting on the disk drives.

Troubleshooting Process Hangs

This section describes how to troubleshoot a totally unresponsive Directory Server process. A totally unresponsive process is called a hang, and there are two types of hang you might experience:
The remainder of this section describes how to troubleshoot each of these types of process hang.

Troubleshooting an Active Hang

A hang is active if the top or vmstat 1 output shows CPU levels of over 95%. This section describes the causes of an active hang, how to collect information about an active hang, and how to analyze this data.

Possible Causes of an Active Hang

Possible causes of an active hang include the following:
Collecting and Analyzing Data About an Active Hang

On a Solaris system, collect several traces of the Directory Server process stack that is hanging using the Solaris pstack utility. Run the command from the root-dir/bin/slapd/server directory. You should also collect statistics about the active process using the Solaris prstat utility. You must collect this information while the server is hanging.

The consecutive pstack and prstat data should be collected every second.

Troubleshooting a Passive Hang

A hang is passive if the top or vmstat 1 output shows low CPU levels.

Possible Causes of a Passive Hang

Possible causes of a passive hang include the following:
Collecting and Analyzing Data About a Passive Hang

On a Solaris system, collect several traces of the Directory Server process stack that is hanging using the Solaris pstack utility. Run the command from the root-dir/bin/slapd/server directory. You must collect this information while the server is hanging. The consecutive pstack data should be collected every three seconds.

Collect several core files that show the state of the server threads while the server is hanging. Do this by generating a core file using the gcore command, changing the name of the core file, waiting 30 seconds, and generating another core file. Repeat the process at least once to get a minimum of three sets of core files and related data.

For more information about generating a core file, see Generating a Core File.

Troubleshooting Database Problems

This section describes how to troubleshoot an inaccessible database.

Possible Causes of Database Problems

The Directory Server database may be inaccessible for one of the following reasons:
# install-path/instance-name/db/guardian
# install-path/instance-name/db/_db.00*
If the start succeeds and the database still cannot be loaded, continue with this procedure.
Back up all database files stored in the db/ directory.
Collect error and access log files from the time during which the database was inaccessible.
# install-path/instance-name/logs/errors*
# install-path/instance-name/access*
This section describes how to troubleshoot a memory leak.
Memory leaks are caused by problems allocating memory, either in Directory Server itself or in custom plug-ins. Troubleshooting these problems can be very difficult, particularly in the case of custom plug-ins.
It is important to do the following before collecting data about your memory leak:
Disable any custom plug-ins
Reduce the cache setting to very low values
Enable the audit log
Once you have done the above, run a test that proves your memory leak. During the life of the test run, gather output from the pmonitor utility as follows:
The pmonitor utility is a process monitor.
Collect the generic Directory Server data, as described in Collecting Generic Data. This data includes the version of Directory Server that you are running, logs from the test run, in particular the audit log, and the Directory Server configuration file.
With the data you collected, you can now contact the Sun Support Center for assistance with your problem.
On Solaris systems, the libumem library is a memory agent library that tracks all of the addresses allocated into the process memory footprint. Usually it is not used in a production environment because it is much slower. However, it is helpful for analyzing the cause of a memory leak. For more information about the libumem library, see the technical article at the following location: http://access1.sun.com/techarticles/libumem.html
Restart the Directory Server using the following command:
# SUN_SUPPORT_SLAPD_NOSH=true LD_PRELOAD=libumem.so \
  UMEM_DEBUG=contents,audit=40,guards UMEM_LOGGING=transaction ./start-slapd
The libumem library is now loaded before the Directory Server starts, instead of using SmartHeap.
Next, run the gcore command several times, once before the memory use started to grow and once after. The gcore command provides a list of addresses and pointers. Use these to read the libumem library.
# cd install-root/bin/slapd/server
# gcore -o /tmp/directory-core process-id
Finally, use the mdb and splitrec tools to analyze the results. The splitrec tool compares the results to see the complete stack of the leak.
# cd install-root/bin/slapd/server
# echo "::umausers -e" | mdb ./ns-slapd path_gcore1 > res.1
# echo "::umausers -e" | mdb ./ns-slapd path_gcore2 > res.2
# splitrec -1 res.1 res.2
The splitrec tool is available through Sun Support. This tool provides a summary of the stacks that have been identified as responsible for leaking allocation stacks. Sun Support can use the contents of these stacks to identify known memory leaks in the SunSolve database. Sometimes the splitrec tool does not provide any output because by default it is configured to report leaks only for stacks that have been identified as leaking more than 100 times. Configure this limit to a lower value using the splitrec -l option.