Chapter 27 Monitoring Messaging Server

In most cases, a well-planned, well-configured server will perform without extensive intervention from an administrator. As an administrator, however, it is your job to monitor the server for signs of problems. This chapter describes the monitoring of the Messaging Server. It consists of the following sections:

Troubleshooting procedures can be found in Chapter 26, Troubleshooting the MTA.

27.1 Automatic Monitoring and Restart

Messaging Server provides a way to transparently monitor services and automatically restart them if they crash or become unresponsive (that is, if a service hangs or freezes up). It can monitor all message store, MTA, and MMP services including the IMAP, POP, HTTP, job controller, dispatcher, and MMP servers. It does not monitor other services such as SMS or TCP/SNMP servers. (TCP/SNMP is monitored by the job controller.) Refer to 4.5 Automatic Restart of Failed or Unresponsive Services and 27.8.9 Monitoring Using msprobe and watcher Functions.

27.2 Daily Monitoring Tasks

The most important tasks you should perform on a daily basis are checking postmaster mail, monitoring the log files, and setting up the msprobe utility. These tasks are described below.

27.2.1 Checking postmaster Mail

Messaging Server has a predefined administrative mailing list set up for postmaster email. Any users who are part of this mailing list will automatically receive mail addressed to postmaster. The rules for postmaster mail are defined in RFC 822, which requires every email site to accept mail addressed to a user or mailing list named postmaster and requires that mail sent to this address be delivered to a real person. All messages sent to postmaster@host.domain are sent to a postmaster account or mailing list. Typically, the postmaster address is where users should send email about their mail service.
As postmaster, you might receive mail from local users about server response time, from other server administrators who are encountering problems sending mail to your server, and so on. You should check postmaster mail daily. You can also configure the server to send certain error messages to the postmaster address. For example, when the MTA cannot route or deliver a message, you can be notified via email sent to the postmaster address. You can also send exception condition warnings (low disk space, poor server response) to postmaster.

27.2.2 Monitoring and Maintaining the Log Files

Messaging Server creates a separate set of log files for each of the major protocols or services it supports, including SMTP, IMAP, POP, and HTTP. These are located in msg-svr-base/data/log. You should monitor the log files on a routine basis, especially if you are having problems with the server. Be aware that logging can impact server performance. The more verbose the logging you specify, the more disk space your log files will occupy for a given amount of time. You should define effective but realistic log rotation, expiration, and backup policies for your server. For information about defining logging policies for your server, see Chapter 25, Managing Logging.

27.2.3 Setting Up the msprobe Utility

The msprobe utility automatically performs monitoring and restart functions. For further information, see 27.8.9 Monitoring Using msprobe and watcher Functions.

27.3 Monitoring System Performance

This chapter focuses on Messaging Server monitoring; however, you will also need to monitor the system on which the server resides. A well-configured server cannot perform well on a poorly tuned system, and symptoms of server failure may be an indication that the hardware is not powerful enough to serve the email load.
This chapter does not provide all the details for monitoring system performance, as many of these procedures are platform-specific and may require that you refer to the platform-specific system documentation. The following procedures are described here for performance monitoring:

27.3.1 Monitoring End-to-end Message Delivery Times

Email needs to be delivered on time. This may be a service agreement requirement, but it is also good policy to have mail delivered as quickly as possible. Slow end-to-end times could indicate a number of things. It may be that the server is not working properly, or that certain times of the day experience overwhelming message loads, or that the existing hardware resources are being pushed beyond their capacity.

27.3.1.1 Symptoms of Poor End-to-end Message Delivery Times

Mail takes longer than normal to be delivered.

27.3.1.2 To Monitor End-to-end Message Delivery Times
27.3.2 Monitoring Disk Space

Inadequate disk space is one of the most common causes of mail server problems and failure. Without space to write to the MTA queues or to the message store, the mail server will fail. In addition, unless log files are monitored and cleaned up, they can grow uncontrollably and fill up all disk space. Message store partitions grow as new messages are delivered to the mailboxes; for example, if message store quotas are not enforced, the message store can outgrow the disk space available for a partition. Another cause of running out of disk space is the MTA message queues growing too large. A third area of concern is a problem with the log file monitoring facilities that lets the log files grow uncontrollably. (Note that there are a number of log files such as LDAP, MTA, and Message Access, and that each of these log files can be stored on different disks.)

27.3.2.1 Symptoms of Disk Space Problems

Different symptoms can occur depending on which disk or partition is running out of space. MTA queues can overflow and reject SMTP connections, messages might remain in the ims_master queue and not be delivered to the message store, and log files can overflow. If a message store partition fills up, message access daemons can fail, and message store data can be corrupted. Message store maintenance utilities such as imexpire and reconstruct can repair the damage and reduce disk usage. However, these utilities require additional disk space, and repairing a partition that has filled an entire disk can potentially cause down time.

27.3.2.2 To Monitor Disk Space

Depending upon the system configuration, you may need to monitor various disks and partitions. For example, MTA queues may reside on one disk/partition, message stores may reside on another, and log files may reside on yet another. Each of these spaces will require monitoring, and the methods to monitor these spaces may differ.
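For the disks and partitions that the product does not watch for you, a small operating-system level check can help. The following is a minimal sketch, not a product feature: the function name disk_over_threshold is hypothetical, it assumes POSIX `df -P` output (fifth column is the capacity, sixth is the mount point), and it assumes mount points without spaces. The default of 75 percent is only an illustrative starting value.

```shell
#!/bin/sh
# Hypothetical helper (not part of Messaging Server).
# Reads `df -P` style output on stdin and prints every mount point whose
# usage is above the given percentage (default 75).
# Assumption: mount points contain no spaces.
disk_over_threshold() {
    threshold="${1:-75}"
    awk -v t="$threshold" '
        NR > 1 {                      # skip the df header line
            pct = $5
            sub(/%/, "", pct)         # "80%" -> "80"
            if (pct + 0 > t + 0) print $6 " " pct "%"
        }'
}

# Example: report every mounted file system above 75 percent.
#   df -P | disk_over_threshold 75
```

Running this from cron against the file systems that hold the MTA queues, the message store, and the log directory gives one consistent report even when those areas live on different disks.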
Messaging Server provides specific methods for monitoring message store disk usage and preventing partitions from filling up all available disk space. You can take the following steps to monitor the message store’s use of disk space:
For details, see the sections that follow: Monitoring the Message Store and Monitoring Message Store Partitions.

Monitoring the Message Store

It is recommended that message store disk usage not exceed 75% capacity. You can monitor message store disk usage by configuring the following alarm attributes using the configutil utility:

By setting these parameters, you can specify how often the system should monitor disk space and under what circumstances the system should send a warning. For example, if you want the system to monitor disk space every 600 seconds, specify the following command:

    configutil -o alarm.diskavail.msgalarmstatinterval -v 600

If you want to receive a warning whenever available disk space falls below 20%, specify the following command:

    configutil -o alarm.diskavail.msgalarmthreshold -v 20

Refer to Table 27–6 for more information on these parameters.

Monitoring Message Store Partitions

You can halt delivery of messages to a message store partition when the partition fills more than a specified percentage of available disk space. This is done by setting two configutil parameters to enable the feature and specify the disk-usage threshold. With this feature, the message store daemon monitors the partition’s disk usage. As disk usage increases, the store daemon dynamically checks the partition more frequently (ranging from once every 100 minutes to once a minute). If disk usage goes higher than the specified threshold, the store daemon:
When disk usage falls below the threshold, the partition is unlocked, and messages are again delivered to the store. The configutil parameters are as follows:
You should set the disk-usage threshold to a percentage low enough to give you time to repartition or assign more disk space to the local message store. For example, suppose a partition fills up disk space at a rate of 2 percent per hour, and it takes an hour to allocate additional disk space for the local message store. In this case, you should set the disk-usage threshold to a value lower than 98 percent.

Monitoring the MTA Queues and Logging Space

You will need to monitor MTA queue disk usage and logging space disk usage. For information on managing logging space, see Chapter 25, Managing Logging. For example, to learn how to monitor the mail.log file, see 25.3 Managing MTA Message and Connection Logs.

27.3.3 Monitoring CPU Usage

High CPU usage is a sign either that there is not enough CPU capacity for the level of usage or that some process is using more CPU cycles than is appropriate.

27.3.3.1 Symptoms of CPU Usage Problems

Poor system response time. Slow logging in of users. Slow rate of delivery.

27.3.3.2 To Monitor CPU Usage

Monitoring CPU usage is a platform-specific task. Refer to the relevant platform documentation.

27.4 Monitoring the MTA

This section consists of the following subsections:

27.4.1 Monitoring the Size of the Message Queues

Excessive message queue growth may indicate that messages are not being delivered, are being delayed in their delivery, or are coming in faster than the system can deliver them. This may have a number of causes, such as a denial of service attack in which huge numbers of messages flood your system, or the Job Controller not running. See 8.5.2 Channel Message Queues, 26.3.6 Messages are Not Dequeued and 26.3.7 MTA Messages are Not Delivered for more information on message queues.

27.4.1.1 Symptoms of Message Queue Problems
27.4.1.2 To Monitor the Size of the Message Queues

Probably the best way to monitor the message queues is to use imsimta qm and imsimta summarize. Refer to 27.8.6 imsimta qm counters. You can also monitor the number of files in the queue directories (msg-svr-base/data/queue/). The number of files will be site-specific, and you’ll need to build a baseline history to find out what is “too many.” This can be done by recording the size of the queue files over a two-week period to get an approximate average.

27.4.2 Monitoring Rate of Delivery Failure

A delivery failure is a failed attempt to deliver a message to an external site. A large increase in the rate of delivery failure can be a sign of a network problem such as a dead DNS server or a remote server timing out on responding to connections.

27.4.2.1 Symptoms of Rate of Delivery Failure

There are no outward symptoms. Many Q records will appear in mail.log_current.

27.4.2.2 To Monitor the Rate of Delivery Failure

Delivery failures are recorded in the MTA logs with the logging entry code Q. Look at the records in the file msg-svr-base/data/log/mail.log_current. Example:

    mail.log:06-Oct-2003 00:24:03.66 501d.0b.9 ims-ms Q 5 durai.balusamy@Sun.COM rfc822;durai.balusamy@Sun.COM durai@ims-ms-daemon <00ce01c38bda$c7e2b240$6501a8c0@guindy> Mailbox is busy

27.4.3 Monitoring Inbound SMTP Connections

An unusual increase in the number of inbound SMTP connections from a given IP address may indicate:
27.4.3.1 Symptoms of Unauthorized SMTP Connections

27.4.3.2 To Monitor Inbound SMTP Connections
Note that you will first need to determine the appropriate number of SMTP connections and their states (ESTABLISHED, CLOSE_WAIT, and so on) for your system before you can tell whether a particular reading is out of the ordinary. If you find many connections staying in the SYN_RECEIVED state, this might be caused by a broken network or a denial of service attack. In addition, the lifetime of an SMTP server process is limited. This is controlled by the MTA configuration variable MAX_LIFE_TIME in the dispatcher.cnf file. The default is 86,400 seconds (one day). Similarly, MAX_LIFE_CONNS specifies the maximum number of connections a server process can handle in its lifetime. If you find a particular SMTP server that has been around for a long time, you may wish to investigate.

27.4.4 Monitoring the Dispatcher and Job Controller Processes

The Dispatcher and Job Controller processes must be operating for the MTA to work. You should have one process of each kind.

27.4.4.1 Symptoms of Dispatcher and Job Controller Processes Down

If the Dispatcher is down or does not have enough resources, SMTP connections are refused. If the Job Controller is down, queue size will grow.

27.4.4.2 To Monitor Dispatcher and Job Controller Processes

Check to see that the processes called dispatcher and job_controller exist. See 26.2.4 Check that the Job Controller and Dispatcher are Running.

27.5 Monitoring LDAP Directory Server

This section consists of the following subsection:

27.5.1 Monitoring slapd

The LDAP directory server (slapd) provides directory information for the messaging system. If slapd is down, the system will not work properly. If slapd response time is too slow, this will affect login speed and any other transaction that requires LDAP lookups.

27.5.1.1 Symptoms of slapd Problems
27.5.1.2 To Monitor slapd
27.6 Monitoring Message Access

This section consists of the following subsections:

27.6.1 Monitoring imapd, popd and httpd

These processes provide access to IMAP, POP and Webmail services. If any of these is not running or not responding, the service will not function appropriately. If the service is running but is overloaded, monitoring will allow you to detect this and configure it more appropriately.

27.6.1.1 Symptoms of imapd, popd and httpd Problems

Connections are refused or the system is too slow to connect. For example, if IMAP is not running and you try to connect to IMAP directly, you will see something like this:

    telnet 0 143
    Trying 0.0.0.0...
    telnet: Unable to connect to remote host: Connection refused

If you try to connect with a client, you will get a message such as: “Client is unable to connect to the server at the location you have specified. The server may be down or busy.”

27.6.1.2 To Monitor imapd, popd and httpd
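As a complement to the manual telnet test shown above, a scripted probe can confirm that a service not only accepts the connection but also answers with the greeting its protocol defines. The sketch below is an illustration only: check_greeting is a hypothetical helper, not a Messaging Server utility. It relies on the protocol definitions, under which an IMAP server opens with a line beginning "* OK" and a POP3 server with a line beginning "+OK". How the greeting is captured (telnet, or another client available on your platform) is left to the caller.

```shell
#!/bin/sh
# Hypothetical helper (not part of Messaging Server).
# Reads the first line a server sent (on stdin) and succeeds only if it is
# the normal greeting for the named protocol.
#   imap: greeting starts with "* OK"
#   pop:  greeting starts with "+OK"
check_greeting() {
    proto="$1"
    line=""
    IFS= read -r line || :        # tolerate EOF or a missing final newline
    case "$proto" in
        imap) case "$line" in "* OK"*) return 0 ;; esac ;;
        pop)  case "$line" in "+OK"*)  return 0 ;; esac ;;
    esac
    return 1                      # no greeting, wrong greeting, or unknown protocol
}

# Example, with the captured greeting supplied on stdin:
#   printf '%s\n' "$greeting" | check_greeting imap || echo "IMAP not healthy"
```

An empty input (connection refused, or a server that accepts the connection but never answers) is reported as a failure, which matches the two symptoms described in 27.6.1.1.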
27.7 Monitoring the Message Store

Messages are stored in a database. The distribution of users on disks, the size of their mailboxes, and disk requirements affect the store performance. These are described in the following subsections:

27.7.1 Monitoring stored

stored performs a variety of important tasks such as deadlock and transaction operations of the message database, enforcing aging policies, and expunging and erasing messages stored on disk. If stored stops running, the messaging server will eventually run into problems. If stored doesn’t start when start-msg is run, no other processes will start. For more information about stored, see stored in Sun Java System Messaging Server 6.3 Administration Reference.

27.7.1.1 Symptoms of stored Problems

There are no outward symptoms.

27.7.1.2 To Monitor stored
27.7.2 Monitoring the State of Message Store Database Locks

The state of database locks is held by different server processes. These database locks can affect the performance of the message store. In case of deadlocks, messages will not be inserted into the store at reasonable speeds and the ims-ms channel queue will grow larger as a result. There are legitimate reasons for a queue to back up, so it is useful to have a history of the queue length in order to diagnose problems.

27.7.2.1 Symptoms of Message Store Database Lock Problems

Transactions accumulate and are not resolved.

27.7.2.2 To Monitor Message Store Database Locks

Use the command imcheck -s (formerly counterutil -o db_lock).

27.8 Utilities and Tools for Monitoring

The following tools are available for monitoring:

27.8.1 immonitor-access

immonitor-access monitors the status of the following Messaging Server components/processes: Mail Delivery (SMTP server), Message Access and Store (POP and IMAP servers), Directory Service (LDAP server) and HTTP server. This utility measures the response times of the various services and the total round trip time taken to send and retrieve a message. The Directory Service is monitored by looking up a specified user in the directory and measuring the response time. Mail Delivery is monitored by sending a message (SMTP), and the Message Access and Store is monitored by retrieving it. Monitoring the HTTP server is limited to finding out whether or not it is up and running. For complete instructions, refer to immonitor-access in Sun Java System Messaging Server 6.3 Administration Reference.

27.8.2 imcheck

Use imcheck -s to monitor database statistics including logs and transactions.

27.8.3 counterutil

This utility provides statistics acquired from different system counters. Here is a current list of available counter objects:

Each entry represents a counter object and supplies a variety of useful counts for this object.
In this section we will only be discussing the alarm, diskusage, serverresponse, popstat, imapstat, and httpstat counter objects. For details on counterutil command usage, refer to counterutil in Sun Java System Messaging Server 6.3 Administration Reference.

27.8.3.1 counterutil Output

counterutil has a variety of flags. A command format for this utility may be as follows:

    counterutil -o CounterObject -i 5 -n 10

where:

    -o CounterObject   names the counter object: alarm, diskusage, serverresponse, popstat, imapstat, or httpstat.
    -i 5               specifies a 5-second interval.
    -n 10              specifies the number of iterations (default: infinity).

An example of counterutil usage is as follows:
27.8.3.2 Alarm Statistics Using counterutil

These alarm statistics refer to the alarms sent by stored. The alarm counter provides the following statistics:

Table 27–1 counterutil alarm Statistics
27.8.3.3 IMAP, POP, and HTTP Connection Statistics Using counterutil

To get information on the number of current IMAP, POP, and HTTP connections, the number of failed logins, total connections from the start time, and so forth, you can use the command counterutil -o CounterObject -i 5 -n 10, where CounterObject is the counter object popstat, imapstat, or httpstat. The meaning of the imapstat suffixes is shown in Table 27–2. The popstat and httpstat objects provide the same information in the same format and structure.

Table 27–2 counterutil imapstat Statistics
27.8.3.4 Disk Usage Statistics Using counterutil

The command counterutil -o diskusage generates the following information:

Table 27–3 counterutil diskstat Statistics
27.8.3.5 Server Response Statistics

The command counterutil -o serverresponse generates the following information. This information is useful for checking whether the servers are running and how quickly they are responding.

Table 27–4 counterutil serverresponse Statistics
27.8.4 Log Files

Messaging Server logs event records for SMTP, IMAP, POP, and HTTP. The policies for creating and managing the Messaging Server log files are customizable. Since logging can affect server performance, the level of logging should be considered very carefully before the burden is put on the server. Refer to Chapter 25, Managing Logging for more information.

27.8.5 imsimta counters

The MTA accumulates message traffic counters based upon the Mail Monitoring MIB (RFC 1566) for each of its active channels. The channel counters are intended to help indicate the trend and health of your e-mail system. Channel counters are not designed to provide an accurate accounting of message traffic. For precise accounting, see instead MTA logging as discussed in Chapter 25, Managing Logging.

The MTA channel counters are implemented using the lightest weight mechanisms available so that they cause as little impact as possible on actual operation. Channel counters do not try harder: if an attempt to map the section fails, no information is recorded; if one of the locks in the section cannot be obtained almost immediately, no information is recorded; when a system is shut down, the information contained in the in-memory section is lost forever.

The imsimta counters -show command provides MTA channel message statistics (see below). These counters need to be examined over time, noting the minimum values seen. The minimums may actually be negative for some channels. A negative value means that there were messages queued for a channel at the time that its counters were zeroed (for example, when the cluster-wide database of counters was created). When those messages were dequeued, the associated counters for the channel were decremented, leading to a negative minimum. For such a counter, the correct “absolute” value is the current value less the minimum value that counter has ever held since being initialized.
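The correction described above is a single subtraction: absolute value = current value - minimum value seen. A minimal sketch follows, assuming you have already collected one reading of a single counter per line in chronological order; the function name absolute_from_samples is hypothetical and the collection step is left out. Because it only sees the values you sampled, it can miss a lower minimum that occurred between samples.

```shell
#!/bin/sh
# Hypothetical helper (not part of Messaging Server).
# Reads one counter reading per line (oldest first) on stdin and prints the
# latest reading minus the lowest reading seen, that is, the "absolute" value.
absolute_from_samples() {
    awk '
        NR == 1 || $1 + 0 < min { min = $1 + 0 }   # track the lowest reading
        { cur = $1 + 0 }                           # remember the latest reading
        END { if (NR > 0) print cur - min }'
}

# Example: readings 0, -5, -12, 30 give 30 - (-12) = 42.
#   printf '0\n-5\n-12\n30\n' | absolute_from_samples
```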
1) Received is the number of messages enqueued to the channel named tcp_local. That is, the messages enqueued (E records in the mail.log* file) to the tcp_local channel by any other channel.

2) Stored is the number of messages stored in the channel queue to be delivered.

3) Delivered is the number of messages which have been processed (dequeued) by the channel tcp_local. (That is, D records in the mail.log* file.) A dequeue operation may either correspond to a successful delivery (that is, an enqueue to another channel), or to a dequeue due to the message being returned to the sender. This will generally correspond to the number Received minus the number Stored. The MTA also keeps track of how many of the messages were dequeued upon first attempt; this number is shown in parentheses.

4) Submitted is the number of messages enqueued (E records in the mail.log* file) by the channel tcp_local to any other channel.

5) Attempted is the number of messages which have experienced temporary problems in dequeuing, that is, Q or Z records in the mail.log* file.

6) Rejected is the number of attempted enqueues which have been rejected, that is, J records in the mail.log* file.

7) Failed is the number of attempted dequeues which have failed, that is, R records in the mail.log* file.

8) Queue time/count is the average time-spent-in-queue for the delivered messages. This includes both the messages delivered upon the first attempt, see (9), and the messages that required additional delivery attempts (hence typically spent noticeable time waiting fallow in the queue).

9) Queue first time/count is the average time-spent-in-queue for the messages delivered upon the first attempt.

Note that the number of messages submitted can be greater than the number delivered. This is often the case, since each message the channel dequeues (delivers) will result in at least one new message enqueued (submitted) but possibly more than one.
For example, if a message has two recipients reached via different channels, then two enqueues will be required. Or if a message bounces, a copy will go back to the sender and another copy may be sent to the postmaster. Usually that will be two submissions (unless both are reached through the same channel).

More generally, the connection between Submitted and Delivered varies according to the type of channel. For example, in the conversion channel, a message would be enqueued by some other arbitrary channel, and then the conversion channel would process that message, enqueue it to a third channel, and mark the message as dequeued from its own queue. Each individual message takes a path:

    elsewhere -> conversion     E record    Received
    conversion -> elsewhere     E record    Submitted
    conversion                  D record    Delivered

However, for a channel such as tcp_local, which is not a “pass through” but rather has two separate pieces (slave and master), there is no connection between Submitted and Delivered. The Submitted counter has to do with the SMTP server portion of the tcp_local channel, whereas the Delivered counter has to do with the SMTP client portion of the tcp_local channel. Those are two completely separate programs, and the messages travelling through them may be completely separate.

Messages submitted to the SMTP server:

    tcp_local -> elsewhere      E record    Submitted

Messages sent out to other SMTP hosts via the SMTP client:

    elsewhere -> tcp_local      E record    Received
    tcp_local                   D record    Delivered
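Because the counter descriptions above tie each counter to a record type in the mail.log* files (E for enqueues, D for dequeues, Q or Z for temporary problems, J for rejections, R for failures), a rough tally of those record codes can serve as a cross-check on the trend the counters show. The sketch below is an assumption-laden illustration: tally_log_codes is a hypothetical helper, and since the position of the code field depends on which logging options are enabled, it simply takes the first field on each line that is exactly one of those six letters. It does not separate channels; filter the input first if you want per-channel numbers.

```shell
#!/bin/sh
# Hypothetical helper (not part of Messaging Server).
# Reads MTA log lines on stdin and prints how many lines carry each of the
# record codes E, D, Q, J, R, and Z.
# Heuristic: the record code is the first whitespace-separated field that is
# exactly one of those letters; the field position is not assumed.
tally_log_codes() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[EDQJRZ]$/) { count[$i]++; break }
            }
        }
        END {
            n = split("E D Q J R Z", order, " ")
            for (k = 1; k <= n; k++) {
                c = order[k]
                printf "%s %d\n", c, count[c] + 0
            }
        }'
}

# Example:
#   tally_log_codes < msg-svr-base/data/log/mail.log_current
```

A sudden rise in the Q count relative to D is the same signal described in 27.4.2 Monitoring Rate of Delivery Failure.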
27.8.5.1 Implementation on UNIX and NT

For performance reasons, a node running the MTA keeps a cache of channel counters in memory using a shared memory section (UNIX) or shared file-mapping object (NT). As processes on the node enqueue and dequeue messages, they update the counters in this in-memory cache. If the in-memory section does not exist when a channel runs, the section will be created automatically. (The imta start command also creates the in-memory section, if it does not exist.)

The command imta counters -clear or the imta qm command counters clear may be used to reset the counters to zero.

27.8.6 imsimta qm counters

The imsimta qm counters utility displays MTA channel queue message counters. You must be root or mailsrv to run this utility. The output fields are the same as those described in 27.8.5 imsimta counters. See also imsimta counters in Sun Java System Messaging Server 6.3 Administration Reference.

Example:
Every time you restart the MTA, you must run:

    # imsimta counters -create

27.8.7 MTA Monitoring Using SNMP

Messaging Server supports system monitoring through the Simple Network Management Protocol (SNMP). Using an SNMP client (sometimes called a network manager) such as Sun Net Manager or HP OpenView (not provided with this product), you can monitor certain parts of the Messaging Server. Refer to Appendix A, SNMP Support for details.

27.8.8 imquotacheck for Mailbox Quota Checking

You can monitor mailbox quota usage and limits by using the imquotacheck utility. The imquotacheck utility generates a report that lists defined quotas and limits, and provides information on quota usage. For example, the following command lists all user quota information:
The following example shows the quota usage for user sorook:
27.8.9 Monitoring Using msprobe and watcher Functions

Messaging Server provides two processes, watcher and msprobe, to monitor various system services. watcher watches for server crashes and restarts the servers as necessary. msprobe monitors server hangs (unresponsiveness). Specifically, msprobe monitors the following:
watcher and msprobe are controlled by the configutil options shown in Table 27–5. Further information can be found in 4.5 Automatic Restart of Failed or Unresponsive Services.

Table 27–5 msprobe and watcher configutil Options
27.8.9.1 Alarm Messages

msprobe can issue alarms in the form of email messages to the postmaster (see 27.6.1.2 To Monitor imapd, popd and httpd) warning of a specified condition. A sample email alarm sent when a certain threshold is exceeded is shown below:
You can specify how often msprobe monitors disk and server performance, and under what circumstances it sends alarms. This is done by using the configutil command to set the alarm parameters. Table 27–6 shows useful alarm parameters along with their default setting. See configutil Parameters in Sun Java System Messaging Server 6.3 Administration Reference. Table 27–6 Useful Alarm Message configutil Parameters