Solstice SyMON User's Guide

Understanding and Using Event Rules

4

This chapter provides an overview of how to understand and write Solstice SyMON event rules. It describes:
  • Event rules terminology
  • Tcl rules
  • Hierarchies
  • Simple rules
  • Complex rules
  • How to write event rules
  • How to verify new event rules
  • Reserved words
  • Special characters
  • Debugging tips

Event Rules Terminology

Events are hardware or operating system conditions that may require the attention of a system administrator. Examples of events include: the loss of a CPU or disk, processor overload, or excessive swapping.
The Event Handler subsystem of Solstice SyMON alerts you to events. However, you must first write a rule that defines the event. Rules include a condition and other attributes that define the state of the rule and actions to take.
A condition is an expression that defines when a rule is active. An example of a condition might be a failed board.
An action tells Solstice SyMON what to do when a condition is true, if the condition changes, or if the system shuts down. Actions notify users of a situation that may require attention.
In addition to conditions and actions, a rule may also include other attributes, such as the level, priority, and severity of a rule.
The Event Handler collects data from monitoring agents and evaluates the data against its rules. When a condition is true, the Event Handler logs an event and carries out the appropriate action in the rule. When the condition that generated the event no longer exists, the Event Handler may run a special action such as closing the event. The Event Handler is always running. If it stops and restarts, all previously open events are closed.

Tcl Rules

The event rules are written in the Tcl scripting language. For a definition of Tcl syntax and complete instructions on how to write Tcl scripts, refer to:
  • Tcl and the Tk Toolkit, by John K. Ousterhout, Addison-Wesley Publishing Co.: 1994.
  • Practical Programming in Tcl and Tk, by Brent B. Welch, Prentice Hall: 1995.
  • Tcl Reference Card, available from Specialized Systems Consultants, Inc., P.O. Box 55549, Seattle, WA 98155-0549.

Caution - Do not modify event rules if you do not know Tcl.

For additional information, contact software support at Sun Microsystems, Inc. at 1-800-USA-4SUN or 1-800-872-4786.
Solstice SyMON includes a set of predefined rules in the Tcl variable RULES. The rule files are located in the /etc/opt/SUNWsymon directory. The following rule files are included:
  • rules.tcl -- organizer
  • cprules.tcl -- capacity planning rules
  • hwrules.tcl -- hardware monitoring rules
  • syrules.tcl -- Solstice SyMON monitoring rules
  • egrules.tcl -- Event Handler rules
  • pfrules.tcl -- predictive failure rules
  • swrules.tcl -- system monitoring rules
Writing rules is simplified by using a set of Tcl commands that tell Solstice SyMON what to do in a given situation. For example, the Tcl alarm command tells Solstice SyMON what to do to make an event active. The Event Handler reads the rules file.
You may expand Tcl to include your own procedures by editing the event_gen.tcl file.
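As an illustration of the kind of helper procedure that could be added to event_gen.tcl, the following sketch defines a procedure that computes a percentage of a value. The procedure name pct_of is hypothetical and is not part of Solstice SyMON; the sketch uses only standard Tcl commands.

```tcl
# Hypothetical helper (not shipped with Solstice SyMON):
# returns pct percent of total, for example to build a threshold.
proc pct_of { pct total } {
    return [ expr { ($pct / 100.0) * $total } ]
}

# Example use in plain Tcl: 10 percent of 2048
set threshold [ pct_of 10 2048 ]
puts "threshold=$threshold"
```

Assuming that procedures defined in event_gen.tcl are visible to rule scripts, a condition could then call the helper as pct_of 10 [ findvalue KernelReader.mem.swap_total ].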

Special Characters

Tcl includes these special characters:
  • " (quotes)
  • { } (curly brackets)
  • [ ] (square brackets)
  • # (pound sign)
  • * (asterisk)
For more information on these characters, refer to the Tcl reference material in the "Tcl Rules" section.
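The following plain Tcl sketch shows how each special character behaves. It can be run in any Tcl shell and does not use Solstice SyMON commands; the variable names are chosen only for illustration. The pound sign introduces a comment.

```tcl
# Quotes group words into one value; $variables and [commands] are substituted.
set cpu "cpu1"
set path "KernelReader.cpu.$cpu.busy"

# Curly brackets group words without any substitution.
set literal {KernelReader.cpu.$cpu.busy}

# Square brackets run a command and substitute its result.
set depth [ llength [ split $path . ] ]

# An asterisk acts as a wildcard in glob-style patterns.
set matched [ string match KernelReader.cpu.*.busy $path ]

puts "$path $literal $depth $matched"
```

Here path contains the substituted name KernelReader.cpu.cpu1.busy, while literal still contains the text $cpu because curly brackets suppress substitution.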

Reserved Words

Table 4-1 describes the Tcl reserved variable names, which contain certain values that have meaning for the Event Handler. You cannot redefine these variable names.
Table 4-1
Variable Name         Meaning
symond_status         Tells whether the Solstice SyMON daemon is running
server_status         Tells whether the monitored system is up
LogScanner_status     Tells if the Log Scanner agent is running
ConfigReader_status   Tells if the Config Reader agent is running
KernelReader_status   Tells if the Kernel Reader agent is running
node                  Contains the hierarchy that is being evaluated in the rule
myroot                Root node for the Event Handler hierarchy; it should not be changed by the user

Event Rule Attributes

Each attribute, with the exception of the condition attribute, has a label followed by a value. The condition attribute does not have a label and is a mandatory attribute for each rule.
Table 4-2 contains a list of attribute descriptions.
Table 4-2
Name            Value        Description
RULE            Integer      Rule number
ON_OPEN         Tcl script   Tcl string to be interpreted when the condition of the rule becomes true
ON_CLOSE        Tcl script   Tcl string to be interpreted when the condition of a rule is no longer true
ON_CONTINUE     Tcl script   Tcl string to be interpreted when the condition of a rule continues to be true
ON_SHUTDOWN     Tcl script   Tcl string to be interpreted when the Event Handler shuts down
PARAMETERS      String       User-defined parameters
MULTI           Tcl script   Tcl string to provide a list of data node variables for multiple events of a single rule
SEVERITY        Integer      Severity of the event (user-defined and interpreted)
PRIORITY        Integer      Priority for the event (user-defined and interpreted)
RATE            Integer      For the user-defined rule sampling rate, in seconds
COMMENTS or C   String       Comments field
LOG_RULES       Log script   Tcl string that defines the log file scanner activity that supports rules
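The following sketch shows how several of these attributes might be combined in one rule. It is modeled on the rule examples later in this chapter and has not been verified against a running Event Handler. The rule number, threshold, severity, priority, rate, and message text are hypothetical values chosen for illustration.

```tcl
{
        RULE 101
        COMMENTS "Hypothetical example: CPU1 more than 90 percent busy"
        { expr { [ findvalue KernelReader.cpu.cpu1.busy ] > 90 } }
        ON_OPEN { alarm YELLOW KernelReader.cpu.cpu1.busy "CPU1 is busy" "" }
        ON_CONTINUE { syslog "CPU1 is still busy" }
        ON_CLOSE { end_alarm }
        SEVERITY 3
        PRIORITY 2
        RATE 30
}
```

The condition is the only attribute without a label. Every other attribute is a label followed by its value.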

Rule Functions

A rule function is a Tcl command used in Tcl scripts associated with the RULES variable. Table 4-3 lists common commands.
Table 4-3
Command         Description
alarm           Makes an event active, highlights the hierarchy node (RED, YELLOW, or BLUE), and creates entries for the event in both the Event Log and the Event Viewer (with a predefined message). The syntax is alarm level node "message" "Tcl command".
end_alarm       Closes the log entry in the Event Log. The syntax is end_alarm.
findlist        Builds the list of matching nodes. The syntax is findlist hierarchypath striplist.
findvalue       Takes the name of any data hierarchy variable and returns the value of the variable. The syntax is findvalue node (where node is the hierarchy endpoint).
getfield        Returns the internal data value from a rule or a rule-node combination (list from a MULTI string). The syntax is getfield optional_rule_# field_type. For example, { [getfield COUNT] == 1}. The following field specifiers can be used with getfield:
                - ACTIVE: Returns true or false, depending on whether the rule is currently active
                - COUNT: Returns the number of consecutive iterations of the rule being active
                - PRIORITY: Returns the priority value of the rule
                - SEVERITY: Returns the severity value of the rule
                - RULE: Returns the rule number
                - START_TIME: Returns the time the rule became active
putfield        Takes a field name and data and assigns that data to the field of the current rule. The syntax is putfield optionalrule# fieldtype "value".
get_parameter   Manages a parameter list; used for MULTI rules that need to maintain historical records. The syntax is get_parameter string.
put_parameter   Manages a parameter list; used for MULTI rules that need to maintain historical records. The syntax is put_parameter tag value.
gettime         Gets the sample time on the monitored machine as a long integer. The syntax is gettime.
dynlink         Takes a shared object and a procedure name, dynamically links the shared object, and calls the named procedure. The syntax is dynlink file functionname.
mailto          Sends a message to a specified name by email. The syntax is mailto address "msgstring".
syslog          Writes the specified string to the syslog. The syntax is syslog "message".
snmp            Initiates SNM traps. snmp takes a string as an argument and generates snmp traps on every machine in the snmp_hosts variable. snmp_hosts is defined in event_gen.tcl. The syntax is snmp "message".
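As a sketch of how these commands might work together, the following rule sends mail once a condition has stayed true for several consecutive samples. It follows the syntax given in Table 4-3 but has not been verified against a running Event Handler. The rule number, the count of 5, and the mail address root are hypothetical, and the sketch assumes that getfield COUNT can be read from an ON_CONTINUE script.

```tcl
{
        RULE 102
        { expr { "$server_status" == "dead" } }
        ON_OPEN { alarm RED "" "Server not responding" "" }
        ON_CONTINUE {
                if { [ getfield COUNT ] == 5 } {
                        mailto root "Server still not responding after 5 samples"
                }
        }
        ON_CLOSE { end_alarm }
        SEVERITY 1
        PRIORITY 1
}
```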

Hierarchies

Hierarchies organize data in Solstice SyMON to handle the grouping of related pieces of information. A hierarchy has a:
  • Top level node representing all data from an agent
  • Subset of the top level node for classes of data
  • Subnode for closely related data and properties containing the actual data
Each node and subnode organizes the data beneath it. For example, in:

  KernelReader.cpu.cpu1.busy  

  • The top level KernelReader indicates that this is KernelReader data.
  • KernelReader.cpu indicates that this is CPU data.
  • KernelReader.cpu.cpu1 indicates that this data is related to CPU1.
  • KernelReader.cpu.cpu1.busy is the percentage of time that CPU1 was busy.
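The levels of a hierarchy path can be illustrated with a few lines of plain Tcl. This sketch uses only standard Tcl commands and runs in any Tcl shell; the variable names are chosen for illustration and are not Solstice SyMON names.

```tcl
# Split a hierarchy path into its levels using standard Tcl commands.
set path KernelReader.cpu.cpu1.busy
set levels [ split $path . ]

set agent    [ lindex $levels 0 ]      ;# top level node: KernelReader
set class    [ lindex $levels 1 ]      ;# class of data: cpu
set subnode  [ lindex $levels 2 ]      ;# subnode: cpu1
set property [ lindex $levels end ]    ;# property holding the data: busy

puts "agent=$agent class=$class subnode=$subnode property=$property"
```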
For more information on what data is available, refer to the Solstice SyMON man pages. For detailed information on the Kernel Reader, see Appendix A, "Kernel Reader."
The following sections present examples of simple and complex rules.

Simple Rules

A simple rule checks one or more variables in a simple condition and generates one event if the condition is true. Code Example 4-1 is an example of a simple rule.

  {  
          RULE 2  
          { expr { "$server_status" == "dead" } }  
          ON_OPEN { alarm RED "" "Server not responding" "" }  
          ON_CLOSE { end_alarm }  
          SEVERITY 1  
          PRIORITY 1  
  }  

Code Example 4-1 Simple Rule Example
The first attribute of the rule in Code Example 4-1 is the rule number.
The second attribute is the condition. This condition checks whether the server_status variable is equal to "dead". If it is, the rule is active. expr is a Tcl command that evaluates an expression and returns a value. Tcl variables such as $server_status in Code Example 4-1 are Event Handler variables.
The third and fourth attributes are a set of actions that are carried out as the conditions of the rule change. When the rule becomes active, the rule triggers an alarm with the RED condition. The predefined Tcl alarm command activates an event and writes an entry into the Event Log. When the condition is no longer true (ON_CLOSE), the rule triggers end_alarm, which closes the log entry in the Event Log, removes any highlighting in the GUI, and deletes the event from the open event list.
The SEVERITY and PRIORITY of the rule are the last two attributes in the rule. These are numeric values defined by the user.
The following rule in Code Example 4-2 is slightly more complicated. It defines a swap space event:

  {
          RULE 18
          {
                  # ts is the threshold: 10 percent of total swap space
                  set ts [ expr { 0.10 * [ findvalue KernelReader.mem.swap_total ] } ]
                  expr { [ findvalue KernelReader.mem.swap_free ] < $ts }
          }
          ON_OPEN { alarm RED KernelReader.mem.swap_free "Serious Swap Problem" "" }
          ON_CLOSE { end_alarm }
          SEVERITY 2
          PRIORITY 1
  }

Code Example 4-2 Rule 18: Monitoring Swap Space
The first attribute in this rule is the rule number.
The next attribute is the condition. The condition does the following:
  • Sets the variable ts (the threshold) to 10 percent of the total swap space available on the machine. KernelReader.mem.swap_total is a performance property in the data hierarchy of Solstice SyMON; findvalue is a predefined Tcl command that returns the value of the performance variable
  • Finds out how much swap space is free and unused
  • Compares the values for unused swap space and total swap space; if the unused swap space is less than 10 percent of total swap space, there is a potential problem
The ON_OPEN attribute tells the Event Handler to generate a RED alarm by calling the alarm function with the RED argument. This highlights a node on the Kernel Data Catalog and adds the event to the Event Log. All arguments to the alarm function are mandatory. If an argument is not used, replace it with a pair of empty double quotes ("").
The ON_CLOSE attribute tells Solstice SyMON what to do when the condition becomes false; end_alarm closes the log entry in the Event Log and the event in the Event Viewer.

Note - Always explicitly close an alarm with the end_alarm function. An alarm does not automatically close when the condition goes away.

The SEVERITY and PRIORITY of the rules are the last two attributes. These are both user-defined and interpreted.
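The arithmetic in the condition can be tried outside Solstice SyMON with a stand-in for findvalue. In the following sketch, the findvalue procedure is a stub defined only for this example, and the swap values 2048 and 150 are made-up sample data in arbitrary units.

```tcl
# Stand-in for the Solstice SyMON findvalue command, returning
# made-up sample values so the condition can run in a plain Tcl shell.
proc findvalue { node } {
    array set sample {
        KernelReader.mem.swap_total 2048
        KernelReader.mem.swap_free  150
    }
    return $sample($node)
}

# Same logic as the condition of Rule 18.
set ts [ expr { 0.10 * [ findvalue KernelReader.mem.swap_total ] } ]
set active [ expr { [ findvalue KernelReader.mem.swap_free ] < $ts } ]

puts "threshold=$ts active=$active"
```

With these sample values, the threshold is 204.8 and the free swap space of 150 is below it, so the condition evaluates to 1 (true) and the rule would become active.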

Complex Rules

A complex rule checks the condition against several hierarchy nodes. If the condition is true for any node, the rule generates an event for that node.
Complex rules eliminate the need to write many simple rules that check the same condition. For example, if you use simple rules to check the condition of each CPU on a server, you must write many redundant simple rules.
Code Example 4-3 is an example of a complex rule:

  {  
          RULE 1  
          MULTI { expr { [ findlist system.*.*.*.status "" ] } }  
          {  
                  set boardstatus [ findconfigvalue $node ]  
                  expr { "$boardstatus" == "failure detected" }  
          }  
          ON_OPEN { alarm RED $node "Board failure detected" "" }  
          ON_CLOSE { end_alarm }  
          SEVERITY 1  
          PRIORITY 1  
  }  

Code Example 4-3 Complex Rule Example
Rule 1 in Code Example 4-3 examines the status of all boards in a system. The condition of the rule is that the board has failed. If the board fails, the Event Handler opens a RED event and sends a message to the Event Log (ON_OPEN). When the condition is no longer true (ON_CLOSE), the Event Handler executes the end_alarm procedure.
The MULTI attribute is a Tcl script that evaluates an expression and creates a list of nodes. The condition of the rule is run once for each node in the list, and the Tcl node variable is assigned the value of the node being processed or evaluated. Any other Tcl script associated with the rule can access this value with $node. The normal approach is to use the findlist function to generate a list of nodes and check their values in a rule. A findlist must be present in a MULTI attribute.
The following statement, taken from Code Example 4-3, generates a list of all data items that start with system and end with status.

  MULTI { expr { [ findlist system.*.*.*.status "" ] } }  
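
As a further sketch, the same pattern could replace a set of per-CPU simple rules with one complex rule. This example is modeled on Code Example 4-3 and has not been verified against a running Event Handler. It assumes that the pattern KernelReader.cpu.*.busy matches one busy property per CPU. The rule number, threshold, severity, priority, and message text are hypothetical.

```tcl
{
        RULE 103
        MULTI { expr { [ findlist KernelReader.cpu.*.busy "" ] } }
        { expr { [ findvalue $node ] > 90 } }
        ON_OPEN { alarm YELLOW $node "CPU is busy" "" }
        ON_CLOSE { end_alarm }
        SEVERITY 3
        PRIORITY 2
}
```

The condition runs once for each node in the list, and $node names the CPU property being evaluated, so each busy CPU would generate its own event.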

Writing Event Rules

You can create rules against data from any of the three Solstice SyMON agents: Config Reader, Kernel Reader, and Log File Scanner. The first two agents provide data continuously to the Event Handler. You only need the path name to the variable.
For the Log File Scanner, you must pre-define the messages that are sent to the Event Handler for examination.
Here are a few guidelines to keep in mind when writing event rules:
  • A rule is a Tcl variable.
  • The Event Handler rule attributes and labels are strings to Tcl.
  • A Tcl variable is a string or list of strings.
  • Rules are free form. They require only attribute (label and data) pairs, which can appear in any order.
  • Attributes are separated by new lines or spaces.
  • Each attribute consists of one or two components. The first component is the name (label) and the second component is the associated value. The exception to this rule is the condition, which does not have a label.
  • An attribute's components are separated by spaces, tabs, or new lines.
  • Only the last instance of each label is used per rule. All others are ignored.
  • Spaces are allowed in an attribute's data if the data is enclosed in quotes.
  • Each label may have one unique item.
  • Each Tcl command string is enclosed in curly brackets.
  • A condition is the only mandatory attribute in a rule.
  • A condition of a rule can look at time or other values in the environment. For example, "set current_time [gettime]" sets the user-defined variable, current_time, to the current time on the monitored machine.
  • A rule does not have to include an action.
  • A rule can set a value that is used in other rules.
  • All rules are combined into a single Tcl variable called RULES.
Solstice SyMON can execute phone_home scripts.
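Several of the guidelines above can be illustrated in plain Tcl by treating a rule as an ordinary list. The following sketch is not how the Event Handler is implemented. It is a small walker, written for this example only, that separates labeled attributes from the unlabeled condition. The label list is taken from Table 4-2, and CONDITION is a key invented for the sketch.

```tcl
# A rule is an ordinary Tcl list: labels and values are just strings.
set rule {
        RULE 2
        { expr { "$server_status" == "dead" } }
        ON_OPEN { alarm RED "" "Server not responding" "" }
        ON_CLOSE { end_alarm }
        SEVERITY 1
        PRIORITY 1
}

# Labels taken from Table 4-2; this walker is illustrative only.
set labels { RULE ON_OPEN ON_CLOSE ON_CONTINUE ON_SHUTDOWN PARAMETERS
             MULTI SEVERITY PRIORITY RATE COMMENTS C LOG_RULES }

array set attr {}
set i 0
while { $i < [ llength $rule ] } {
    set item [ lindex $rule $i ]
    if { [ lsearch -exact $labels $item ] >= 0 } {
        # Labeled attribute: the next element is its value.
        # A repeated label overwrites the earlier value.
        set attr($item) [ lindex $rule [ incr i ] ]
    } else {
        # The only unlabeled attribute is the condition.
        set attr(CONDITION) $item
    }
    incr i
}

puts "rule $attr(RULE): severity $attr(SEVERITY), condition:$attr(CONDITION)"
```

Because the walker assigns by label, the attributes could appear in any order, and only the last instance of a repeated label would be kept.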

Verifying New Event Rules

Solstice SyMON includes the special verify_rules command, which checks the Tcl syntax of rules. The command takes three optional arguments, which are described in Table 4-4.
Table 4-4 verify_rules
Argument   Description
-R         Checks the file that contains the EVENTS variable; the default is rules.tcl
-I         Checks the file that contains supporting functions; the default is event_gen.tcl
-o         Gives more verbose output
To verify rules, enter:

  $ verify_rules filename  

The filename entry is optional. The default filename is rules.tcl.
When you run verify_rules and the rules are correct, the program responds, "GOOD RULES." If the program detects a syntax error, the program responds, "BAD RULES." A "GOOD RULES" response confirms only that the syntax is valid; it does not guarantee that the rule will work as intended.
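Before running verify_rules, a rough check for unbalanced brackets or quotes can be made in any Tcl shell with the standard Tcl command info complete. This sketch is not a substitute for verify_rules and says nothing about how verify_rules works; it only reports whether a script fragment is syntactically complete. The variable names good and bad are chosen for illustration.

```tcl
# Rough pre-check with the standard Tcl command "info complete", which
# reports whether a script has balanced braces, brackets, and quotes.
set good { expr { "$server_status" == "dead" } }
set bad  "expr \{ \"\$server_status\" == \"dead\" "

puts "good: [ info complete $good ]"
puts "bad:  [ info complete $bad ]"
```

The first fragment is complete, so info complete returns 1. The second is missing its closing curly bracket, so info complete returns 0.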

Activating New or Modified Events

To activate new or modified rules, send the signal SIGHUP to the Event Handler or restart the Event Handler. Send SIGHUP by invoking the following command:

  % kill -HUP pid  

where pid is the process ID number of the Event Generator, sm_egd. For more information, see the kill man page.

Debugging Tips

Use the following debugging tips for event rules:
  • Run verify_rules to make sure the syntax is accurate.
  • If there is a problem with the rules, change the rules.tcl file so you test only one rule at a time.
  • Use this event log file to search for error messages: /var/opt/SUNWsymon/machine_name/event_log.
  • Use the syslog command to log variable values and to confirm the values.
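As a sketch of the last tip, the swap space rule from Code Example 4-2 could be rewritten temporarily so that it logs the values it reads. This version has not been verified against a running Event Handler. It assumes that syslog may be called from within a condition, and the message text is hypothetical. Because the message would be written on every evaluation of the rule, remove the syslog line once the values are confirmed.

```tcl
{
        RULE 18
        {
                set total [ findvalue KernelReader.mem.swap_total ]
                set free  [ findvalue KernelReader.mem.swap_free ]
                syslog "Rule 18 debug: swap_total=$total swap_free=$free"
                expr { $free < 0.10 * $total }
        }
        ON_OPEN { alarm RED KernelReader.mem.swap_free "Serious Swap Problem" "" }
        ON_CLOSE { end_alarm }
        SEVERITY 2
        PRIORITY 1
}
```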