Foglight Experience Monitor 5.8.1 - User Guide

Creating an alarm

The Alarm Definitions > Edit page contains all configuration options used to define an alarm. The page is divided into sections that describe the alarm properties, each of which is covered in the following sections.

There are two ways to display the Alarm Definitions > Edit page and begin defining a new alarm:

1. Click Add a new Alarm at the bottom of the main Alarm Definitions page, which you access by clicking Alarms > Alarm Definitions.
2. Click Add a new Alarm on a Metric View or Metric Analysis page. Because alarms are based on one of the many metrics recorded by the appliance, this creates a new alarm definition based on the metric in the display and allows you to define the other characteristics of the alarm before saving it. Creating an alarm definition this way bypasses the need to choose a metric, which is outlined in the next section. For more information about these pages, see Metric View and Metric Analysis.

Alarm definitions

The first section on the Alarm Definitions > Edit page defines the alarm. The configuration options that are available depend on the metric that is monitored for exceptions.

Figure 68. Alarm definition

Depending on how you initiated the alarm creation process, the Metric box may be empty. If so, click Choose Metric to open a new window that displays a list of metrics.

When a standard metric is chosen, you also see a statistic box to the right of the metric name.

Select one of five statistics associated with this metric: Mean, Minimum, Maximum, Standard Deviation, or Data Point.

See “Types of metrics” in the Foglight Experience Monitor Metric Reference Guide for information on standard metrics and statistics.

IMPORTANT: In addition to Mean, Minimum, Maximum, and Standard Deviation, the Data Point statistic type can be used with the Application Component, Subnet, Hit, and Page categories. The Data Point statistic allows you to focus the alarm on one of these metric categories and, if the alarm is triggered, write the individual user session to the User Sessions Log. For more information, see Selecting data point statistics with metrics and Using the User Sessions Log.

When a reference counter metric has been chosen, you can see Value configuration options below the metric name. Configure the values in the reference counter that trigger the alarm.

For more information about reference counters, see the Foglight Experience Monitor Metric Reference Guide.

Select a Value for the alarms.

All

Selecting this option causes an exception check to be made for every object in the list. Whenever a counter is updated for any object in the list, the counter's new value is compared against the selected threshold (defined in the next section).

Specific Value

Triggers an alarm only for a specific value in the reference counter. This option only appears if the metric has a fixed set of known values.

The Threshold option allows you to define the threshold that triggers the alarm when it is crossed. Whenever a metric for a resource is updated, the new value is compared against the threshold defined in this step.

The Severity list options allow you to assign a level of severity to the alarm. The severity level should indicate the level of importance assigned to events that trigger this alarm. You must decide which severity level to associate with the issue represented by the alarm.

There are four severity levels, each possessing its own color-coded icon.

Critical (red): indicates an SLA violation or critical event

High (orange): indicates a serious performance problem

Medium (yellow): indicates a potential problem

Low (green): use for alarms that are informational in nature

Absolute thresholds are constant values that you have determined from experience with your site, or that you would like to establish as performance targets. Whenever a metric is updated, the new value is compared against that alarm's threshold to check whether an alarm should be triggered.

Absolute thresholds provide a mechanism whereby service level agreements (SLAs) can be enforced. An SLA is a contract between a customer and a service provider that defines the precise level of service to be provided. For example, a provider of online customer relationship management (CRM) services to a corporation may guarantee a response time of eight seconds or better for any web page that is retrieved by users of the CRM service. By defining alarms on the Metric Analysis > Page > Page End-To-End Time metric with an eight second absolute threshold, a service provider can provide proof that the service is meeting the requirements detailed in the SLA.

When defining an absolute threshold for an alarm, you may choose from the following four options to dictate how the metric is compared to the threshold when the appliance checks to see whether an alarm has occurred.

greater than

When this option is used, an alarm is generated if the metric is greater than the threshold. All metric values are checked throughout the real-time interval each time the metric value changes.

less than

When this option is used, an alarm is generated if the metric is less than the threshold. Count and reference counter metric values are checked only when the real-time interval has completed. All other metric values are checked throughout the real-time interval each time the metric value changes.

within the range of

When this option is used, an alarm is generated if the metric falls within this range (for example, 10-20). Specifically, the metric needs to be greater than the low-end range, and less than the high-end range to generate an alarm. Count and reference counter metric values are checked only when the real-time interval has completed. All other metric values are checked throughout the real-time interval each time the metric value changes.

outside the range of

When this option is used, an alarm is generated if the metric lies outside this range. Specifically, the metric needs to be less than the low-end of the range, or greater than the high-end of the range to generate an alarm. Count and reference counter metric values are checked only when the real-time interval has completed. All other metric values are checked throughout the real-time interval each time the metric value changes.
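The four comparison options can be summarized in a few lines of Python. This is a minimal illustrative sketch, not the appliance's implementation: the function name `violates_absolute` and its parameters are hypothetical, and the range bounds are treated as exclusive because the descriptions above require the metric to be greater than the low end and less than the high end.

```python
def violates_absolute(value, comparison, low, high=None):
    """Return True if `value` counts as an exception for an absolute threshold.

    Hypothetical helper for illustration only. `low` is the single threshold
    for "greater than" / "less than"; the range options use `low` and `high`.
    """
    if comparison == "greater than":
        return value > low
    if comparison == "less than":
        return value < low
    if comparison == "within the range of":
        # Strictly between the two ends of the range.
        return low < value < high
    if comparison == "outside the range of":
        # Strictly below the low end or strictly above the high end.
        return value < low or value > high
    raise ValueError(f"unknown comparison: {comparison!r}")
```

For the SLA example above, a page end-to-end time of 9.2 seconds checked with `violates_absolute(9.2, "greater than", 8.0)` would count as an exception. The sketch does not model when the check runs (in real time or at the end of the interval).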

Historical thresholds are only supported for standard metrics (those with a mean, minimum, maximum, and standard deviation, such as Page End-to-End Time) and counter metrics (for example, Page Download Attempts). These thresholds are calculated internally by the appliance, based on past observed behavior for the metric and resource(s) in question.

This feature is useful when the normal or expected behavior of a resource is unknown. The appliance should be in operation for at least one week before you begin using this type of threshold so that enough data can be gathered to generate realistic historical data.

To set an alarm using a historical threshold, you must provide a percentage to indicate how much the metric is allowed to deviate from the historical threshold before triggering the alarm. A smaller percentage means the metric has less room to deviate from the threshold and increases the chances of an alarm being generated. A larger percentage allows the metric to deviate more and decreases the chances of an alarm being generated.

The historical threshold setting models the typical behavior of a resource based on the daily baseline settings.

The daily baseline is the value that serves as the basis for determining whether the historical threshold has been violated. Daily baseline calculations include metrics only from the same day of the week. For example, the calculation for a Friday baseline includes only the metrics for each Friday found in the preceding 32 days, assuming that you have retained the default setting of 32 days.
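To make the Friday example concrete, the following sketch lists which calendar days would contribute to a daily baseline. It is only an illustration under stated assumptions: the helper name `baseline_days` is hypothetical, the window is assumed to exclude the target day itself, and the way the appliance combines the values from those days is not described here, so the sketch stops at selecting the dates.

```python
from datetime import date, timedelta


def baseline_days(target, retention_days=32):
    """List the dates in the preceding `retention_days` days that fall on
    the same weekday as `target` (hypothetical helper, most recent first)."""
    days = []
    for offset in range(1, retention_days + 1):
        candidate = target - timedelta(days=offset)
        if candidate.weekday() == target.weekday():
            days.append(candidate)
    return days


# Friday, 29 March 2024: the four preceding Fridays fall inside the window.
fridays = baseline_days(date(2024, 3, 29))
```

With the default of 32 days, any target day has exactly four same-weekday predecessors in the window (7, 14, 21, and 28 days earlier).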

Similar to absolute thresholds, there are four options that control how the metric is compared to the threshold.

greater than

The threshold that is used to determine if an alarm should be generated is calculated from the historical baseline by the following formula, where X is the percentage you entered:

Calculated threshold = historical baseline × (1 + X/100)


If the percentage is 0, then the calculated threshold equals the historical baseline. If the metric is greater than this calculated threshold, then an alarm is generated.

less than

The threshold that is used to determine if an alarm should be generated is calculated from the historical baseline by the following formula, where X is the percentage you entered:

Calculated threshold = historical baseline × (1 - X/100)


If the percentage is 0, then the calculated threshold equals the historical baseline. If the metric is less than the threshold, then an alarm is generated.

within the range of

Enter the percentage (X) the metric can deviate from the historical threshold. The threshold range that is used to determine if an alarm should be generated is calculated from the historical baseline by the following formulas:

Calculated low-end of range = historical baseline × (1 - X/100)

Calculated high-end of range = historical baseline × (1 + X/100)


If the metric falls within this range, then an alarm is generated. In other words, if the metric is greater than the calculated low-end of the range and less than the calculated high-end of the range, then an alarm is generated.

outside the range of

Enter the percentage (X) the metric can deviate from the historical threshold. The threshold range that is used to determine if an alarm should be generated is calculated from the historical baseline by the following formulas:

Calculated low-end of range = historical baseline × (1 - X/100)

Calculated high-end of range = historical baseline × (1 + X/100)


If the metric lies outside this range, then an alarm is generated. In other words, if the metric is less than the calculated low-end of the range or greater than the calculated high-end of the range, then an alarm is generated.
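The historical threshold formulas above translate directly into code. The following is a minimal sketch for working through the arithmetic, not the appliance's own logic; the function names `historical_bounds` and `violates_historical` are hypothetical, and the baseline is assumed to be supplied by the caller.

```python
def historical_bounds(baseline, percent):
    """Apply the formulas above: baseline x (1 - X/100) and baseline x (1 + X/100)."""
    low = baseline * (1 - percent / 100)
    high = baseline * (1 + percent / 100)
    return low, high


def violates_historical(value, baseline, percent, comparison):
    """Return True if `value` is an exception relative to the historical baseline.

    Hypothetical helper for illustration only.
    """
    low, high = historical_bounds(baseline, percent)
    if comparison == "greater than":
        return value > high
    if comparison == "less than":
        return value < low
    if comparison == "within the range of":
        return low < value < high
    if comparison == "outside the range of":
        return value < low or value > high
    raise ValueError(f"unknown comparison: {comparison!r}")
```

For example, with a historical baseline of 4.0 seconds and a percentage of 25, the calculated range runs from 3.0 to 5.0 seconds. A value of 5.5 seconds is an exception under "greater than", while 4.5 seconds is not.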

The Resource option in the Definition section lets you specify which resources associated with the metric category can trigger the alarm.

There can be up to three available Resource options, depending on the metric that is chosen.

Any

If this option is chosen, the alarm is triggered for any resource in the category that matches the criteria specified in the alarm.

Choose Resource

Clicking Choose Resource opens a new window containing a list of eligible resources. Selecting a specific resource from this list results in the alarm being triggered only when that specific resource matches the criteria specified for the alarm.

Regular Expression

Any regular expression entered in this field is matched with every resource associated with the metric category. Only resources that match the regular expression are used to determine if an alarm is generated.

Use the Test link to the right of the entry box to generate a list of resources that match the regular expression you have entered.

Whenever a resource that matches this regular expression is updated, the metric selected (earlier in this section) is checked against the alarm threshold. The alarm's Resource field is filled in with the name of the resource that caused the alarm.
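Before saving a Regular Expression resource filter, it can help to try the pattern against a few known resource names. The sketch below does this with Python's standard `re` module. It rests on two assumptions: the appliance's regular expression dialect may differ from Python's, and the pattern is assumed to match anywhere in the resource name (substring semantics). The resource names and the helper `matching_resources` are invented for illustration.

```python
import re


def matching_resources(pattern, resources):
    """Return the resource names in which `pattern` is found anywhere.

    Hypothetical helper; uses re.search, so the pattern is not anchored
    unless it contains ^ or $ itself.
    """
    regex = re.compile(pattern)
    return [name for name in resources if regex.search(name)]


# Made-up page resources for illustration.
resources = [
    "www.example.com/checkout/cart",
    "www.example.com/checkout/pay",
    "www.example.com/home",
]
checkout_pages = matching_resources(r"/checkout/", resources)
```

Here `checkout_pages` contains only the two checkout URLs. The Test link on the page remains the authoritative check, because it evaluates the expression against the resources the appliance has actually recorded.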

Alarm conditions

The Conditions section contains options that help limit the frequency at which alarms are triggered, as well as prevent false positives. The options for setting these conditions vary based on the type of metric that was selected in the Definition section of the Alarm Definitions > Edit page. By default, the following options are available.

Figure 73. Alarm conditions

The first two options (the Enable this alarm, and Do not generate this alarm more than once every X minute(s) check boxes) are always present for any type of metric. By default, newly defined alarms are enabled, and are triggered, at most, every five minutes.

The two option buttons (Trigger alarm only at the end of the 5-minute interval, and Trigger alarm in real time) allow you to specify when the alarm is triggered.

By default, the Trigger alarm only at the end of the 5-minute interval option is selected. This option ensures the alarm threshold is evaluated and possibly triggered at the end of a complete 5-minute interval. This setting is useful for metrics that reflect an average of values (for example, Mean Page Download Time), and are the most relevant at the end of the interval.

The Trigger alarm in real time option allows the alarm to be triggered at any time during the 5-minute interval. This setting is useful for count metrics (for example, Page Download Attempts) that steadily increase during the 5-minute interval and are, therefore, relevant at any time during the interval.

The Trigger alarm only at the end of the session option appears if the metric chosen for the alarm definition is from the User Session category. User Session metrics are recorded continuously but are not stored to the database in five-minute intervals like all of the other metric categories. Choosing this option results in the alarm threshold being evaluated against the metrics for the entire User Session.

When a standard metric is chosen in the Definition section of the alarm definition page, and the default Mean statistic is used, additional options appear for the Trigger alarm in real time option.

Basing the alarm on a standard metric (Page End-To-End Time) enables additional options in the Conditions section. These additional options allow you to fine-tune how alarms are triggered in real time, and are explained in the next two sections.

Throttling is a mechanism for preventing a flood of new alarms when the alarm threshold is frequently violated. Ordinarily, once a problem is reported with an alarm, the administrator does not want a continuous stream of new alarms to show up for the same problem. Repeat alarms that are sent by email can be even more troublesome by filling up the administrator’s mailbox.

One cause of alarm floods is the definition of an improper threshold value. If the selected threshold is too close to the average value for the metric, then the threshold is repeatedly violated during normal operations. Excessive alarms may indicate that an alarm threshold should be adjusted to more accurately reflect abnormal behavior.

Another simple way to prevent alarm floods is to limit how often the same alarm can be generated. The system provides a mechanism for specifying a certain amount of time that must expire before the same alarm can be sent again (Condition 2: Time-Based Throttle).

Another throttling mechanism is built into the system, but is not under the user’s control. For each alarm defined, at most 50 alarms can be generated for a given five minute interval.
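The two throttling mechanisms described above can be pictured as a small gatekeeper that every candidate alarm must pass. The following sketch is purely illustrative and assumes details the text does not state: the class name `AlarmThrottle`, the use of timestamps in seconds, and the `start_interval` call that resets the per-interval count are all hypothetical. It tracks the time-based throttle per resource and applies the fixed cap of 50 alarms per five-minute interval for one alarm definition.

```python
class AlarmThrottle:
    """Illustrative throttle for a single alarm definition (hypothetical)."""

    def __init__(self, min_minutes=5, max_per_interval=50):
        self.min_seconds = min_minutes * 60
        self.max_per_interval = max_per_interval
        self.last_sent = {}          # resource name -> time of its last alarm
        self.sent_in_interval = 0    # alarms generated in the current interval

    def start_interval(self):
        """Reset the per-interval count at the start of a five-minute interval."""
        self.sent_in_interval = 0

    def allow(self, resource, now):
        """Return True and record the alarm if it may be generated at time `now`."""
        if self.sent_in_interval >= self.max_per_interval:
            return False  # built-in cap reached for this interval
        last = self.last_sent.get(resource)
        if last is not None and now - last < self.min_seconds:
            return False  # time-based throttle: too soon for this resource
        self.last_sent[resource] = now
        self.sent_in_interval += 1
        return True
```

With the default five-minute setting, a second exception on the same resource two minutes after the first alarm is suppressed, while an exception on a different resource at the same moment is still allowed.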

False positives typically occur when a small number of exceptions cause an alarm to be generated even though the monitored web site or application is functioning properly. In most cases, you want an alarm to be generated when either a large number of exceptions occur relative to the number of normal events or a smaller number of exceptions continue to occur over long periods of time. Cases where the number of exceptions is excessive or persistent may indicate the beginnings of a failure somewhere in the web site.

The types of events that are checked by the alarm system are determined by the type of metric that was selected for the alarm. A good example is Page End-To-End Time. Every page that is downloaded is considered an event. If the time to download the page exceeds the threshold specified in the alarm, then the event is considered an exception. This exception is then processed to see if an alarm needs to be triggered based on the throttling and false-positive conditions set in the Conditions section of the alarm definition page.

The following conditions control how many exceptions must occur before an alarm is triggered.

Condition 1: Enable the alarm.

This condition simply controls whether the alarm is active. By default, the alarm is enabled. As needed, the alarm can be disabled and then re-enabled later.

Condition 2: Do not generate this alarm more than once every X minute(s).

This condition limits how often an alarm on a specific resource can be triggered. The minimum value is 1 minute.

Condition 3: Only generate this alarm after analyzing X events.

This restriction ensures that there are enough samples to warrant the generation of an alarm. At the start of each five-minute interval, the event count is set to 0. Without this restriction, if the first event analyzed exceeded the threshold, then an alarm would be generated immediately. This would be true even if the next 1000 events did not exceed the threshold.

This type of false positive can be avoided by waiting until at least X events have been processed before checking for exceptions. This condition only comes into play at the start of each five-minute interval. After the minimum number of events has been analyzed, this condition is ignored. This condition is only tested after Condition 2 is met (minimum time has passed).

Condition 4: Only generate this alarm if the threshold is exceeded for X% of the events analyzed.

This restriction prevents a small number of anomalies from generating an alarm.

This condition dictates that before an alarm is triggered, at least X out of 100 events analyzed must exceed the threshold. For example, consider a web page that is downloaded 100 times during a five-minute interval with an alarm defined to trigger if the page end-to-end time exceeds a threshold of 10 seconds and the Condition 4 Percentage is set to 20%. If one of the page downloads took more than 10 seconds, then only 1% of the events have exceeded the threshold and the alarm would not pass this condition (and not be triggered). However, if 25 of the page end-to-end times exceed the 10 second threshold, then that would mean 25% of the events failed and the alarm would meet this condition (and trigger).

This condition is only tested after Conditions 2 and 3 are met (minimum time and minimum number of events).

Condition 5: Only generate a second alarm if the threshold is still exceeded after at least X% more events have been analyzed.

This restriction ensures that a duplicate alarm is only triggered if the failure identified by the first alarm is sustained through a specified number of additional events. For example, consider a web page that is downloaded 1000 times during a five-minute interval, with an alarm defined to trigger if the page end-to-end time exceeds a threshold of 10 seconds, a Condition 4 percentage of 20%, and a Condition 5 percentage of 50%. After 1000 downloads, 25% have exceeded the 10-second threshold, satisfying Condition 4 and causing an alarm to be triggered. How can the system tell whether the problem that triggered the first alarm persists? In this example, Condition 5 says to check again after 50% more events, that is, after 500 additional page downloads have been processed beyond the original 1000. If 20% (from Condition 4) of those additional 500 downloads exceed the threshold, then another alarm can be triggered.

This condition only comes into play after an alarm has been triggered once in a five-minute interval.
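The event-count and percentage conditions can be checked with simple arithmetic. The sketch below reproduces the worked examples above under stated assumptions: the function names are hypothetical, the time-based throttle is left out, and the minimum of 10 events used in the usage note is an invented value.

```python
def passes_event_conditions(events, exceptions, min_events, min_percent):
    """Minimum-events and percentage checks for a first alarm (hypothetical).

    `events` is the number of events analyzed so far in the interval and
    `exceptions` is how many of them exceeded the threshold.
    """
    if events < min_events:
        return False  # not enough samples yet
    # At least min_percent out of every 100 events must be exceptions.
    return exceptions * 100 >= min_percent * events


def may_repeat(events_at_first_alarm, new_events, new_exceptions,
               min_percent, repeat_percent):
    """Repeat-alarm check: enough additional events, still over the percentage."""
    required = events_at_first_alarm * repeat_percent / 100
    if new_events < required:
        return False  # fewer than X% more events analyzed since the first alarm
    return new_exceptions * 100 >= min_percent * new_events
```

With a 20% setting, `passes_event_conditions(100, 1, 10, 20)` is False (1% of events) and `passes_event_conditions(100, 25, 10, 20)` is True (25%). For the repeat example, `may_repeat(1000, 500, 100, 20, 50)` is True: 500 additional downloads have been processed, and 100 of them (20%) exceeded the threshold.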

In organizations that use multiple appliances together, some types of collected data are consolidated on a central Portal, while other types are stored on individual collection appliances.

You can learn more about multi-appliance configurations in the Foglight Experience Monitor Installation and Administration Guide. Refer specifically to Aggregating Metrics for information on which metric categories are stored.

When defining an alarm in a multi-appliance environment, its timing is affected by the location of the stored metric data that triggers the alarm.

When defining an alarm, if you select a metric that is consolidated at the Portal, the Conditions section of the alarm definition page updates to indicate, by default, that the alarm triggers on the consolidated data stored on the Portal.

In this configuration, the alarm can be triggered at most once every five minutes. This minimum wait time exists because metric data is consolidated at the Portal in five-minute intervals. Once data has been aggregated, alarm conditions are checked to determine whether any actions must be taken.

If you wish to trigger an alarm on this consolidated data in real time, select Distributed Monitors from the Trigger alarm on box.

When you select Distributed Monitors from the Trigger alarm on box, the Conditions section of the alarm definition page updates to include a real-time alarm option.

Alarms that are defined to be triggered on aggregative data on Distributed Monitors can respond in real time because they are based on the metric data collected at each Distributed Monitor before they are sent to the Portal Appliance for aggregation every five minutes.

Using this option may be desirable if you would like to diagnose or troubleshoot an issue in real time, but are working with an aggregative metric that is normally consolidated at the Portal Appliance every five minutes.

For example, Server is an aggregative metric category, but you may want to define an alarm for the Hit Errors metric in real time. If you choose to trigger the alarm on Distributed Monitors, then the instant any of your servers passes the defined threshold for hit errors, an alarm is triggered and post-alarm actions take effect immediately.

Avoid setting alarms on Distributed Monitors for enterprise-wide metrics, such as the Command Processing Time Service Level metric, also found in the Service category. An alarm based on such a metric is most effective if it is triggered on all the aggregated data at the Portal Appliance.

When defining an alarm, if you choose an unconsolidated metric (that is, a metric that is not aggregated and stored at the Portal Appliance), the alarm always triggers on the data stored on the Distributed Monitors. This is reflected in the Conditions section of the alarm definition page.

You have the option of triggering the alarm in real time, or in set intervals. Both are possible, since the data on which the alarm is triggered never needs to move through a five-minute aggregation interval.
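The rules in this section for where an alarm is evaluated, and which timing options follow from that, can be condensed into a small decision function. This is only a summary sketch of the behavior described above; the function `trigger_options` and its return format are hypothetical and do not correspond to any appliance API.

```python
def trigger_options(consolidated, trigger_on="Portal"):
    """Return (location, available timing modes) for an alarm definition.

    `consolidated` says whether the metric is aggregated at the Portal;
    `trigger_on` mirrors the choice in the Trigger alarm on box.
    """
    if not consolidated:
        # Unconsolidated metrics are always evaluated on the Distributed Monitors.
        return ("Distributed Monitors", ["interval", "real time"])
    if trigger_on == "Distributed Monitors":
        # Evaluated before the data is sent to the Portal for aggregation.
        return ("Distributed Monitors", ["interval", "real time"])
    # Default for consolidated metrics: evaluated after five-minute aggregation.
    return ("Portal", ["interval"])
```

For a consolidated metric left at its default, `trigger_options(True)` yields only the interval mode, which matches the five-minute minimum described above.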

Alarm actions

When an alarm is triggered, the system can automatically perform a variety of actions. The Actions section of the Alarm Definitions > Edit page allows you to configure what the appliance does after an alarm has been triggered.

Figure 78. Alarm actions

The following sections describe each of the actions that the appliance can perform when an alarm is triggered.

When an alarm is triggered, and this action is enabled, an email is sent to every user in the selected email group. The mail server must be configured properly for the emails to be sent. The appliance attempts to coalesce multiple alarms into a single email so the mail recipient does not get several emails when multiple alarms are triggered during a short time span.

For more information, see “User group” and “Mail server” in the Foglight Experience Monitor Installation and Administration Guide.

The email notification contains the following information for every alarm that is triggered:

When an alarm is triggered, and this action is enabled, the appliance performs a traceroute to the specified destination. Many administrators use traceroutes as a troubleshooting tool when problems are detected in their network. The results of the traceroute are saved with the alarm, and can be viewed at any time by clicking the alarm in the alarm log.

For more information on verifying traceroutes in the Console Setup program, see the explanation of the Troubleshooting console menu under “Troubleshooting” in the Foglight Experience Monitor Installation and Administration Guide.

There are six different traceroute options.

IP

Runs a traceroute to the IP address specified in the entry field (for example, 192.168.1.10).

Domain

Runs a traceroute to the domain name specified in the entry field (for example, “quest.com”).

Gateway

Runs a traceroute to the network gateway configured during the system installation.

Primary DNS Server

Runs a traceroute to the primary DNS server configured during the system installation.

Secondary DNS Server

Runs a traceroute to the secondary DNS server configured during the system installation.

Resource that caused this alarm

If the alarm metric is in the Server, Site, or User Detail categories, then this additional traceroute option is available. When an alarm is triggered on one of these resources (for example, a specific server, web site, or user IP), the system performs a traceroute to the problem resource.

If the alarm category is Server, then the system runs a traceroute to the server IP that caused the alarm. If the alarm category is Site, then the system runs a traceroute to the web site that caused the alarm, using the web site's default IP. If the alarm category is User Detail, then the system runs a traceroute to the client IP that caused the alarm.

When an alarm is triggered, and this action is enabled, an SNMP trap (version 2c) is sent to the SNMP management console. The SNMP server and community must be configured properly for this action to work. For more information, see “SNMP server” in the Foglight Experience Monitor Installation and Administration Guide.

The trap contains information about the alarm, such as the metric value that triggered the alarm and the exact time it occurred. A link to the MIB that defines the fields in the SNMP trap is located on the SNMP Server configuration page.
