Investigating and troubleshooting dashboard metric, rule, and alarm issues (multiple causes) (4308244)

Description
What generic steps can be followed to investigate and troubleshoot database agent metric, rule, and alarm issues such as the following?
- database agent rules and alarms not firing
- metrics showing incorrect values
- data missing from the dashboards or alarms
- delayed alarms
- duplicate alarms being fired
What is the process flow for data to be collected, processed, and acted on by Foglight?
Cause

Causes for other issue resolutions (listed below)

Cause 1

Threshold is too high for a rule to fire.

Cause 2

Email is not configured or is disabled in the database agent administration panel (Databases | Settings | Administration | Alarms).

Cause 3 *not relevant for Foglight Cloud*

Disabled rule(s) in the Rules Management dashboard

Cause 4 *not relevant for Foglight Cloud*

Alarms Service is stopped.

Cause 5 *not relevant for Foglight Cloud*

FMS lacks sufficient resources to process monitoring data or generate Alarms or Email Notifications.

Cause 6

No matching data has been collected to trigger an alarm.

Cause 7

Blackout is configured

Cause 8 *not relevant for Foglight Cloud*

High number of alarms in the alarms table

Cause 9 *not relevant for Foglight Cloud*

Email received is from a different FMS

Cause 10 *not relevant for Foglight Cloud*

Customized or multiple (custom) copies of the same rule.

Cause 11

Unsuccessful connection to host or database

Cause 12

The collection interval is longer than normal (the collection runs infrequently), or the event occurs between agent collections.

Cause 13

Email server settings not configured

Cause 14

Oracle Alert log filtering or SQL Server Error log filtering for individual and summary alarms is set to fire only for Fatal, or is set to OFF.

Cause 15 *not relevant for Foglight Cloud*

Rule is cloned from another rule and is misconfigured or cannot be managed using the Database Administration UI.

Cause 16

Data issues from the query used by collection

Cause 17

Some issues may have been temporary, for example due to a long-running query, intensive use of the TEMP tablespace, or a tablespace that autoextended after the alarm fired.

Cause 18

Alarm is disabled in the alarm template that is assigned to the target.
Resolution
When data is shown incorrectly in a dashboard or an alarm misfires, the first step of the investigation is to determine whether the timeouts or missing data are caused by a collection issue, a configuration setting, the rule, or the FMS itself.

The general flow is:
1. Check the agent UI for missing data or timeouts on each specific screen/collection.
2. Review the metrics collected in the Foglight panels or in the topology, keeping the purpose of the collection in mind.
3. Check the agent log for a failing collection.
4. Check the agent configuration.
5. Review the current and historical alarms.
6. Review the rule used to fire the alarm.
Metrics

All metrics are collected from the monitored host, submitted to the FMS, and registered to Metric objects.

A Metric is an object in the FMS that holds the data in several data segments:
- History segment - contains all of the information collected for that metric
- Latest segment - contains the latest entry for the metric, regardless of the selected time period
- Period segment - contains the average value of the metric over the selected time period (for example, the average CPU usage)
- Current segment - contains the last entry in the metric within the selected time period
The Latest, Period, and Current segments are derived from the History segment of the metric.
The Current and Latest segment entries are the same when the user selects a "Last..." time period, such as "Last hour".
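
The relationship between the segments can be illustrated with a small sketch. This is not Foglight code: the `Sample` type, the `segments` function, and the sample values are hypothetical and only model the definitions above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    timestamp: int  # seconds since an arbitrary start (hypothetical unit)
    value: float


def segments(history, start, end):
    """Derive the Latest, Period, and Current segments from the History segment."""
    in_range = [s for s in history if start <= s.timestamp <= end]
    # Latest ignores the selected time period.
    latest = max(history, key=lambda s: s.timestamp)
    # Current is the last entry inside the selected time period.
    current = max(in_range, key=lambda s: s.timestamp) if in_range else None
    # Period is the average value inside the selected time period.
    period = mean(s.value for s in in_range) if in_range else None
    return {"latest": latest, "period": period, "current": current}


history = [Sample(0, 10.0), Sample(300, 30.0), Sample(600, 50.0), Sample(900, 70.0)]
older_range = segments(history, 0, 600)    # a period that ends before the newest sample
last_range = segments(history, 300, 900)   # a "Last..." style period ending at the newest sample
```

With an older time range, Current and Latest differ; with a range that ends at the newest sample, they refer to the same entry.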

INVESTIGATION: To investigate "raw" metrics, users can drill down into the FMS topology through various means including Configuration | Data or Administration | Tooling | Script Console

Collections

A collection is typically a SQL statement or OS command (or set of commands) that a database agent runs from the FglAM against the monitored host.
- If the agent retrieves the data successfully, typically no messages are written to the database agent log unless the FglAM is running in "debug" mode.
- If a collection fails or encounters an error, an error message is written to the agent log file indicating a proximate cause such as a query timeout, a data issue (such as variable size), or incorrect permissions.
Agent Managers (and some agent types) can be run in debug mode to write additional details to the agent log files.

INVESTIGATION: Review the database agent log file for any query errors that correspond to the collection being investigated (e.g. Top SQL). Analogous SQL or OS queries can also be run directly against the monitored host to see the results outside of Foglight.
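
A log review of this kind can be scripted once the log file has been copied from the Agent Manager. The following is a minimal sketch only: the log line layout, the ERROR/WARN keywords, and the `find_collection_errors` helper are assumptions made for illustration, not the actual Foglight log format.

```python
import re


def find_collection_errors(log_lines, collection_name):
    """Return log lines that look like errors and mention the given collection."""
    level = re.compile(r"\b(ERROR|WARN)\b")  # assumed severity keywords
    needle = collection_name.lower()
    return [
        line.rstrip("\n")
        for line in log_lines
        if level.search(line) and needle in line.lower()
    ]


# Hypothetical log excerpt; a real agent log will be laid out differently.
sample_log = [
    "2024-12-18 10:00:01 INFO  Collection 'Top SQL' started",
    "2024-12-18 10:01:01 ERROR Collection 'Top SQL' failed: query timeout",
    "2024-12-18 10:02:00 ERROR Collection 'Tablespaces' failed: insufficient privileges",
]
errors = find_collection_errors(sample_log, "Top SQL")
```

In practice the lines would come from `open(path)` on the copied log file, and the keywords would be adjusted to match what the agent actually writes.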

Agent Configuration

Each agent is managed by the settings configured in the Agent Status Properties for each agent. These settings include
- connection details (e.g. hostname, user id, password, SSL)
- collection frequencies (timeouts and how often the collection runs)
- other collection specific values (e.g. batch size) or flags (enable/disable a setting)
Many database agents (e.g. Oracle, SQL Server, DB2) list all of the agent configuration settings each time a database agent is restarted.

INVESTIGATION: Review the current settings for the agent in question and make relevant adjustments such as
- increasing the number of rows collected
- increasing or decreasing how often the collection runs
- increasing the length of time before a query times out
- using the validate connection option in some agents to reset the username and password
Alarms

If a rule processes the Metric data and determines that a threshold has been exceeded (for example, CPU usage is over 80, where 80 is the threshold beyond which an Alarm should be fired), an Alarm is triggered.

An Alarm is fired according to each Rule's definition (thresholds, Baseline thresholds, or Boolean condition).

Rules with thresholds or Baseline thresholds can trigger an Alarm with the following severities:

Color	Severity
Red	Fatal
Orange	Critical
Yellow	Warning

These are Multiple Severity Alarms.
- A Rule based on Baseline thresholds fires according to the Baseline thresholds.
  Baseline thresholds have their own learning curve and internal algorithm and cannot be modified.
  Baseline Alarms are considered Multiple Severity Alarms.
- Besides the Multiple Severity Alarms, there are Simple Rule alarms.
  These alarms have only one severity defined and therefore only one color (yellow, for most alarms).
INVESTIGATION: Review the details of the alarm message itself and look at the alarm history to identify any patterns.
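
As an illustration of how a multiple-severity threshold rule maps a metric value to one of the severities listed in this section, consider the sketch below. It is not the FMS rule engine: the `severity_for` function and the Critical and Fatal values are hypothetical, and only the Warning value of 80 comes from the CPU example above.

```python
def severity_for(value, thresholds):
    """Return the highest severity whose threshold the value exceeds, or None."""
    for name in ("Fatal", "Critical", "Warning"):  # checked from most to least severe
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            return name
    return None  # no threshold exceeded, so no alarm


# Warning at 80 follows the CPU example; 90 and 95 are made-up values.
cpu_thresholds = {"Warning": 80, "Critical": 90, "Fatal": 95}
```

A value of exactly 80 does not fire in this sketch because the example describes usage "over 80". A threshold set above the values the host ever reaches (Cause 1) therefore never produces an alarm.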

Rules

The data gathered in each Metric and its data segments is constantly checked by the Rules.
A rule is defined for each metric, and the rule checks the metric data in order to alert the user if required.
The alert takes the form of an Alarm and an email notification (if configured).
To determine whether an Alarm should be triggered, the Rule uses either thresholds or a Boolean condition.

The types of Rules are:
- Rules that check strict numeric thresholds
- Rules that check Baseline thresholds
- Rules that check Boolean conditions (such as yes/no or enabled/disabled)
INVESTIGATION: Review the design of the rule and compare it to an "out of the box" copy of the rule to identify whether the rule has been modified by a user. If the rule has been customized, restore it to its original design to confirm whether the issue still exists. Also review and adjust the thresholds that the rule uses to trigger alarm conditions.
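
If the text of both rule versions can be copied out of the Rules Management dashboard, the comparison can be done with a line diff. The sketch below assumes that the rule definitions are available as plain text; the `rule_differences` helper and the sample rule text are hypothetical and do not reflect the real rule syntax.

```python
import difflib


def rule_differences(original, current):
    """Return only the removed ('- ') and added ('+ ') lines between two rule texts."""
    diff = difflib.ndiff(original.splitlines(), current.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Made-up rule text for illustration only.
out_of_the_box = "severity: Warning\nthreshold: 80\nemail: enabled"
customized = "severity: Warning\nthreshold: 95\nemail: enabled"
changes = rule_differences(out_of_the_box, customized)
```

An empty result suggests that the rule matches the out-of-the-box copy. Any returned lines point at the conditions or thresholds to review first.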

Other relevant issue resolutions

Resolution 1

Review the thresholds set for the alarm to fire in the Alarm Template that has been assigned to this target agent.

* For testing purposes, temporarily set the threshold to a very low value (for example 1%) to confirm that the alarm is working and fires.

Resolution 2

Enable the Alarms Email Notifications as per KB 4310657.
Resolution 3

Enable the rules in the Rules Management dashboard by navigating to Administration | Rules & Notifications | Rules

Resolution 4

Restart the FMS process

or
1. Log in to the JMX console as the 'Foglight' user to (re)start the Alarm Service.
2. Browse to: http:///jmx-console/HtmlAdaptor?action=inspectMBean&name=com.quest.nitro%3Aservice%3DAlarm
3. Search for and invoke the void stop() operation, then the void start() operation, on the MBean.
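
The `name` parameter in the JMX console URL above is the percent-encoded MBean name `com.quest.nitro:service=Alarm`. The sketch below shows how that encoding is produced. The host and port passed in are placeholders (assumptions) and must be replaced with the address of the FMS in question.

```python
from urllib.parse import quote


def alarm_mbean_url(host_and_port):
    """Build the JMX console URL for the Alarm service MBean."""
    mbean_name = "com.quest.nitro:service=Alarm"
    # safe="" forces ':' and '=' to be encoded as %3A and %3D.
    encoded = quote(mbean_name, safe="")
    return (
        f"http://{host_and_port}/jmx-console/HtmlAdaptor"
        f"?action=inspectMBean&name={encoded}"
    )


# "fms.example.com:8080" is a hypothetical address used only for illustration.
url = alarm_mbean_url("fms.example.com:8080")
```
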
Resolution 5
Options include:
- Review both the FMS and FglAM Memory allocation and Java Heap assignments (-Xmx/-Xms), and adjust per the applicable Foglight product guides.
- Restart the FMS process, or
- Reboot the FMS, or
- Fail over the FMS if HA is configured and a service outage is not possible at the time, then fail back to the expected Primary node.
Resolution 6

On very under-utilized systems, it is possible that no alarms are fired because the configured Rule/Alarm criteria or Alarm thresholds are simply not met. Also, if the agents are stopped or are unable to collect data, alarms will never fire, except for Agent Health, Credential, Availability, or Connection specific rules. Review the Agent dashboards to confirm that current data is being collected as expected.

Resolution 7

Confirm whether there is an active Blackout set for the instance or Agent.
- If yes, delete the blackout from the Global View Databases dashboard via Settings | Manage Alarm Blackouts, or go to Administration | Setup | Blackouts | Manage Agent Blackout and remove the blackout for that instance.
Resolution 8

Refer to KB 4295903 for details on counting and purging alarms.

Resolution 9

If the user receives alarms even though the alarm or rule has been disabled, confirm whether there is a second Foglight Management Server (FMS) environment.

Resolution 10

The rule has been customized (for example, an email action was added, the rule fires multiple times, or the rule was copied). Non-default text in the rule name, the alarm format, or the email text can indicate that a rule has been customized.

Compare the rule conditions to the same rule on an FMS that has not been modified and uses the same cartridge version.

Resolution 11

Validate connectivity to the host as per KB 4235896 (Oracle), 4229902 (SQL Server) and 4289409 (DB2, see Resolution #1)

Resolution 12

Set the database agent collection for the alarm metric to query the host more often. Refer to KB 4308784 for details on changing database agent collection frequencies.
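
The effect of the collection interval can be reasoned about with a simple model: an event is only seen if at least one collection runs while the event is in progress. The sketch below uses made-up times in seconds, and `event_is_observed` is a hypothetical helper, not part of any Foglight API.

```python
def event_is_observed(event_start, event_end, interval, horizon):
    """True if any collection time (0, interval, 2 * interval, ...) falls inside the event."""
    collection_times = range(0, horizon + 1, interval)
    return any(event_start <= t <= event_end for t in collection_times)


# A 90-second spike starting 310 seconds into the hour (illustrative values).
missed = event_is_observed(310, 400, interval=300, horizon=3600)
caught = event_is_observed(310, 400, interval=60, horizon=3600)
```

With a 300-second interval the collections at 300 and 600 seconds both miss the spike, while a 60-second interval samples it at 360 seconds. Shorter intervals increase load on the monitored host, so the change is a trade-off.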

Resolution 13

Check the SMTP and email configuration on the FMS server as per KB 4352966.

Resolution 14

Set the SQL Server Error Log Filter or Oracle Alert Log Filtering to fire for Warning alarms.

Resolution 15

Enable and configure the original "out of the box" rule using the database agent administration panel.

Resolution 16

Run a query similar to the database agent's collection and compare the results to the alarm message.

Resolution 17

Access the database agent's raw topology metrics using the Configuration | Data pages, choose a time range in the zonar that corresponds to when the alarm was fired, and review the actual data to identify any temporary changes in the host system (e.g. after-hours downtime, network issues, high I/O due to backups).
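
If the raw values are exported or copied out of the data pages, the same review can be narrowed to a window around the alarm time. This is a sketch under that assumption: the `samples_around` helper, the 15-minute window, and the sample values are all hypothetical.

```python
from datetime import datetime, timedelta


def samples_around(samples, alarm_time, window_minutes=15):
    """Keep (timestamp, value) pairs within window_minutes of the alarm time."""
    window = timedelta(minutes=window_minutes)
    return [(ts, value) for ts, value in samples if abs(ts - alarm_time) <= window]


# Made-up tablespace usage percentages around a 02:00 alarm.
alarm_time = datetime(2024, 12, 18, 2, 0)
samples = [
    (datetime(2024, 12, 18, 1, 30), 71.0),
    (datetime(2024, 12, 18, 1, 55), 96.0),
    (datetime(2024, 12, 18, 2, 10), 64.0),
    (datetime(2024, 12, 18, 3, 0), 65.0),
]
nearby = samples_around(samples, alarm_time)
```

Here the value drops shortly after the alarm, which is the pattern expected when a tablespace autoextends after the alarm has fired (Cause 17).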

Resolution 18

Enable the alarm in the alarm template; refer to Editing Alarm Templates (4226517) for more information on how to disable or enable the alarms.

Product(s):
Foglight for Databases: 7.3.0, 7.1.0, 6.3.0, 6.1.0, 6.0.0, 5.9.7, 5.9.5, 5.9.4, 5.9.3, 5.9.2
Foglight Evolve: 7.3.0, 7.1.0, 6.3.0, 6.1.0, 6.0.0, 9.3, 9.2, 9.1, 9.0
Foglight: 7.3.0, 7.1.0, 6.3.0, 6.1.0, 6.0.0, 5.9.8, 5.9.7, 5.9.5, 5.9.4, 5.9.3, 5.9.2, 5.7.5
Foglight Cloud: Hosted

Topic(s): Technical Solutions

Article history: Created on 9/26/2013. Last updated on 12/18/2024.
