Generic steps to troubleshoot and investigate database agent rule and alarm issues.
Database agent rules and alarms not firing.
Alarms are fired with the incorrect values.
Alarms are delayed.
Duplicate alarms are fired.
Alarm is disabled in the database agent administration panel
Threshold is too high for a rule to fire.
Email is not configured or is disabled in the agent administration panel.
Alarm sensitivity is set so that some alarm severities are not displayed.
Rule(s) are disabled in the Rules Management dashboard.
Alarms Service is stopped.
FMS lacks sufficient resources to process monitoring data or generate Alarms or Email Notifications.
No matching data has been collected to trigger an alarm.
Blackout is configured
High number of alarms in the alarms table
Email received is from a different FMS
Customized or multiple (custom) copies of the same rule.
Unsuccessful connection to host or database
Collection interval is set longer than normal, or the event occurs between agent collections
Email server settings not configured
Oracle Alert log filtering or SQL Server Error log filtering for individual and summary alarms is set to fire only Fatal alarms, or is set to OFF.
Rule is cloned from another rule and is misconfigured or cannot be managed using the Database Administration UI.
Data issues in the query used by the collection
Some issues may have been temporary, for example a long-running query, intensive use of the TEMP tablespace, or a tablespace that autoextended after the alarm fired.
Enable the alarm as per KB 103065
Refer to KB 93998 for details on changing thresholds for database monitoring agents.
* For testing purposes, temporarily set the threshold to a very low value (for example 1%) to confirm that the alarm is working and fires.
Review KB 232091 to configure the alarm sensitivity to fire the appropriate severity levels.
* For testing purposes, set the agent's alarm sensitivity to Tuning and confirm that alarms are fired.
Enable the rules in the Rules Management dashboard by navigating to Administration | Rules & Notifications | Rules
Restart the FMS process
On very under-utilized systems, it is possible that no alarms are fired because the configured rule or alarm criteria, or the alarm thresholds, are never met. Also, if the agents are stopped or are unable to collect data, alarms will never fire, except for Agent Health, Credential, Availability, or Connection specific rules. Review the Agent dashboards to confirm that current data is being collected as expected.
As a test, drill down on the instance from the Database dashboard | click on Activity | Real Time | check the Availability to see if there is a blackout alert.
Please refer to KB 80646 for more information on counting alarms in the alarms table.
Please refer to KB 66577 for more information on purging the alarms table.
If the customer still receives alarms after the alarm or rule has been disabled, confirm whether a second Foglight Management Server (FMS) environment is sending them.
The rule may have been customized (for example an added emailaction, a rule set to fire multiple times, or a copied rule). Review the rule name, the alarm format, and the email text; non-default text in any of these can indicate that the rule has been customized.
Compare the rule conditions to the same rule on an FMS system that has not been modified and uses the same cartridge version.
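The comparison above can be done by eye, but a text diff makes small changes easier to spot. The following is a minimal sketch, assuming the condition text of both rules has been copied by hand from each FMS into plain strings. The function name and the sample condition text are hypothetical and are not part of any Foglight API.

```python
import difflib


def diff_rule_conditions(baseline: str, suspect: str) -> list[str]:
    """Return unified-diff lines between two rule condition texts.

    baseline -- condition text copied from the unmodified FMS
    suspect  -- condition text copied from the FMS under investigation
    """
    return list(
        difflib.unified_diff(
            baseline.splitlines(),
            suspect.splitlines(),
            fromfile="baseline_fms",
            tofile="suspect_fms",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical condition text, for illustration only.
    baseline = "if (usage_pct > threshold) {\n    return true;\n}"
    suspect = "if (usage_pct > threshold && hour > 8) {\n    return true;\n}"
    for line in diff_rule_conditions(baseline, suspect):
        print(line)
```

An empty result means the two condition texts are identical. Any line starting with `-` or `+` marks a difference worth reviewing.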
Set the database agent collection for the alarm metric to query the host more often. Refer to KB 177914 for details on changing database agent collection frequencies.
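To illustrate why a more frequent collection can matter, the sketch below simulates a short spike in a metric and lists the collection times at which a breach would be seen. The metric values, the intervals, and the function name are made up for illustration. It does not model how the agent actually schedules collections.

```python
def sampled_breaches(series, interval, threshold):
    """Return the sample times (seconds) at which a collection sees a breach.

    series    -- metric value for each second (index = seconds since start)
    interval  -- seconds between collections
    threshold -- value above which the rule condition would be met
    """
    return [t for t in range(0, len(series), interval) if series[t] > threshold]


if __name__ == "__main__":
    # Hypothetical metric: idle at 10, with a 90-second spike to 95 starting at t=100.
    series = [95 if 100 <= t < 190 else 10 for t in range(600)]

    # Collecting every 300 seconds samples t=0 and t=300 and misses the spike.
    print(sampled_breaches(series, interval=300, threshold=90))

    # Collecting every 60 seconds samples t=120 and t=180 inside the spike.
    print(sampled_breaches(series, interval=60, threshold=90))
```

In this example the 300-second interval never observes the spike, so no alarm could fire, while the 60-second interval observes it twice.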
Check the SMTP and email configuration on the FMS server as per KB 69979.
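Independently of the FMS configuration, it can help to confirm that the mail server accepts a message sent from the FMS host. The following is a minimal sketch using Python's standard `smtplib`. It assumes an SMTP server that accepts unauthenticated, unencrypted connections. The host name, port, and addresses are placeholders, and a server that requires TLS or authentication would need additional steps.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender: str, recipient: str) -> EmailMessage:
    """Build a short plain-text test message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "FMS SMTP connectivity test"
    msg.set_content("Test message sent while troubleshooting alarm email notifications.")
    return msg


def send_test_message(host: str, port: int, msg: EmailMessage, timeout: float = 10.0) -> None:
    """Connect to the SMTP server and send the message.

    Raises an smtplib.SMTPException or OSError if the server cannot be
    reached or rejects the message.
    """
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.ehlo()
        smtp.send_message(msg)


# Example call (placeholder values; run from the FMS host):
# send_test_message("smtp.example.com", 25,
#                   build_test_message("fms@example.com", "dba@example.com"))
```

If this test succeeds but alarm emails still do not arrive, the problem is more likely in the FMS email settings or the rule's email action than in the mail server itself.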
Set the SQL Server Error Log Filter or Oracle Alert Log Filtering to fire for Warning alarms.
A fix specific to the "DBO Usability Connect Availability" rule / alarm is available in the Oracle 184.108.40.206 and higher cartridge versions. Please consult KB 259302 for more details.
Enable and configure the original "out of the box" rule using the database agent administration panel.
Run a query similar to the database agent's collection and compare the results to the alarm message.
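Once the query result is in hand, the comparison can be sketched as below. This assumes the alarm message reports a percentage and that a small difference is acceptable because the query and the collection run at different times. The message format, the tolerance, and the function names are hypothetical.

```python
import re


def value_from_alarm(message: str) -> float:
    """Extract the first percentage value from an alarm message."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", message)
    if match is None:
        raise ValueError("no percentage found in alarm message")
    return float(match.group(1))


def values_agree(alarm_value: float, query_value: float, tolerance: float = 1.0) -> bool:
    """Return True if the two values differ by no more than the tolerance."""
    return abs(alarm_value - query_value) <= tolerance


if __name__ == "__main__":
    # Hypothetical alarm text and manually obtained query results.
    alarm_value = value_from_alarm("Tablespace USERS is 97.5% full.")
    print(values_agree(alarm_value, 97.1))  # close to the alarm value
    print(values_agree(alarm_value, 62.0))  # large gap, worth investigating
```

A large gap suggests either that the condition was temporary or that the collection query returns different data than expected.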
Access the database agent's raw topology metrics using the Configuration | Data pages. Choose a time range in the zonar that corresponds to when the alarm was fired, then review the actual data to identify any temporary changes in the host system (e.g. after-hours downtime, network issues, or high I/O due to backups).