First, regarding their specific situation: it does appear to be true that if their cluster's utilization is between 80 and 83 percent, this rule will not fire for them at either severity. I believe this is a function of the number of servers in their cluster. For example, 80-83 percent utilization should fire the critical condition for a cluster with 5 servers (instead of 6).
In other words, the critical condition treats a cluster with 6 servers as having enough spare resources to handle the loss of one server when utilization is at or below 83%.
If utilization is higher than this, the critical alarm is raised for a cluster with 6 servers. This is a sliding scale: the utilization needed to trigger the critical alarm depends on the number of servers in the cluster.
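To make the sliding scale concrete, here is a rough sketch in Python. It is not the delivered rule code. It assumes the rule scales utilization by N / (N - 1) to model the loss of one server, and fires critical when utilization is at least 80% and the scaled value reaches 100%. The function names and the exact comparisons are my own assumptions, chosen because they reproduce the 5-server and 6-server behaviour described above.

```python
def adjusted_utilization(utilization_pct, server_count):
    """Utilization the cluster would see if one server were lost.

    Assumption: the load is spread evenly over the remaining servers.
    """
    if server_count < 2:
        raise ValueError("need at least 2 servers to model the loss of one")
    return utilization_pct * server_count / (server_count - 1)


def critical_fires(utilization_pct, server_count, floor_pct=80.0):
    """Hypothetical critical check: at or above the 80% floor AND the
    remaining servers could not absorb the load (adjusted value >= 100%)."""
    return (utilization_pct >= floor_pct
            and adjusted_utilization(utilization_pct, server_count) >= 100.0)


# 82% utilization: fires for 5 servers, stays quiet for 6.
print(critical_fires(82.0, 5))  # True
print(critical_fires(82.0, 6))  # False
print(critical_fires(85.0, 6))  # True
```

Under these assumptions the break-even point for 6 servers is 5/6 of capacity, roughly 83%, which matches the figure quoted above.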
As we've already established, the warning condition only applies if the utilization is below 80%. In essence, the utilization is adjusted to factor in the loss of one server in the cluster. If the adjusted utilization is above the registry variable (currently 83%), the warning alarm fires. This, too, is a sliding scale: the utilization needed to trigger the warning alarm depends on the number of servers in the cluster.
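As a rough illustration of the warning condition, the sketch below adjusts the utilization for the loss of one server and compares the result against the registry variable. This is not the delivered rule code: the N / (N - 1) adjustment, the function name, and the parameter names are my assumptions. Only the 80% ceiling and the 83% registry value come from the description above.

```python
def warning_fires(utilization_pct, server_count,
                  ceiling_pct=80.0, registry_threshold_pct=83.0):
    """Hypothetical warning check for the cluster redundancy rule.

    Assumption: utilization is adjusted by N / (N - 1), i.e. the load is
    spread evenly over the servers that remain after one is lost.
    """
    if server_count < 2:
        raise ValueError("need at least 2 servers to model the loss of one")
    if utilization_pct >= ceiling_pct:
        # At 80% or higher the warning condition does not apply.
        return False
    adjusted = utilization_pct * server_count / (server_count - 1)
    return adjusted > registry_threshold_pct


# Cluster of 6 servers:
print(warning_fires(70.0, 6))  # True  (adjusted value is 84%)
print(warning_fires(60.0, 6))  # False (adjusted value is 72%)
print(warning_fires(81.0, 6))  # False (warning does not apply at 80%+)
```

With these assumed numbers, a 6-server cluster would start raising the warning a little above 69% utilization, and a smaller cluster would start raising it sooner.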
On the second point, about making changes to the rules, I would like to offer a caution. Once a rule is edited it becomes a custom rule, and custom rules are not upgraded. When Foglight is upgraded at a later date, any changes made to the delivered rule will not be applied to the customized copy. For example, if the code is reviewed and we find a more efficient way to process this rule, these improvements would not be included when this customer upgrades.
On the final question that was raised: rules try to fire the highest-severity alarm first. If the fatal condition (if active) does not fire, the rule assesses the critical condition and then moves on to the warning condition. As soon as one of them evaluates to true, the alarm message is generated and the remaining conditions are not checked. In the case of the cluster redundancy rule, the research likely showed that CPU or memory utilization of 80% or higher is too much activity to be treated as a warning for this particular rule. If this rule is to fire an alarm at that level of utilization, it has to be a critical alarm. Whether or not the critical alarm is generated depends on where the data falls on the scale, factoring in both the utilization and the number of servers.
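The evaluation order described above can be sketched as follows. This is a generic Python illustration, not Foglight's rule engine: the function name, the list-of-pairs layout, and the use of None for an inactive condition are all hypothetical.

```python
def evaluate_rule(conditions, data):
    """Return the first severity whose condition is true, or None.

    `conditions` is ordered from highest to lowest severity. Each entry
    is a (severity, predicate) pair; a predicate of None stands for a
    condition that is not active. Evaluation stops at the first match.
    """
    for severity, predicate in conditions:
        if predicate is None:
            continue  # inactive condition, skip it
        if predicate(data):
            return severity  # lower severities are never checked
    return None


# Hypothetical conditions with made-up thresholds, for illustration only.
conditions = [
    ("fatal", None),                                    # not active
    ("critical", lambda d: d["utilization"] >= 90.0),
    ("warning", lambda d: d["utilization"] >= 70.0),
]

print(evaluate_rule(conditions, {"utilization": 95.0}))  # critical
print(evaluate_rule(conditions, {"utilization": 75.0}))  # warning
print(evaluate_rule(conditions, {"utilization": 50.0}))  # None
```

Because the first match wins, a data point that satisfies both the critical and the warning predicate only ever produces the critical alarm.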
One other thing I would like to add is that there are other rules monitoring cluster resource utilization. These check for significant changes in utilization and raise alarms accordingly. In the case of the CPU resource, this alarm is also raised if CPU utilization stays above defined thresholds for a sustained period. These rules will also help administrators assess whether the levels of activity on the resources in their clusters require attention.