IMPORTANT: If you are seeing gaps in the data but the 'Agent Restarts' metric _is_ zero, refer to SOL39409.
IMPORTANT: The error entry in the log file for this situation is very similar to the entry in the log file that is described in SOL36420. Make sure you are reading the correct article that matches the error that you are seeing in the logs.
We are seeing gaps in our data. For example, in the System Health category 'Packets Captured' metric, the chart goes completely flat for some 5-minute intervals. It shows that no data was collected at all during that period, even though we know there was actual traffic at that time. Even more worrying, the Daily metrics for the entire day on which the problem occurred are far too low.
We looked at the System Health category 'Agent Restarts' metric and see that it is non-zero. We also noticed that when these Agent Restarts occur, they seem to happen near the top of an hour ("on-the-hour").
Also, we checked the log files and we see the message "ERROR: Memory consumption has exceeded threshold" (or "ERROR: CPU consumption has exceeded threshold"):
Feb 2 15:07:56 fxmvericenter statmon: (3) ERROR: Memory consumption has exceeded threshold.
Feb 2 15:07:56 fxmvericenter statmon: (6) Memory consumption statmon 3346 0
Feb 2 15:07:56 fxmvericenter statmon: (6) Memory consumption httpd 4845 0
Feb 2 15:07:56 fxmvericenter statmon: (6) Memory consumption ecdbpushd 14049 0
Feb 2 15:07:56 fxmvericenter statmon: (6) Memory consumption ecpushd 2215 0
Feb 2 15:07:56 fxmvericenter statmon: (6) Memory consumption agent 11680849 71
Feb 2 15:07:56 fxmvericenter statmon: (6) Total Resident Memory 16441484000 50
Feb 2 15:07:56 fxmvericenter statmon: (6) Memory consumption exceeds limit - Going to restart agent
Feb 2 15:07:56 fxmvericenter statmon: (6) Skipping ProcessCheck, ProcessMgr has process locked. agent
Feb 2 15:07:56 fxmvericenter statmon: (6) Skipp
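When reading a long log, it can help to pull out the per-process "Memory consumption" entries and see which process statmon reported with the highest figure. The following is a minimal sketch, not a supported tool. It assumes the line layout shown in the excerpt above; the names parse_consumption and largest_consumer are hypothetical, and the meaning of the two trailing numbers is not documented here, so the code only compares the last field without interpreting it.

```python
import re
from typing import Iterable, NamedTuple, Optional


class Consumption(NamedTuple):
    process: str
    raw_value: int   # first number on the line (unit not documented in this article)
    last_field: int  # second number on the line (assumed comparable across processes)


# Matches only per-process entries such as
# "... statmon: (6) Memory consumption agent 11680849 71"
_LINE = re.compile(r"statmon: \(\d+\) Memory consumption (\S+) (\d+) (\d+)\s*$")


def parse_consumption(line: str) -> Optional[Consumption]:
    """Return the parsed entry, or None if the line is not a per-process entry."""
    match = _LINE.search(line)
    if match is None:
        return None
    return Consumption(match.group(1), int(match.group(2)), int(match.group(3)))


def largest_consumer(lines: Iterable[str]) -> Optional[Consumption]:
    """Return the entry with the highest last field, or None if nothing parsed."""
    rows = [c for c in map(parse_consumption, lines) if c is not None]
    return max(rows, key=lambda c: c.last_field, default=None)
```

Run against the excerpt above, this picks out the "agent" entry, which matches the process that statmon goes on to restart.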
The appliance is trying to monitor too much traffic and/or your appliance configuration is resulting in the agent maintaining too many objects (User Sessions, Pages, etc.) in memory. This is overloading the appliance; refer to SOL38831.
This specific error occurs when the System Health category 'Memory Utilization' metric exceeds 85% (or 'CPU Utilization' exceeds 85% if you are getting the CPU error). These are the memory and CPU consumption levels that the appliance's internal system process manager ("statmon") attempts to detect. If it detects that condition, it restarts a process to keep the system from locking up and crashing. It is typically the "agent" process that gets shut down because it is the biggest consumer of resources. Sometimes it is the "ecdbpushd" process that gets shut down instead. This process loads data into the database and is also a heavy consumer of resources.
During the "agent" restart, the appliance is not monitoring traffic. This results in the missing data points/gaps in your data. The data loss can also extend beyond the 5-minute interval in which the restart occurred.
This is not a graceful shutdown/restart. If you are using version 5.2.x or below, it can result in loss of data for that entire hour interval. Depending on when the restart occurred during the day, it can also drastically impact the 'Daily' metrics for that day. Refer to SOL39418 for specifics. Version 5.3 has major architecture improvements and can handle more load before this problem occurs. In addition, only the data for the 5-minute interval during which the restart occurred is at risk of being dropped. SOL39418 covers this in more detail as well.
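To relate a restart timestamp from the log to the chart interval that may show a gap, the rule above can be sketched as follows. This is an illustration only: the function name at_risk_interval is hypothetical, the version is passed as a (major, minor) tuple for convenience, and the code assumes that hour and 5-minute intervals are aligned to clock boundaries, which this article does not state. Refer to SOL39418 for the actual behavior.

```python
from datetime import datetime, timedelta
from typing import Tuple


def at_risk_interval(restart: datetime, version: Tuple[int, int]) -> Tuple[datetime, datetime]:
    """Return (start, end) of the interval whose data may be dropped.

    Assumption: intervals are aligned to the clock (hour or 5-minute marks).
    """
    if version < (5, 3):
        # 5.2.x or below: the entire hour interval is at risk.
        start = restart.replace(minute=0, second=0, microsecond=0)
        return start, start + timedelta(hours=1)
    # 5.3: only the 5-minute interval containing the restart is at risk.
    aligned_minute = restart.minute - restart.minute % 5
    start = restart.replace(minute=aligned_minute, second=0, microsecond=0)
    return start, start + timedelta(minutes=5)
```

For a restart logged at 15:07:56, this gives 15:00 to 16:00 under version 5.2.x and 15:05 to 15:10 under version 5.3.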
Note: The statmon detection doesn't always get a chance to activate, however, if there are sudden spikes in memory (or CPU) consumption. Such spikes can be caused by running a Top X Pages (or some other high-volume category) query against the database, especially if the appliance is already on the edge of being overloaded. This can bring on situations where
You need to reduce the overload burden on the appliance. Refer to SOL39381 and SOL31887 for specifics.
© 2021 Quest Software Inc. ALL RIGHTS RESERVED.