All of the agents (Infrastructure and SQL Server) running on one Foglight Agent Manager (FglAM) stopped collecting data after the SQL Server cartridge was upgraded. After the FglAM is restarted, some of the agents collect data for a short while and then stop again.
The following symptoms were noted in the FglAM thread dumps and the FglAM logs:
1. Many threads are BLOCKED at either ComThread.InitSTA or ComThread.Release. These are the points where the FglAM code calls into the Jacob library and contention occurs:
Thread: Quartz-17, id=332, priority=5, state=BLOCKED, thread group=Quartz
com.jacob.com.ComThread.InitSTA(ComThread.java:53)
...
Thread: Quartz-213, id=6530, priority=5, state=BLOCKED, thread group=Quartz
com.jacob.com.ComThread.Release(ComThread.java:125)
...
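To gauge how widespread this symptom is in a given thread dump, the blocked threads can be counted with a short script. The following is a minimal sketch, assuming the dump uses the same "Thread: ..., state=..." header format as the excerpts above and that the first line after each header is the top stack frame. The function name count_jacob_blocked is hypothetical and is not part of FglAM.

```python
import re
from collections import Counter

# Header format copied from the dump excerpts above (an assumption about
# how the rest of the dump is laid out).
THREAD_HEADER = re.compile(
    r"^Thread: (?P<name>.+?), id=(?P<id>\d+), priority=\d+, state=(?P<state>\w+)"
)
# The two Jacob entry points where blocked threads were observed.
JACOB_FRAME = re.compile(r"com\.jacob\.com\.ComThread\.(InitSTA|Release)\(")


def count_jacob_blocked(dump_text):
    """Count BLOCKED threads whose top frame is ComThread.InitSTA or ComThread.Release."""
    counts = Counter()
    state = None
    awaiting_top_frame = False
    for raw_line in dump_text.splitlines():
        line = raw_line.strip()
        header = THREAD_HEADER.match(line)
        if header:
            state = header.group("state")
            awaiting_top_frame = True
            continue
        if awaiting_top_frame and line:
            # Only the first non-empty line after a header is inspected.
            awaiting_top_frame = False
            frame = JACOB_FRAME.search(line)
            if state == "BLOCKED" and frame:
                counts[frame.group(1)] += 1
    return counts
```

A large count for either method relative to the total number of Quartz threads is consistent with the contention described here. The script only counts threads and does not identify which thread holds the lock.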
2. Many re-attempt messages for the agent lock are logged. This is because the lock takes a long time to be acquired due to contention.
VERBOSE [IncomingMessage-5] com.quest.glue.core.comms.handlers.AbstractAgentLockMessageHandler - Could not acquire an Agent Lock from Monitor@host needed before we can process the com.quest.glue.messages.schema.v1_3.SetAgentState_v1_3 message [msgId: 8b7f369e-1514-4851-9748-f3f483d03aa6]. Will re-attempt lock acquisition in 6,00ms.
3. Timeouts waiting for READ/WRITE locks:
DEBUG [IncomingMessage-4] com.quest.glue.core.comms.handlers.AbstractAgentLockMessageHandler - Timed out waiting for a lock from agent instance HostAgents/5.6.4/WindowsAgent/Monitor@> while handling message com.quest.glue.messages.schema.v1_3.SetAgentState_v1_3 [msgId: 6f438467-8561-4438-b306-fc6bfacc8ddd].
java.util.concurrent.TimeoutException: Failed to acquire the WRITE lock for agent HostAgents/5.6.4/WindowsAgent/Monitor@host because it is being held by another thread(s). One of more of those threads could be hung or deadlocked.
DEBUG [IncomingMessage-19] com.quest.glue.core.comms.handlers.AbstractAgentLockMessageHandler - Timed out waiting for a lock from agent instance HostAgents/5.6.4/WindowsAgent/Monitor@host while handling message com.quest.glue.messages.schema.v1_3.SetAgentState_v1_3 [msgId: 7d84f70e-f9a4-4c65-bb10-8e08753d030c].
java.util.concurrent.TimeoutException: Failed to acquire the READ lock for agent HostAgents/5.6.4/WindowsAgent/Monitor@host because it is being held by another thread(s). One of more of those threads could be hung or deadlocked.
4. Credential query response timeouts and agent overruns:
ECHO HostAgents/5.6.4/WindowsAgent/Monitor@host WARN [Quartz-22] com.quest.glue.core.remoteconnection.ConnectionPool - Interrupted waiting for credentials for . No credentials have been returned.
...
Caused by: java.util.concurrent.TimeoutException: Timed out waiting for Credential Query Response
WARN [Quartz-22] com.quest.glue.core.scheduler.quartz.QuartzScheduler - The HostAgents/5.6.4/WindowsAgent/Monitor@host-Collector.collect scheduled job ran longer [1,384,000 ms] than its fixed rate of 300,000 ms. 4 scheduled executions were missed.
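To see how far behind a collection has fallen, the run time and the fixed rate can be pulled out of the scheduler warning and compared. This is a minimal sketch, assuming the warning text matches the QuartzScheduler line above. The function name parse_overrun and the returned dictionary keys are hypothetical.

```python
import re

# Pattern modeled on the QuartzScheduler warning quoted above.
OVERRUN = re.compile(
    r"The (?P<job>\S+) scheduled job ran longer \[(?P<ran>[\d,]+) ms\] "
    r"than its fixed rate of (?P<rate>[\d,]+) ms\. "
    r"(?P<missed>[\d,]+) scheduled executions were missed"
)


def _to_int(text):
    """Convert a number with thousands separators, such as '1,384,000', to an int."""
    return int(text.replace(",", ""))


def parse_overrun(line):
    """Return the job name, run time, fixed rate, and missed count, or None if no match."""
    match = OVERRUN.search(line)
    if match is None:
        return None
    ran_ms = _to_int(match.group("ran"))
    rate_ms = _to_int(match.group("rate"))
    return {
        "job": match.group("job"),
        "ran_ms": ran_ms,
        "rate_ms": rate_ms,
        "missed": _to_int(match.group("missed")),
        # How many collection intervals the single run consumed.
        "intervals_consumed": ran_ms / rate_ms,
    }
```

For the warning above, the run took 1,384,000 ms (about 23 minutes) against a fixed rate of 300,000 ms (5 minutes), so one collection consumed roughly 4.6 intervals.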
CAUSE 1
Threads became blocked due to resource contention within the Jacob native COM library. The Jacob native COM library is what the agents use when establishing WMI connections (WMI Native) to a Windows host. All of the agents were using WMI connections to retrieve data from the OS of the Windows monitored hosts.
CAUSE 2
An unknown issue with the WindowsAgent or SQL Server agent requires the agent to be recreated.
CAUSE 3
The Infrastructure (IC) agent was created from the Agent Status dashboard, without any modification to the agent ASP, rather than with the Host/Infrastructure wizard.
WORKAROUND 1
If Infrastructure agents are being used to monitor Windows hosts, set up WinRM on the monitored hosts. The Infrastructure WindowsAgents attempt a WinRM connection before trying a WMI connection. SQL Server agents and Oracle agents can only use WMI when making an OS connection to a Windows host.
Also, reduce the number of agents using WMI connections from any one FglAM to 50 or fewer.
WORKAROUND 2
Delete the SQL Server and/or Windows agents listed in the FglAM log that have difficulties obtaining a READ or WRITE lock, then restart the FglAM service, and recreate the deleted agents.
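To build the list of agents to delete, the lock timeout exceptions in the FglAM log can be grouped by agent. This is a minimal sketch, assuming the exception text matches the "Failed to acquire the READ/WRITE lock" messages shown in the symptoms. The function name agents_with_lock_timeouts is hypothetical.

```python
import re
from collections import defaultdict

# Pattern modeled on the TimeoutException messages quoted in the symptoms.
LOCK_TIMEOUT = re.compile(
    r"Failed to acquire the (READ|WRITE) lock for agent (\S+) because"
)


def agents_with_lock_timeouts(log_lines):
    """Map each agent name to the sorted lock types (READ, WRITE) it timed out on."""
    found = defaultdict(set)
    for line in log_lines:
        match = LOCK_TIMEOUT.search(line)
        if match:
            lock_type, agent = match.groups()
            found[agent].add(lock_type)
    return {agent: sorted(kinds) for agent, kinds in found.items()}
```

The function accepts any iterable of lines, so an open log file handle can be passed directly. Review the resulting agent names before deleting anything, since an agent may appear only because it was waiting behind another agent's blocked thread.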
WORKAROUND 3
Delete the problem Infrastructure agent, then recreate it using the Host/Infrastructure Add Host wizard.
Related product defects and Enhancements
STATUS: Waiting for fix in a future release of Foglight for Infrastructure Cartridge.