Azure agent stops collecting data after several days
Description
The Azure agent stops collecting data, and it cannot be stopped and restarted unless the Foglight Agent Manager itself is restarted. Collection resumes after the restart, but the problem returns after several days.
The Azure agent may log connection errors when it queries the Azure REST APIs, for example:
WARN [Azure-CollectorManager-15] com.quest.agent.azure.collector.StorageAccountPerformanceCollector - Error occurred when collecting metrics with Azure monitor REST APIs for StorageAccount: storageAccount-QUEUE. HttpResponseWrapper: [responseCode: 503, message: Service Unavailable, response content: {"error":{"code":"ServerTimeout","message":"The request timed out. ... ]
When metrics stop appearing in the dashboards, the Azure agent still shows as active, but the agent log no longer contains the message that reports the collection time:
INFO [Quartz[0]-3499] com.quest.virtualization.utils.AbstractAgent - Finished to collect Data, cost 92,776 ms.
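One way to confirm this symptom is to check whether the completion message still appears in the agent log. The following is a minimal Python sketch, not a Quest-provided tool. It assumes the log lines have already been read into a list, and the function name collection_costs_ms is hypothetical. It extracts the reported duration from every "Finished to collect Data" line; if the most recent part of the log yields no values, collections have probably stopped.

```python
import re

# Matches the completion message shown above. The duration is written with
# comma thousands separators, e.g. "cost 92,776 ms".
FINISHED_RE = re.compile(r"Finished to collect Data, cost ([\d,]+) ms")


def collection_costs_ms(lines):
    """Return the duration in ms of every completed collection found in lines.

    Hypothetical helper for illustration only.
    """
    costs = []
    for line in lines:
        match = FINISHED_RE.search(line)
        if match:
            # Strip the thousands separators before converting to int.
            costs.append(int(match.group(1).replace(",", "")))
    return costs


sample = [
    "INFO [Quartz[0]-3499] com.quest.virtualization.utils.AbstractAgent - "
    "Finished to collect Data, cost 92,776 ms.",
    "an unrelated log line (hypothetical)",
]
print(collection_costs_ms(sample))  # → [92776]
```

An empty result for a recent log segment, while the agent is still listed as active, matches the behavior described above.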
The following message may appear in the agent log when an attempt is made to deactivate or activate the agent:
WARN [FglAM:IncomingMessage[4]-27380] com.quest.glue.core.agent.StateChangeRequestProcessor - Could not acquire lock needed to move agent fb8be857-fbcf-4747-973d-295a8570e581/AzureAgent/6.3.0/AzureAgent/AgentName to request state NOT_RUNNING. The agent might be in an unexpected state.
java.util.concurrent.TimeoutException: Failed to acquire the WRITE lock for agent fb8be857-fbcf-4747-973d-295a8570e581/AzureAgent/6.3.0/AzureAgent/AgentName because it is being held by another thread(s). One of more of those threads could be hung or deadlocked. This occurs when an agent performs a long running operation in a method that holds the agent lock and declines to return it to the agent manager. Agents are required to respond to InterruptedExceptions and are strongly encouraged to break long running operations into smaller pieces so that they can respond in a timely fashion
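The lock warning can be located in a log in the same way. The sketch below is illustrative only and assumes the warning keeps the wording shown in the example above. The function name find_lock_failures is hypothetical. It returns the agent identifier and the requested state for each failed state change.

```python
import re

# Matches the StateChangeRequestProcessor warning shown above and captures
# the agent identifier and the state the agent could not be moved to.
LOCK_RE = re.compile(
    r"Could not acquire lock needed to move agent (.+?) to request state (\w+)"
)


def find_lock_failures(lines):
    """Return (agent_id, requested_state) for each failed state change.

    Hypothetical helper for illustration only.
    """
    failures = []
    for line in lines:
        match = LOCK_RE.search(line)
        if match:
            failures.append((match.group(1), match.group(2)))
    return failures


sample_lock = [
    "WARN [FglAM:IncomingMessage[4]-27380] "
    "com.quest.glue.core.agent.StateChangeRequestProcessor - Could not acquire "
    "lock needed to move agent "
    "fb8be857-fbcf-4747-973d-295a8570e581/AzureAgent/6.3.0/AzureAgent/AgentName "
    "to request state NOT_RUNNING. The agent might be in an unexpected state.",
]
for agent_id, state in find_lock_failures(sample_lock):
    print(agent_id, state)
```

A match for the Azure agent, together with the missing collection messages, indicates that a collection thread is still holding the agent lock.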