From time to time agent data is missing, with no regular pattern. Sometimes data for agent A is missing, sometimes data for agent B, and the times at which the gaps occur also vary.
This happens when several SQL Server instances are monitored and all of the agents use the same DB user and password to connect. In that case only one DB connection is created at a time, and it is rotated through the agents. In other words, there is one connection for all agents (if they use the same user and password).
The connection workflow:
Foglight opens only one connection at a time to prevent the credentials from being locked. If for some reason one agent cannot connect, it blocks all the other agents from running their collections. That is why some of the data exists and some does not.
If one agent has problems connecting to the DB (for example, because the DB or the network is slow), it delays the collections of all agents. In the worst case there is no data at all. On the agent status dashboard all agents are displayed as up and running; none of them appears broken. Even the log files do not show that an agent is waiting for another agent to finish its collection. This is only visible in the FglAM thread dumps.
How to investigate:
Create one or more FglAM thread dumps with the command "fglam -t".
In the dump file, search for the string "ConnectionManagementThread".
Below is an example from one of the files:
Thread: ConnectionManagementThread-[MSSQLPool-DBSS-MyHost-MySQLServer]-[Wed Feb 20 15:30:26 CET 2013][MSSQLProfile{host='MyHost', instance='MySQLServer', username='MyDomain\foglight', authType=WINDOWS_CUSTOM, port='0' useNTLMv2='true' socketTimeout='900' secureConnection='REQUIRE' }], id=1428563, priority=5, state=RUNNABLE, thread group=RestoreAgent
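When a dump contains many threads, the search can be scripted. The following is a minimal sketch, assuming the thread entries follow the layout of the example line above (pool name in the first pair of square brackets, connection time in the second). The function name find_connection_threads and the returned field names are illustrative and not part of Foglight.

```python
import re

# Layout assumed from the example line:
#   ConnectionManagementThread-[<pool>]-[<connection time>][...], ..., state=<STATE>, ...
ENTRY = re.compile(
    r"ConnectionManagementThread-\[(?P<pool>[^\]]+)\]-\[(?P<since>[^\]]+)\]"
)
STATE = re.compile(r"\bstate=(?P<state>\w+)")


def find_connection_threads(dump_text):
    """Return one dict per ConnectionManagementThread line found in the dump text."""
    entries = []
    for line in dump_text.splitlines():
        match = ENTRY.search(line)
        if match is None:
            continue  # not a connection management thread
        state = STATE.search(line)
        entries.append({
            "pool": match.group("pool"),    # e.g. MSSQLPool-DBSS-<host>-<instance>
            "since": match.group("since"),  # connection time as written in the thread name
            "state": state.group("state") if state else None,
        })
    return entries


# Shortened copy of the example line, used here only as sample input.
sample = (
    r"Thread: ConnectionManagementThread-[MSSQLPool-DBSS-MyHost-MySQLServer]-"
    r"[Wed Feb 20 15:30:26 CET 2013][MSSQLProfile{host='MyHost'}], "
    r"id=1428563, priority=5, state=RUNNABLE, thread group=RestoreAgent"
)
for entry in find_connection_threads(sample):
    print(entry["pool"], "|", entry["since"], "|", entry["state"])
```

To use it on a real dump, read the file content into a string and pass it to find_connection_threads.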
Look for the difference between the connection time written in the thread name and the time the dump file was generated, which can be seen in the file name itself. In our example it was “Stack Dump generated 2013-02-20T16-09-05”. That means the connection for the MyHost-MySQLServer agent had been hanging since 15:30:26, which is about 39 minutes (38 min 39 sec).
In general, a connection that has been hanging for more than 15 seconds indicates a possible problem.
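The time comparison can also be done with a short script. This is a sketch only: it assumes the two timestamps have the formats shown in the example and that both are in the same local time zone. The helper names and the constant HANG_THRESHOLD_SECONDS are illustrative; the value 15 is taken from the rule of thumb above.

```python
import re
from datetime import datetime

HANG_THRESHOLD_SECONDS = 15  # rule of thumb from this article

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Matches e.g. "Feb 20 15:30:26 CET 2013"; the time zone token is skipped, not interpreted.
THREAD_STAMP = re.compile(
    r"(?P<mon>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) "
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}) \S+ (?P<year>\d{4})"
)


def parse_thread_stamp(text):
    """Parse the connection time from a thread name, e.g. 'Wed Feb 20 15:30:26 CET 2013'."""
    match = THREAD_STAMP.search(text)
    if match is None:
        raise ValueError("unrecognized thread timestamp: %r" % text)
    return datetime(int(match["year"]), MONTHS[match["mon"]], int(match["day"]),
                    int(match["h"]), int(match["m"]), int(match["s"]))


def parse_dump_stamp(text):
    """Parse the dump generation time, e.g. '2013-02-20T16-09-05'."""
    return datetime.strptime(text, "%Y-%m-%dT%H-%M-%S")


def hang_seconds(thread_stamp, dump_stamp):
    """Seconds between the connection time and the dump generation time."""
    delta = parse_dump_stamp(dump_stamp) - parse_thread_stamp(thread_stamp)
    return int(delta.total_seconds())


def is_suspicious(thread_stamp, dump_stamp):
    """True if the connection has been open longer than the threshold."""
    return hang_seconds(thread_stamp, dump_stamp) > HANG_THRESHOLD_SECONDS


seconds = hang_seconds("Wed Feb 20 15:30:26 CET 2013", "2013-02-20T16-09-05")
print(seconds, "seconds, suspicious:",
      is_suspicious("Wed Feb 20 15:30:26 CET 2013", "2013-02-20T16-09-05"))
```

For the example values this gives 2319 seconds, which is 38 minutes and 39 seconds and well above the threshold.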