Our APM suddenly stopped sending new data to the Archiver. The transfer queue (also known as the upload queue) on the Sniffer appliance is 100% full.
Both the Archiver log and the EndUser log report problems with the configuration, with errors like this:
Line 89,866: Error generating config response: java.io.FileNotFoundException: /quest/foglight/FMS/eudata/euconfigdata/configuration-resources/4d621305-60ea-41f2-a526-28ca4b4ee0be.368.o
And in the Relayer log from the Sniffer, we are seeing entries like this:
Mar 23 14:20:00 hostname relayer: (6) UploadMgr --> HitsCaptured:0, PagesCaptured:0, Discarded:125122, QFullHits:101719, QFullPages:23403, ShmFull:0, LadTO:0, MaxSize:0, CapUplErrs:0
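To spot this condition across many Relayer log entries, the UploadMgr counters can be pulled out of each line and checked. The following is a minimal Python sketch, not a supported tool. It assumes the counters always appear as `Name:number` pairs after `UploadMgr -->`, as in the entry above. The reading of the counters (QFullHits and QFullPages as data dropped because the queue was full) is inferred from their names and from the fact that they add up to Discarded in this sample. The function names are hypothetical.

```python
import re

# Sample copied from the Relayer log entry shown above.
SAMPLE = (
    "Mar 23 14:20:00 hostname relayer: (6) UploadMgr --> HitsCaptured:0, "
    "PagesCaptured:0, Discarded:125122, QFullHits:101719, QFullPages:23403, "
    "ShmFull:0, LadTO:0, MaxSize:0, CapUplErrs:0"
)


def parse_upload_mgr(line):
    """Return the UploadMgr counters as a dict of ints, or None if the
    line is not an UploadMgr entry."""
    _, marker, tail = line.partition("UploadMgr -->")
    if not marker:
        return None
    # Only the text after the marker is scanned, so the timestamp
    # (14:20:00) is never mistaken for a counter.
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", tail)}


def queue_full_stall(counters):
    """Assumed heuristic: nothing captured while data is being dropped
    because the upload queue is full."""
    captured = counters.get("HitsCaptured", 0) + counters.get("PagesCaptured", 0)
    queue_full = counters.get("QFullHits", 0) + counters.get("QFullPages", 0)
    return captured == 0 and queue_full > 0


counters = parse_upload_mgr(SAMPLE)
print(counters["Discarded"], queue_full_stall(counters))
```

In the sample entry, QFullHits plus QFullPages (101719 + 23403) equals Discarded (125122), so everything discarded in that interval was lost to the full queue.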
This is caused by defect APMDATA-1978 in the EndUser cartridge, which may manifest when an FMS receives the call to become the primary in an HA failover. Restarting the EndUser cartridge is part of that process. If more than one EndUser cartridge is present on the FMS, this bug can sometimes cause the wrong cartridge to be enabled during the restart, and that in turn breaks the configuration.
More detail: on HA failover, the system may restart an old 5.9.3 version of the EndUser cartridge instead of 5.9.7 (restarting the EndUser cartridge is part of the HA failover process). Before that restart is even complete, the issue that triggered the failover can clear, so this FMS becomes the secondary again and the original primary FMS takes back over. The secondary then reports to the primary that the EndUser cartridge version changed, and a message is propagated to the primary FMS to roll the EndUser cartridge back to 5.9.3. That is when the configuration breaks: once back on 5.9.3, the system does not know how to handle a 5.9.7 configuration.
The logs are flooded with these FileNotFoundExceptions because the system keeps trying to download and build the configuration for the Archiver, and it cannot build it successfully because of the problem in the configuration. The Archiver throws exceptions, and at that point data capture stops.
In short, the broken configuration causes the Archiver's configuration download to fail, which stops the Archiver from processing data, which causes the upload queues to fill up.
This defect can only occur on an FMS that has more than one "EndUser" cartridge installed in the cartridge inventory and is set up in an HA environment.
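The first precondition can be checked once the cartridge inventory has been copied out as a list of name and version pairs. The sketch below is illustrative only: the inventory list is hypothetical sample data (the "SomeOther" entry is invented), how the list is obtained from the FMS is not covered here, and `duplicate_cartridges` is a made-up helper name.

```python
from collections import defaultdict


def duplicate_cartridges(inventory):
    """Return {cartridge name: sorted versions} for every cartridge that
    appears in the inventory with more than one version."""
    versions = defaultdict(set)
    for name, version in inventory:
        versions[name].add(version)
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}


# Hypothetical inventory, entered by hand for illustration.
inventory = [
    ("EndUser", "5.9.3"),
    ("EndUser", "5.9.7"),
    ("SomeOther", "1.0.0"),
]

exposed = duplicate_cartridges(inventory)
print(exposed)
```

If "EndUser" shows up in the result and the FMS is part of an HA pair, the environment meets both conditions for this defect.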
Fixed in April 2015 APM patch 220.127.116.11, https://support.quest.com/foglight/kb/150492. APMDATA-1978: Fixed an issue that was causing the EndUser cartridges to be downgraded when an HA fail-over event occurred.