fms.exe was consuming 99% CPU, the Admin Console became very slow, and many dashboards were timing out. We tried to create a support bundle, but it took too long to complete. We had to stop and restart FMS; it is now running at 70% CPU.
Searching for service=DataCacheEviction in the Diagnostic Snapshot shows many metrics being held in memory.
Many thousands of metrics held in memory for long periods of time prevent that data from being garbage collected (GC'ed).
Most are coming from the DB_Oracle Cartridge.
Most of the data service threads are aggregating data in memory. This can affect CPU-intensive processes and is likely to cause CPU spikes.
Further analysis revealed that the following retention policies are not optimal: they instruct the server to keep too much data in memory.
44da4e5f-435e-4c91-9c11-381e4cfd5648:DBO_File_Avg_Write_Time_Ms - age:259200000 granularity:300000 cached duration:259500000 num values:288 delay:29797
a8e6baf4-4a9b-4ff9-9d0f-4679d525aac2:file_min_single_io_time - age:259200000 granularity:300000 cached duration:259500000 num values:288 delay:14848
c33c853c-9077-4415-baa6-9ac563d4e733:DBO_File_Avg_Write_Time_Ms - age:259200000 granularity:300000 cached duration:259500000 num values:288 delay:16614
0eb5a5e3-82cd-4307-aa8a-965e47bfed8f:file_last_io_time - age:259200000 granularity:300000 cached duration:259500000 num values:288 delay:7721
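To see which metrics dominate the cache, entries like the ones above can be tallied with a short script. This is only a minimal sketch: it assumes every DataCacheEviction entry in the Diagnostic Snapshot follows exactly the line format shown above, and the names parse_entries, long_lived_metrics and SAMPLE are hypothetical helpers, not part of Foglight.

```python
import re
from collections import Counter

# Assumed entry format, taken from the lines quoted above:
# <object id>:<metric> - age:<ms> granularity:<ms> cached duration:<ms> num values:<n> delay:<ms>
ENTRY = re.compile(
    r"^(?P<object_id>[0-9a-f-]{36}):(?P<metric>\S+) - "
    r"age:(?P<age>\d+) granularity:(?P<granularity>\d+) "
    r"cached duration:(?P<cached_duration>\d+) "
    r"num values:(?P<num_values>\d+) delay:(?P<delay>\d+)$"
)

TEXT_FIELDS = ("object_id", "metric")


def parse_entries(text):
    """Return one dict per line that matches the assumed entry format."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match is None:
            continue  # skip anything that is not a cache entry
        fields = match.groupdict()
        entries.append(
            {k: (v if k in TEXT_FIELDS else int(v)) for k, v in fields.items()}
        )
    return entries


def long_lived_metrics(entries, max_age_ms):
    """Count entries per metric name whose age exceeds max_age_ms."""
    return Counter(e["metric"] for e in entries if e["age"] > max_age_ms)


# The four entries quoted in this note, used as sample input.
SAMPLE = """\
44da4e5f-435e-4c91-9c11-381e4cfd5648:DBO_File_Avg_Write_Time_Ms - age:259200000 granularity:300000 cached duration:259500000 num values:288 delay:29797
a8e6baf4-4a9b-4ff9-9d0f-4679d525aac2:file_min_single_io_time - age:259200000 granularity:300000 cached duration:259500000 num values:288 delay:14848
c33c853c-9077-4415-baa6-9ac563d4e733:DBO_File_Avg_Write_Time_Ms - age:259200000 granularity:300000 cached duration:259500000 num values:288 delay:16614
0eb5a5e3-82cd-4307-aa8a-965e47bfed8f:file_last_io_time - age:259200000 granularity:300000 cached duration:259500000 num values:288 delay:7721
"""

entries = parse_entries(SAMPLE)
# Count metrics cached for longer than 15 minutes (900000 ms).
counts = long_lived_metrics(entries, max_age_ms=900_000)
for metric, count in counts.most_common():
    print(metric, count)
```

Run against the full snapshot text instead of SAMPLE, the same tally would show which metric names account for most of the long-lived cache entries.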
To alleviate this, please import the attached "modified-lifecycles.xml" file using the following command after copying it into the FMS_HOME/bin directory:
$FMS_HOME/bin: fglcmd -usr foglight -pwd foglight -cmd util:configimport -f modified-lifecycles.xml
This file changes the retention policy for the responsible collection, DBO_Datafile_IO_Activity, from 3 days to 15 minutes.
Data storage will then be less onerous on the server, reducing the probability of future CPU spikes.
Also attached is a screenshot showing the result of running the import. You may want to check the policy before running the import; it should indicate 3 days in the first row of the Age column. To verify the change after the import:
1) Navigate to Administration -> Data -> Manage Retention Policies
2) Filter by the DBO cartridge
3) Expand the DBO_Datafile_IO_Activity policy
4) Check the second row in the Age column.
The value should be 15 minutes (900 000 milliseconds). Prior to the import, it was set to 3 days (259 200 000 milliseconds).
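The millisecond values quoted above can be double-checked with a few lines of Python. This is only an illustrative sketch; the constant and function names are made up for this note.

```python
from datetime import timedelta

OLD_AGE_MS = 259_200_000  # Age before the import
NEW_AGE_MS = 900_000      # Age after the import


def ms_to_timedelta(ms):
    """Convert a retention age in milliseconds to a timedelta."""
    return timedelta(milliseconds=ms)


print(ms_to_timedelta(OLD_AGE_MS))  # 3 days, 0:00:00
print(ms_to_timedelta(NEW_AGE_MS))  # 0:15:00
```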
If you need help implementing this, please let us know and we can set up a WebEx.
Finally, FYI, the latest cartridges include related performance fixes.