Is it true that version 5.2 has data integrity problems after a non-graceful system shutdown or restart, and that version 5.3 includes major architectural improvements to the way metric data is collected, aggregated, and written out to the database across different time intervals? It was also reported that these improvements prevent data from being lost in a crash, improve appliance performance, and reduce load.
In version 5.2.x and lower, you may lose 5-minute data for more than just the 5-minute interval during which the restart occurred. That is because 5-minute data is written out to the database only once an hour, so anything collected since the last write may be lost. Data for the other time intervals is also at risk, because it may not yet have been "rolled up" into 'Hourly' and 'Daily' data and written out to the database. Daily data is written out only once a day and is held in memory in the meantime. As a result, this problem can greatly affect the 'Daily' data for the day on which it occurred. The later in the day the problem occurred, the bigger the impact on the metrics. For example, if the problem occurred early in the day, many Hour intervals are still to come that day and will be rolled up into the Daily data, so fewer hours will be missing.
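The effect of crash time on Daily data can be illustrated with some simple arithmetic. The sketch below is not the appliance's actual logic; it assumes a simplified model in which every hour up to and including the hour of the crash is missing from that day's Daily roll-up, and only the hours after the restart are rolled up. The function names are hypothetical.

```python
HOURS_PER_DAY = 24


def missing_daily_hours(crash_hour: int) -> int:
    """Hours missing from the Daily roll-up for a crash during crash_hour (0-23).

    Simplified assumption: the in-memory Daily data and the unwritten
    current hour are both lost, so hours 0..crash_hour are missing.
    """
    if not 0 <= crash_hour < HOURS_PER_DAY:
        raise ValueError("crash_hour must be between 0 and 23")
    return crash_hour + 1


def daily_coverage(crash_hour: int) -> float:
    """Fraction of the day still represented in the Daily roll-up."""
    return (HOURS_PER_DAY - missing_daily_hours(crash_hour)) / HOURS_PER_DAY


if __name__ == "__main__":
    for hour in (2, 12, 21):
        print(hour, missing_daily_hours(hour), round(daily_coverage(hour), 3))
```

Under this model, a crash during hour 2 leaves 21 of 24 hours in the Daily data, while a crash during hour 21 leaves only 2.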
Because the agent writes out data only once an hour, the data collector writes out a huge amount of data near each hourly boundary (top of the hour). When the appliance is already close to being overloaded, this write can take several minutes. While those writes to disk are in progress, events pile up waiting to be handled and deleted, which aggravates the overload condition and causes huge spikes in Memory Utilization. This can push an already borderline-overloaded appliance over the edge.
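To give a rough sense of why infrequent writes lead to large bursts, the sketch below counts how many samples sit in memory before each flush. It is an illustration only: the function name, the per-interval sample count, and the idea that samples are held until the flush are assumptions, not details of the appliance's implementation.

```python
def peak_buffered_samples(flush_every: int,
                          samples_per_interval: int,
                          intervals: int = 288) -> int:
    """Largest number of samples held in memory at any one time.

    flush_every: number of 5-minute intervals between flushes
                 (12 models an hourly flush, 1 models a flush every interval).
    intervals:   number of 5-minute intervals simulated (288 is one day).
    """
    if flush_every < 1:
        raise ValueError("flush_every must be at least 1")
    buffered = 0
    peak = 0
    for i in range(1, intervals + 1):
        buffered += samples_per_interval   # new samples arrive
        peak = max(peak, buffered)         # this is what a flush must write
        if i % flush_every == 0:
            buffered = 0                   # flush empties the buffer
    return peak


if __name__ == "__main__":
    print(peak_buffered_samples(12, 10_000))  # hourly flush
    print(peak_buffered_samples(1, 10_000))   # flush every 5 minutes
```

With the assumed 10,000 samples per interval, an hourly flush has to write twelve times as much at once as a flush every 5 minutes.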
In version 5.3 and higher, the appliance's architecture is different. All the hourly, daily, weekly, etc. roll-ups are broken out from the agent into an entirely separate process. In other words, the data collector is no longer responsible for the hourly, daily, and monthly integration. That work is done by a second agent, and data is now flushed more frequently. This greatly reduces the memory requirements and therefore the risk of overload. Further, this second, separate agent writes the data out to flat files in the meantime, before the roll-ups are done. Even if that second agent restarts or crashes, only the data in a single 5-minute interval is affected, so the effect on Daily data will be very small. This improved architecture also offloads the I/O overhead from the data collector, so it no longer spends minutes writing out data. Writing out data takes at most 10-15 seconds under the heaviest loads.
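The general idea of persisting each 5-minute interval to a flat file and rolling up later can be sketched as follows. This is a minimal illustration under stated assumptions, not the appliance's actual code: the file layout (one JSON file per interval), the function names, and the sample values are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict

INTERVAL_SECONDS = 300  # one 5-minute interval


def flush_interval(spool_dir: Path, interval_start: int, value: float) -> Path:
    """Persist one 5-minute sample as its own flat file (hypothetical layout)."""
    final_path = spool_dir / f"{interval_start}.json"
    tmp_path = spool_dir / f"{interval_start}.json.tmp"
    with open(tmp_path, "w") as fh:
        json.dump({"start": interval_start, "value": value}, fh)
        fh.flush()
        os.fsync(fh.fileno())          # make sure the data reaches disk
    os.replace(tmp_path, final_path)   # atomic rename: no half-written files
    return final_path


def roll_up(spool_dir: Path, bucket_seconds: int) -> Dict[int, float]:
    """Sum the persisted samples into buckets (3600 = hourly, 86400 = daily)."""
    totals: Dict[int, float] = {}
    for path in sorted(spool_dir.glob("*.json")):
        record = json.loads(path.read_text())
        bucket = record["start"] - record["start"] % bucket_seconds
        totals[bucket] = totals.get(bucket, 0.0) + record["value"]
    return totals


def demo():
    """Simulate one day in which a crash loses a single 5-minute interval."""
    with tempfile.TemporaryDirectory() as tmp:
        spool = Path(tmp)
        for i in range(288):
            if i == 100:
                continue  # crash: this one interval never reaches disk
            flush_interval(spool, i * INTERVAL_SECONDS, 1.0)
        return roll_up(spool, 3600), roll_up(spool, 86400)


if __name__ == "__main__":
    hourly, daily = demo()
    print(hourly[28800], daily[0])
```

Because every interval is already on disk before any roll-up runs, the simulated crash costs one sample: the affected hour sums 11 of 12 intervals and the day sums 287 of 288.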