SmartDisk appears to "crash" or go offline and needs to be restarted. The failure is characterised by the following messages appearing near the end of the monitortrace file:
Slave has exited unexpectedly with status 0
Slave seems to have exited unexpectedly!
These are accompanied by corresponding messages in the PercolatorSlave trace file indicating that the monitor exited at the same time.
This pair of messages across the two processes indicates a timeout scenario, rather than either process actually exiting.
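As a rough aid for triage, the monitor-side half of this signature can be checked with a few lines of Python. This is only a sketch: the function name and the 200-line tail are illustrative assumptions, the trace file is assumed to be plain text, and the matching PercolatorSlave messages still have to be confirmed by hand because their exact wording is not given here.

```python
# Message fragments quoted from the monitor trace in this article.
MONITOR_MARKERS = (
    "Slave has exited unexpectedly with status",
    "Slave seems to have exited unexpectedly!",
)


def has_monitor_side_signature(lines, tail=200):
    """Return True if both monitor messages appear in the last `tail` lines.

    `lines` is any iterable of strings, for example an open trace file.
    The tail length of 200 is an arbitrary illustrative choice.
    """
    recent = list(lines)[-tail:]
    return all(
        any(marker in line for line in recent) for marker in MONITOR_MARKERS
    )
```

A match only shows that the monitor reported the slave as gone; it does not by itself distinguish a timeout from a genuine exit.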
NVSD regularly checks that all of its processes are running. If the check reports that a process has died, NVSD shuts down all the other processes. The check also has a timeout of 120 seconds: if it does not complete within that time, NVSD shuts down all processes.
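The decision described above can be pictured as follows. This is a simplified model for illustration only, not NVSD's actual code; the function name and return strings are made up, and only the 120-second figure comes from this article.

```python
CHECK_TIMEOUT_SECONDS = 120  # timeout stated in this article


def check_outcome(all_running, elapsed_seconds, timeout=CHECK_TIMEOUT_SECONDS):
    """Model the result of one liveness check (hypothetical helper).

    all_running     -- True if the check found every process alive
    elapsed_seconds -- how long the check took to complete
    """
    if elapsed_seconds >= timeout:
        # The check did not finish in time, so its answer is not trusted.
        return "shutdown: check timed out"
    if not all_running:
        return "shutdown: a process has died"
    return "ok"
```

Both failure paths end in a shutdown, which is why a timeout and a genuine process exit look the same from the outside.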
A timeout could be due to a failure of the communications channel between the PercolatorMonitor and PercolatorSlave processes, or it could be caused by external system conditions blocking the SmartDisk processes for a long time. One example is high system load, particularly when there are a large number of items on the dedupe queue or a large backlog of retirement requests in the NVBU database.
If this is purely an IPC timeout, it may be possible to work around the problem temporarily by adding the following lines to $IDP_ROOT/foundation/etc/percolator.cfg and restarting SmartDisk:
[CheckExitedProcess]
MaxFails=N
This will cause the PercolatorSlave to survive N consecutive timeouts before shutting down. N may be set in the range 1 to 20; the default is 1. Start with a low value such as 5, because a higher value could delay shutdown in conditions where NetVault really should shut down.
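The effect of MaxFails can be illustrated with a small simulation. This is a sketch under one stated assumption: that shutdown is triggered on the Nth consecutive timeout and that any successful check resets the count. That reading fits the default of 1 producing a shutdown on the first timeout, but the exact counting rule is not confirmed here. The function name is hypothetical.

```python
def check_that_triggers_shutdown(results, max_fails=1):
    """Return the 1-based index of the check that causes shutdown, or None.

    results   -- iterable of booleans, True meaning the check completed in time
    max_fails -- the MaxFails value (1 to 20, default 1, per this article)

    Assumption: shutdown happens on the max_fails-th consecutive timeout,
    and a successful check resets the counter.
    """
    if not 1 <= max_fails <= 20:
        raise ValueError("MaxFails must be in the range 1 to 20")
    consecutive = 0
    for index, completed in enumerate(results, start=1):
        consecutive = 0 if completed else consecutive + 1
        if consecutive >= max_fails:
            return index
    return None
```

Under this model, a brief run of three timeouts followed by a successful check shuts SmartDisk down with the default setting but is ridden out with MaxFails=5, while a persistent failure still causes a shutdown, only later.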
If it is unclear whether a crash is a timeout or an exited process, escalate to R&D for guidance.