Several duplication/datacopy jobs that use the "optimized replication" feature to copy deduplicated data from one SmartDisk instance to another hang for several days and never finish.
Some datacopies appear to have copied all required data but never complete.
When examining the nvdatacopy process trace with the NetVault NVPVIEW tool, you can see the following loop:
------------------
6 PLGDAV ??? 692 12 0 172639287752 Trying to read DAV header from socket...
0 NET ??? 692 84 0 172639287752 Connection down
0 PLGDAV ??? 692 10 0 172639287752 Error reading header from DAV server
0 PLGDAV ??? 692 1928 0 172639287752 Timed out reading response header while requesting stats
----------------------
This is due to Bug 23532. When several optimized replications run at the same time, they may hang indefinitely because SmartDisk is too busy to respond to NVBU's stats requests. SmartDisk treats stats requests as low priority, so other tasks such as deduplication and read/write/delete operations take precedence.
However, the NVBU datacopy plugin relies on receiving stats from SmartDisk to monitor and complete a copy job. Because SmartDisk does not respond, the datacopy jobs never finish.
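The starvation effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not SmartDisk's actual scheduler: the priority values, the one-request-per-tick server, and the function name `simulate` are all hypothetical and chosen purely for illustration.

```python
import heapq

# Hypothetical priority levels for this toy model (lower number is served first).
PRIORITY_IO = 0     # stands in for deduplication and read/write/delete work
PRIORITY_STATS = 1  # stands in for stats requests


def simulate(ticks, io_arrivals_per_tick, stats_timeout_ticks=5):
    """Serve one request per tick from a priority queue.

    Returns (answered, unanswered) counts for stats requests, where
    "unanswered" covers requests served too late or never served.
    """
    queue = []  # heap of (priority, sequence number, tick when enqueued)
    seq = 0
    answered = 0
    late = 0
    for tick in range(ticks):
        # Higher-priority work arrives every tick.
        for _ in range(io_arrivals_per_tick):
            heapq.heappush(queue, (PRIORITY_IO, seq, tick))
            seq += 1
        # The client sends one stats request per tick.
        heapq.heappush(queue, (PRIORITY_STATS, seq, tick))
        seq += 1
        # The server handles exactly one request per tick.
        priority, _, enqueued = heapq.heappop(queue)
        if priority == PRIORITY_STATS:
            if tick - enqueued <= stats_timeout_ticks:
                answered += 1
            else:
                late += 1
    # Stats requests still queued at the end were never served.
    pending = sum(1 for p, _, _ in queue if p == PRIORITY_STATS)
    return answered, late + pending


if __name__ == "__main__":
    print("busy server:", simulate(100, io_arrivals_per_tick=1))
    print("idle server:", simulate(100, io_arrivals_per_tick=0))
```

In this model, as soon as higher-priority work arrives at least as fast as the server can handle it, no stats request is ever answered, which mirrors the repeated stats timeouts seen in the trace.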
The issue is twofold: SmartDisk should give stats requests a higher priority, and the NVBU datacopy plugin should not rely so heavily on stats requests.
SmartDisk 2.0.1 introduces better handling of stats, and NVBU 10 will address the nvdatacopy behaviour.
When several datacopy/duplication jobs that use the "optimized replication" feature are hung, perform the following steps:
1) On the NVBU server, use the NVPVIEW tool from netvault/bin to examine what the nvdatacopy processes are doing. The following repeating message should confirm that this is Bug 23532:
"Timed out reading response header while requesting stats"
2) Abort all running optimized replication/datacopy jobs, and abort any pending jobs to prevent them from starting.
3) Aborting the hung jobs is not enough, because the corresponding nvdatacopy processes must also be killed. Use NVPVIEW's "Kill process (Soft/Hard)" option to kill these processes.
4) After making sure that no jobs are running against either SmartDisk instance and that neither instance is performing GC or deduplication, upgrade both SmartDisk instances to 2.0.1. This version introduces better stats handling, which should reduce recurrence of the issue.
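The check in step 1 can be sketched as a small script, assuming the nvdatacopy trace text has already been captured as a list of lines by whatever means is available. The function names and the `min_repeats` threshold are hypothetical and are not part of NVPVIEW or NVBU.

```python
# Message quoted in step 1 as the signature of Bug 23532.
SIGNATURE = "Timed out reading response header while requesting stats"


def count_stats_timeouts(trace_lines):
    """Count the trace lines that contain the stats timeout message."""
    return sum(1 for line in trace_lines if SIGNATURE in line)


def looks_like_stats_hang(trace_lines, min_repeats=3):
    """Return True if the message repeats at least min_repeats times.

    min_repeats is an arbitrary illustrative threshold, not a vendor value.
    """
    return count_stats_timeouts(trace_lines) >= min_repeats


if __name__ == "__main__":
    # One iteration of the loop shown in the trace above, repeated three times.
    one_loop = [
        "6 PLGDAV ??? 692 12 0 172639287752 Trying to read DAV header from socket...",
        "0 NET ??? 692 84 0 172639287752 Connection down",
        "0 PLGDAV ??? 692 10 0 172639287752 Error reading header from DAV server",
        "0 PLGDAV ??? 692 1928 0 172639287752 Timed out reading response header while requesting stats",
    ]
    print(looks_like_stats_hang(one_loop * 3))
```

A single occurrence of the message may be only a transient timeout, which is why the sketch looks for repetition before flagging the process.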