Multiple backups on different nodes fail at the same time while writing to the DR Series.
The binary log shows Write operation failed 'Service is down', and the More Info details for that entry show RDA Library error code: 39.
“ROFS error 39” seen in the trace indicates that the DR service has gone down or that the service is unavailable.
A Wireshark capture taken at the time of the backup shows a TCP reset sent by the client that was writing to the DR, as seen below.
Extract from binary log:
Error 2016/05/11 15:34:46 153 Data Plugin E13PWG06 Write operation failed 'Service is down'
RDA Library error code: 39
Error 2016/05/11 15:34:46 153 Data Plugin E13PWG06 Stream has gone down
Extract from exchange plugin trace:
6 PLGLIB ??? 2408 75 0 84838192121 Have 524288 normal data to dispatch
6 DRLAYER ??? 2408 46 0 84838192121 RDAFileSysWrite(0000000001841C90, 00000000051F0098, 524288, 7012876288, 0000000000E1E820, 0000000000E1E828)
6 DRLAYER ??? 2408 47 0 84838192121 Writting rofs_fh_t 0000000002BE0D40
6 DRLAYER ??? 2408 48 0 84840879612 Got return code 39
6 DRLAYER ??? 2408 49 0 84840879612 Bytes written 0
2 DRLAYER ??? 2408 178 0 84840879612 Mapped ROFS error 39 to Foreign RAS error 2
0 DRLAYER ??? 2408 4 0 84840879612 Write operation failed 'Service is down' - return code 39
0 PLGLIB ??? 2408 2049 0 84840879612 Failed to write 524288 bytes to RAS local FS output file
0 PLGLIB ??? 2408 2249 0 84840879612 Foreign RAS error = 2
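When a trace is large, it can help to search for the error mapping line rather than read it by eye. The following is a minimal sketch, assuming the trace has been saved as plain text in the same layout as the extract above. The function name find_rofs_errors is hypothetical and not part of any NetVault tool.

```python
import re

# Matches the mapping line shown in the extract, e.g.
# "Mapped ROFS error 39 to Foreign RAS error 2".
MAPPED_RE = re.compile(r"Mapped ROFS error (\d+) to Foreign RAS error (\d+)")


def find_rofs_errors(lines):
    """Return (line_number, rofs_code, ras_code) for each mapping line found."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = MAPPED_RE.search(line)
        if match:
            hits.append((number, int(match.group(1)), int(match.group(2))))
    return hits


# Three lines copied from the trace extract above.
sample = [
    "6 DRLAYER ??? 2408 48 0 84840879612 Got return code 39",
    "6 DRLAYER ??? 2408 49 0 84840879612 Bytes written 0",
    "2 DRLAYER ??? 2408 178 0 84840879612 Mapped ROFS error 39 to Foreign RAS error 2",
]

for number, rofs_code, ras_code in find_rofs_errors(sample):
    print(f"line {number}: ROFS error {rofs_code} mapped to RAS error {ras_code}")
```

To scan a saved trace, pass an open file object instead of the sample list, since the function only needs an iterable of lines.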
Extract from Wireshark:
3202 2016-05-11 17:01:22.798108000 172.30.182.113 172.30.182.17 TCP 54 21410 > irisa [RST, ACK] Seq=620079 Ack=3607 Win=0 Len=0
.... .... .1.. = Reset: Set
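To find resets in a longer capture, the packet list can be exported as text and filtered. The sketch below assumes each packet is a one-line summary with the TCP flags in square brackets, as in the extract above. The helper names tcp_flags and is_reset are hypothetical.

```python
import re

# TCP flags appear in the summary line as a bracketed list, e.g. "[RST, ACK]".
FLAGS_RE = re.compile(r"\[([A-Z, ]+)\]")


def tcp_flags(summary_line):
    """Return the set of TCP flag names found in one packet summary line."""
    match = FLAGS_RE.search(summary_line)
    if not match:
        return set()
    return {flag.strip() for flag in match.group(1).split(",")}


def is_reset(summary_line):
    """True when the summary line reports a packet with the RST flag set."""
    return "RST" in tcp_flags(summary_line)


# The summary line from the extract above.
line = (
    "3202 2016-05-11 17:01:22.798108000 172.30.182.113 172.30.182.17 "
    "TCP 54 21410 > irisa [RST, ACK] Seq=620079 Ack=3607 Win=0 Len=0"
)

if is_reset(line):
    print("TCP reset found:", line.split()[0])
```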
When a large number of backups run to the DR Series and the DRs become busy storing the data, some Windows clients exceed the TcpMaxDataRetransmissions value of 5 and the stream can be dropped.
The Windows default value for TcpMaxDataRetransmissions is 5; this value needs to be increased.
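A rough model shows why a higher limit gives a busy DR more time to respond. TCP doubles the retransmission timeout after each unanswered retransmission, so the time a sender waits before dropping the connection grows quickly with the retransmission limit. The sketch below is an idealised calculation, not a description of the exact Windows implementation. The 0.3 second starting timeout and the optional cap are assumptions chosen for illustration, and real timeouts are derived from the measured round-trip time.

```python
def time_until_abort(max_retransmissions, initial_rto=0.3, max_rto=None):
    """Estimate seconds of silence tolerated before the connection is dropped.

    initial_rto and max_rto are illustrative assumptions, not values taken
    from the Windows TCP/IP stack.
    """
    total = 0.0
    rto = initial_rto
    # One wait after the original send, then one wait after each retransmission.
    for _ in range(max_retransmissions + 1):
        total += rto
        rto *= 2  # exponential back-off
        if max_rto is not None:
            rto = min(rto, max_rto)
    return total


for limit in (5, 15):
    seconds = time_until_abort(limit, max_rto=60.0)
    print(f"limit {limit}: about {seconds:.1f} s before the stream is dropped")
```

Under these assumptions a limit of 5 tolerates well under a minute of unresponsiveness, while a limit of 15 tolerates several minutes.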
Fix:
Upgrade to NetVault 12.2.0.18 or later, where the fix for RDA plugin bug QS-1870 was applied.
This fix was introduced because of the difficulty of manually modifying “TcpMaxDataRetransmissions” on Windows servers after Windows 2008.
Workaround:
To lessen the likelihood of a peer disconnect, create the following registry value if it is not already present. The default value is 5. Try a value of 15 and rerun your backups. A value of 15 is the standard setting and matches the Linux default behaviour, whereas Windows defaults to 5.
Note: Take a registry export in regedit before making changes.
Note: This will most likely require a reboot of the server.
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters
Value Name: TcpMaxDataRetransmissions
Data Type: REG_DWORD - Number
Valid Range: 0 - 0xFFFFFFFF
Default: 5
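The workaround can also be scripted with the built-in reg.exe tool. The sketch below only builds and prints the two commands (export the key as a backup, then set the value) so they can be reviewed before being run from an elevated prompt. The backup file name and the function names are hypothetical. The apply function runs the commands and is not called by default.

```python
import subprocess
import sys

# Registry key and value name from the article. HKLM is the short form of
# HKEY_LOCAL_MACHINE accepted by reg.exe.
KEY = r"HKLM\System\CurrentControlSet\Services\Tcpip\Parameters"
VALUE_NAME = "TcpMaxDataRetransmissions"


def export_command(backup_path):
    """Build the reg.exe command that exports the key to a .reg backup file."""
    return ["reg", "export", KEY, backup_path, "/y"]


def set_command(retransmissions=15):
    """Build the reg.exe command that creates or updates the DWORD value."""
    return ["reg", "add", KEY, "/v", VALUE_NAME,
            "/t", "REG_DWORD", "/d", str(retransmissions), "/f"]


def apply(backup_path, retransmissions=15):
    """Back up the key, then set the value. Needs Windows and admin rights."""
    if sys.platform != "win32":
        raise RuntimeError("reg.exe is only available on Windows")
    subprocess.run(export_command(backup_path), check=True)
    subprocess.run(set_command(retransmissions), check=True)


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    # "tcpip-parameters-backup.reg" is a hypothetical file name.
    for command in (export_command("tcpip-parameters-backup.reg"), set_command(15)):
        print(subprocess.list2cmdline(command))
```

Printing first keeps the registry untouched until the commands have been checked. A reboot is still likely to be needed after the value is changed.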