Jobs for a client will fail with “Job manager lost message channel unexpectedly”.
When browsing the client plugins from "Manage Clients" or through a "Backup Selection", a spinning wheel appears and may seem to hang. It may return an "Error: Provider 'NVBUPit' failed" message.
The “Check Access” and “Firewall Test” will complete successfully for the affected client.
If NetVault is uninstalled on the client and reinstalled as a NetVault Server, the machine can browse its own plugins. However, if it is then added as a client to another server while still installed as a server, the same problem occurs.
The trace logs may contain entries similar to the following:
nvwsworker AAAA (Server)
===================
4 MESSAGE :05232 1080 0 004738.520465 Send 'CORE_PLG_GET_CLIENT_PROPERTIES_SCREEN_MSG' (2147) to NetVault:BBBB
2 WSPROV :05232 5346 0 004738.521123 Short timeout of 30 seconds
0 WSPROV :05232 3789 0 004808.523072 Failed to wait for reply from coreplugin
Cli Proxy BBBB (Server)
==============
4 MESSAGE :18896 1080 0 004738.521131 Send 'CORE_PLG_GET_CLIENT_PROPERTIES_SCREEN_MSG' (2147) to Cyborg:CCC
nvcoreplg CCC (client)
=============
4 MESSAGE :88561 1080 0 004738.575166 Send 'PLUGIN_DISPLAY_DISPOSABLE_SCREEN_MSG' (1717) to NetVault:BBBB
In essence, the traces above show that:
1. nvwsworker sends 'CORE_PLG_GET_CLIENT_PROPERTIES_SCREEN_MSG' to cliproxy BBBB
2. cliproxy BBBB forwards 'CORE_PLG_GET_CLIENT_PROPERTIES_SCREEN_MSG' to nvcoreplg on the client
3. In the meantime, nvwsworker reaches its 30-second short timeout without receiving a reply
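The timing in the nvwsworker trace can be checked with a short script. This is a minimal sketch, not a NetVault tool: the field layout in the regular expression and the reading of the sixth field as HHMMSS.microseconds are assumptions inferred only from the excerpts above, and the names TRACE_LINE, parse_stamp and seconds_between are hypothetical.

```python
import re
from datetime import timedelta

# Assumed layout, inferred from the excerpts in this article only:
# <level> <module> :<pid> <number> <number> <HHMMSS.microseconds> <message>
TRACE_LINE = re.compile(
    r"^(?P<level>\d+)\s+(?P<module>\S+)\s+:(?P<pid>\d+)\s+\d+\s+\d+\s+"
    r"(?P<stamp>\d{6}\.\d{6})\s+(?P<message>.*)$"
)


def parse_stamp(stamp: str) -> timedelta:
    """Convert an assumed HHMMSS.ffffff stamp into an offset from midnight."""
    clock, micros = stamp.split(".")
    return timedelta(
        hours=int(clock[0:2]),
        minutes=int(clock[2:4]),
        seconds=int(clock[4:6]),
        microseconds=int(micros),
    )


def seconds_between(first_line: str, second_line: str) -> float:
    """Return the elapsed seconds between two trace lines."""
    first = TRACE_LINE.match(first_line)
    second = TRACE_LINE.match(second_line)
    if first is None or second is None:
        raise ValueError("line does not match the assumed trace layout")
    delta = parse_stamp(second["stamp"]) - parse_stamp(first["stamp"])
    return delta.total_seconds()


# The two nvwsworker lines quoted in the trace excerpt.
sent = (
    "4 MESSAGE :05232 1080 0 004738.520465 Send "
    "'CORE_PLG_GET_CLIENT_PROPERTIES_SCREEN_MSG' (2147) to NetVault:BBBB"
)
failed = (
    "0 WSPROV :05232 3789 0 004808.523072 "
    "Failed to wait for reply from coreplugin"
)

elapsed = seconds_between(sent, failed)
print(f"{elapsed:.6f} seconds between send and failure")
```

Under these assumptions the gap between the send and the failure is just over 30 seconds, which is consistent with the "Short timeout of 30 seconds" line in the same trace.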
If a Wireshark or tcpdump capture is taken on the client, the packets show a sequence in which the client tries to send the installed plugins information and retransmits it a couple of times before issuing a TCP RST. This sequence can be found several times.
In this scenario, the problem was that Jumbo Frames were being used only on the client. Although in such situations oversized frames can normally be fragmented, other issues within the network prevented Jumbo Frames from being sent.
Workaround
Use a standard MTU rather than Jumbo Frames.
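One way to confirm whether a path carries a given MTU is to ping with fragmentation disallowed and a payload sized to fill the frame. The sketch below only builds the command; it does not run it. The 28-byte overhead is the IPv4 header without options (20 bytes) plus the ICMP echo header (8 bytes). The host name client.example.com and the function names are hypothetical, and the Linux branch assumes the iputils version of ping.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError("MTU is too small to carry an ICMP echo payload")
    return mtu - overhead


def ping_command(host: str, mtu: int, system: str = "Linux") -> list:
    """Build a ping command that forbids fragmentation for the given MTU."""
    payload = str(icmp_payload_for_mtu(mtu))
    if system == "Windows":
        # -f sets Don't Fragment, -l sets the payload size, -n the count.
        return ["ping", "-f", "-l", payload, "-n", "3", host]
    if system == "Linux":
        # -M do forbids fragmentation, -s sets the payload size, -c the count.
        return ["ping", "-M", "do", "-s", payload, "-c", "3", host]
    raise ValueError("only Windows and Linux are covered by this sketch")


# Hypothetical client name, used for illustration only.
host = "client.example.com"
for mtu in (1500, 9000):
    print(" ".join(ping_command(host, mtu, "Linux")))
```

If the command for 9000 fails while the one for 1500 succeeds, some hop between the two machines is not passing Jumbo Frames, which matches the situation described in this article.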
The items listed below can also help when dealing with a “Job manager lost message channel unexpectedly” situation, even if the behavior differs from the one detailed in this KB. Most of the time the cause varies case by case, but reviewing the following items may help diagnose the source of the problem:
1. <netvault home>\config\Machine.dat on the server and client. This file holds the Server names and IPs.
2. Software and Hardware Firewalls.
3. Duplicated IPs.
4. Forward and reverse lookups for the IP resolving properly.
5. OS Connection timeouts (the default values can be doubled):
a. For Windows, the following values can be created or modified through regedit under HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters. Note that a reboot is required to apply these changes:
i. KeepAliveTime
ii. KeepAliveInterval
iii. TcpMaxDataRetransmissions
b. For Linux, similar values can be modified through sysctl:
i. net.ipv4.tcp_keepalive_time
ii. net.ipv4.tcp_keepalive_intvl
iii. net.ipv4.tcp_retries2
6. SELinux.
7. Preferred and/or barred IPs and firewall ports, set through the NetVault Configurator (or txtconfig on Linux).
8. If using NV 11.x.x.x: in NetVault’s WebUI, under “Change Settings”, “Server Settings”, “Web Service”, the Physical client timeout values (Short/Medium/Long) can be increased to double their original values.
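For the Linux sysctl values in item 5.b, the doubling can be sketched as follows. This is an illustration only: the sample numbers are common kernel defaults used as stand-in input and should be replaced with the values actually read from the machine, and the names read_sysctl, doubled_settings and SAMPLE_CURRENT are hypothetical.

```python
from pathlib import Path
from typing import List, Mapping

KEYS = (
    "net.ipv4.tcp_keepalive_time",
    "net.ipv4.tcp_keepalive_intvl",
    "net.ipv4.tcp_retries2",
)


def read_sysctl(key: str, root: Path = Path("/proc/sys")) -> int:
    """Read one integer sysctl value from /proc/sys (Linux only)."""
    return int((root / key.replace(".", "/")).read_text().strip())


def doubled_settings(current: Mapping[str, int]) -> List[str]:
    """Return sysctl.conf style lines with every value doubled."""
    return [f"{key} = {value * 2}" for key, value in current.items()]


# Sample input only: common kernel defaults, not values from this case.
# On a real machine: current = {key: read_sysctl(key) for key in KEYS}
SAMPLE_CURRENT = {
    "net.ipv4.tcp_keepalive_time": 7200,
    "net.ipv4.tcp_keepalive_intvl": 75,
    "net.ipv4.tcp_retries2": 15,
}

for line in doubled_settings(SAMPLE_CURRENT):
    print(line)
```

The printed lines can be reviewed before being applied with sysctl. The script itself changes nothing on the system.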