Well, since I haven't had time to post any blog entries this month, I thought I'd do so while fighting a really big problem with certain client file servers.

Over the past few months, I've noticed my VPN tunnel connections using the Sonicwall GVC dropping more and more often.  Now and then my VPN connection would just "die", yet my local internet was fine.  I could drop the tunnel and reconnect, and everything would be fine.  For a while.  An inconvenience but nothing to cry about.

As time went on, this became more frequent.  Recently it's gotten to the point where I have to bounce back and forth between the primary and backup firewalls trying to get a VPN connection that will stay working for 5 minutes.  Wow, has it been frustrating.  And it always seemed to happen when the traffic load got higher, like when a large image appeared or a logo screen popped up during RDP sessions.

These servers were pretty static for the last 2 years except for the usual Microsoft updates.  All were HP Compaq DL380 G3 units running Windows Server 2003.  W2K3 had been patched up to SP2, though I forget how long ago.

All along, we never really noticed any "internal" LAN issues.  Occasionally I'd get complaints on speed but usually the problem was gone by the time I was on the case (within 15-30 minutes of the report).  Chalking that up to typical network bandwidth spikes was easy.

But, enter client #2.  These guys have brand new DL360 G5 units running all sorts of the newer drivers, one even running Server 2008 SP1.  Client #2 starts seeing some really strange traffic issues, like DHCP suddenly not responding to station requests.  Bouncing the switch seemed to help, and rebooting the DHCP server (the W2K8 box) fixed it one time.  Then those issues went away for about a week, so they were dismissed as deployment gremlins.  All except one lonely IBM ThinkPad that simply would not get an IP from DHCP.  It could ping devices on the same switch but saw nothing on the other 48-port NetGear Gigabit switch.  Now things are getting bizarre.

Devices link up and see same-switch peers but cannot see across multiple switches.  Since all the devices involved at both client 1 and client 2 had worked flawlessly in the past, there had to be something common to everyone.  I started going down the road of 'blame the switch' and had some credible evidence too.  But overall, I just wasn't seeing the same issues internally on the scale and frequency I was seeing them remotely.

Then tonight, in yet another recon mission into possible causes, I noticed something odd.  Client 1 Win2K8 server was showing "no buffers" receive errors on one NIC.  But not on the other NIC.  This made no sense since the NICs were teamed and should be multi-casting, so I should see duplicate errors on both NICs.

Finally I had something more "concrete" to work with in the search engines.  I quickly came up with a very, very long thread on the HP support forums and it darn near described my issues to the last detail!  AHA!  Gotcha 🙂

Here's the link to the huge thread of a whole lot of frustrated HP server customers:  http://forums11.itrc.hp.com/service/forums/bizsupport/questionanswer.do?threadId=1153566

Apparently there are issues with Broadcom NICs when certain advanced TCP Offload and Chimney settings are enabled in Windows.  Certain Broadcom NICs have the ability to take over some of the TCP traffic work instead of leaving it to the OS.  These features are apparently enabled in W2k3 SP2 by default and can cause all sorts of traffic issues depending on the load being put on the NIC.

The HP Compaq solution is to update the firmware on the NICs, update the HP NIC drivers, and then disable the offending features in the registry and in the NIC advanced settings.  The specific steps listed (your specific cp*.exe file may differ):

1: Upgrade BIOS and firmware from disk FW800.2008_0207.37.iso (firmware-8.00-0.zip)

The order is important: every time you install the driver, the settings go back to their default, which is enabled.
2: Upgrade drivers: cp008415.exe
3: Upgrade NCU (Network Configuration Utility): cp008413.exe

4: Edit the registry.  Under this key, set each of the following values to 0:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
EnableRSS = 0
EnableTCPA = 0
EnableTCPChimney = 0

5: Run the following command: netsh int ip set chimney DISABLED

6: Go into NCU and, on each NIC, open the advanced settings:
Remove the enabled tick for TCP Offload Engine.
Remove the enabled tick for Receive-Side Scaling.

7: Reboot
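
Since the driver install resets these settings and I have a dozen servers to get through, the registry part of step 4 is worth scripting.  Here's a minimal Python sketch that builds a .reg file for the three values.  The function name build_reg_file and the constant names are my own, not part of HP's instructions, and this only covers the registry edit: the netsh command in step 5 and the NCU tick boxes in step 6 still have to be done separately.

```python
# Sketch only: generates .reg file text for the three Tcpip parameters in step 4.
# build_reg_file, TCPIP_KEY and OFFLOAD_VALUES are illustrative names.

TCPIP_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

# Value names from step 4 (registry value names are case-insensitive);
# 0 disables each feature.
OFFLOAD_VALUES = {
    "EnableRSS": 0,
    "EnableTCPA": 0,
    "EnableTCPChimney": 0,
}


def build_reg_file(key, values):
    """Return .reg file text that sets each name under key to a DWORD value."""
    lines = ["Windows Registry Editor Version 5.00", "", "[" + key + "]"]
    for name, value in values.items():
        # .reg files write DWORDs as eight hexadecimal digits.
        lines.append('"{}"=dword:{:08x}'.format(name, value))
    # Use CRLF line endings and finish with a blank line.
    return "\r\n".join(lines) + "\r\n\r\n"
```

Write the returned text to a file (say, disable_offload.reg, opened with newline="" so the CRLFs are not doubled), copy it to each server and import it with regedit /s disable_offload.reg.  Re-import it after any later driver update, since that is when the defaults come back.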

I'm going through all of this tonight, and so far it seems to have helped Client 2 with the newer G5 servers.  However, I'm not seeing the same improvement with Client 1 and the G3 servers.  It could be that Client 1 has 12 servers and I've only updated 1.  The problem now is keeping the VPN connection alive long enough to get the updates done on the other servers.  We shall see...