Hi All,
This one is very similar to another thread on these forums.
Really hoping the community might be able to provide some direction on some networking issues I've been having since the ESXi 5 upgrade at my site. Some details:
- ESXi 5
- 8x HP DL380p Gen8 servers
- HP ProLiant networking infrastructure
Basically, since the upgrade (or I should say, fresh installation of ESXi 5) there have been two networking-based issues occurring:
1. Randomly, a vmnic will lose connectivity to the physical network.
2. The physical network can no longer talk to the VM network through a vSwitch.
The network configuration has 4 uplinks going to 2 separate switches (not aggregated). They tag some VLANs, however ignore that element for now (and yes, the default VLANs are the same).
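To picture the layout, the vSwitch side looks roughly like this (just a sketch with placeholder names and VLAN IDs, not my actual esxcfg-vswitch -l output):
~ # esxcfg-vswitch -l
Switch Name    Num Ports  Used Ports  Configured Ports  MTU   Uplinks
vSwitch0       128        12          128               1500  vmnic0,vmnic1,vmnic2,vmnic3
  PortGroup Name       VLAN ID  Used Ports  Uplinks
  VM Network           20       8           vmnic0,vmnic1,vmnic2,vmnic3
  Management Network   0        1           vmnic0,vmnic1,vmnic2,vmnic3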
I'll start with issue 1, as I've been working through a support case with VMware that's gone nowhere at this stage and can't progress until the issue occurs again. This morning I came to site and found that one of the ESX servers in my HA/DRS cluster was disconnected. A ping from my workstation suggested the machine was off the network. When I went to the host's console I restarted management services and found everything was OK again - with the exception that some VMs' network connectivity was still down.
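(For the record, "restarted management services" there was the Restart Management Agents option at the console - from the shell I believe the rough equivalent is the below, though treat the exact scripts as approximate:
~ # /etc/init.d/hostd restart
~ # /etc/init.d/vpxa restart
or ~ # services.sh restart to bounce the lot.)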
When I jumped into vSphere I found that one of the 4 vmnics could NOT see any observed IP range - the rest were OK. This is a single NIC too.
I then jumped into the vMA and found the VMs that didn't seem to have network connectivity were also on this vmnic. So as a work-around, I set this vmnic to Unused on the vSwitch and on the inheriting port groups those VMs belonged to, and they then had connectivity (rough CLI equivalent below). I'm willing to bet that the management interface was on that vmnic before the restart of services.
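For reference, the CLI version of that work-around would be roughly the following - the vSwitch name, port group name and vmnic list are placeholders for my actual config, and the exact esxcli options are from memory, so treat them as approximate:
~ # esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --active-uplinks=vmnic0,vmnic1,vmnic2
~ # esxcli network vswitch standard portgroup policy failover set --portgroup-name="VM Network" --active-uplinks=vmnic0,vmnic1,vmnic2
Leaving vmnic3 out of the active (and standby) uplinks is what puts it in Unused.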
So right now you're thinking: faulty NIC, or a switch configuration variation on that port? Perhaps, but what makes this odd is that this exact same issue occurred on another server with another NIC (same models, however). With that, I decided to do some network troubleshooting with mirrored ports. Some results:
Host A (physical) pings Host B (VM on vmnic3)
The ARP broadcast gets forwarded from the switch to the ESX host, however the VM never gets the request.
Host B pings Host A
The ARP broadcast leaves the vSwitch, goes out the uplink and makes it to Host A. Host A responds, and I can see the reply on the mirror port being sent back to the vSwitch, but it never makes it to the VM.
Host A pings Host B again
Host A now has the IP/MAC mapping (ARP), so a directed ICMP echo is sent; it gets to the vSwitch but never hits the VM.
Host B pings Host A again
I had to add a static ARP entry to get the ICMP happening, but while the ICMP echo goes out to the physical device and an ICMP reply comes back to the host, it never reaches the VM.
Weird huh? VMware support said the same thing.
Note that I've only been testing this with the failed vmnic, so traffic isn't going through the other vmnics. I can talk across the vSwitch, but not out to the physical network (or rather, the physical side's responses aren't making their way back through the vSwitch).
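Next time it fails I'm also planning to watch the per-port counters in esxtop while the pings run (press n for the network view and keep an eye on PKTRX/s and %DRPRX for the VM's port and vmnic3), to see whether the frames are being dropped at the vSwitch port or somewhere past it. For now, here's the driver info for the affected vmnic: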
~ # esxcfg-nics -l
~ # ethtool -i vmnic3
driver: tg3
version: 3.120h.v50.2
firmware-version: 5719-v1.24 NCSI v1.0.60.0
bus-info: 0000:03:00.3
I've checked the HCL a number of times here, and the server and NIC hardware and firmware versions are supported. I did have to use the HP ESXi image, however I'm told that's still supported by VMware. I've also taken VLANs out of the mix here as well, and I've swapped switches, cables and ports (both new ones and already-working ESX links) to rule out anything non-VMware.
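For completeness, this is roughly how I've been checking the driver/module side of things (command names should be right for ESXi 5, but treat them as approximate):
~ # vmkload_mod -s tg3
~ # esxcli software vib list | grep -i tg3
~ # esxcli network nic get -n vmnic3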
In the end, I have to put the host into maintenance mode and restart the server to get the NIC working again. I can only assume it's a VMware issue, the hardware is not actually supported (when the HCL says it is), or I've got really unlucky and there's a bad batch of Broadcom NICs getting about.
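One thing I haven't tried yet (happy to hear if anyone has) is bouncing just the NIC rather than the whole host, assuming these sub-commands are available on this build:
~ # esxcli network nic down -n vmnic3
~ # esxcli network nic up -n vmnic3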
Now, as for item 2, that's a bit more intermittent. Basically, we vSphere administrators find we can't manage VMs through the VMware console. When this occurs, if we ping the ESX host's management interface we don't get a response. Other parts of the network seem OK, presumably because they already have that ARP entry in their cache - which is likely why HA remains OK.
We see that the ARP request again makes it to the uplink switch and seems to get to the management vmk0, and the ARP reply goes back (I confirmed this via tcpdump on the SSH console). From there I can't determine if it makes it to the vSwitch, but in any event it doesn't make it back to the pinging workstation.
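The capture on the host side was roughly this (filter from memory):
~ # tcpdump-uw -i vmk0 arp or icmp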
This goes on for a few minutes, and then after a time everything starts working OK again - usually triggered by another host making a connection to that host.
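I assume even something as simple as this from another host in the cluster would do it, though I haven't verified that specific test:
~ # vmkping <management IP of the affected host>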
Any help here would be great! I've raised 2 cases with VMware but I'm not getting anywhere, and I'd rather not have to wait for the issues to occur again. To make matters worse, we're looking at upgrading our control systems' virtual infrastructure and calling contractors in from overseas to support that process. I have to delay that until I can get to the bottom of this issue.
Let me know if I've been too vague or if more specific information is needed.
Thanks muchly!