Channel: VMware Communities : All Content - VMware ESXi 5

system outage, looking for answers


Need some help, guys. We had a major system outage last week, and I’m a little confused about what happened and why I was unable to recover from it gracefully. I’ve lost some of my confidence in vSphere, and I would like to regain it.

 

So, a little about the environment: we are currently running ESXi 5.0 U1 on a 3-node cluster of HP Gen 8 servers, with an HP P4500 iSCSI SAN for the VMFS datastores. ESXi is installed on HP CF cards in the servers. This cluster has been up and running for about 6 months without a single hiccup before this.

 

On Wednesday we migrated 2 of our database VMs (SQL and Progress) to this 5.0 cluster, placing them on different hosts in the cluster. The VMs had been running on an older 2-node cluster with ESXi 4.1 U2. The machines migrated fine, the tools updated fine, we rebooted, and everything seemed well.

 

About 2 hours after that migration, we started getting calls that our Progress database was down and users couldn’t connect. Then we started getting more calls that other machines were unreachable. Looking at vCenter, I could see that all the VMs in question were on one host, and I was unable to open the console within vSphere for any VM on that host. The host showed as connected, and all its VMs showed as connected, but I couldn’t open the console or a remote desktop session to any of the VMs on that host. I started investigating, and sure enough, that host had dropped all connections to the iSCSI SAN. The paths showed as dead. The NIC ports looked active and the switch ports showed activity, but the connections were down. I could still ping the management address for the host, though, and the ports for vMotion were still up.

 

At this point I tried to vMotion the Progress database VM off that host, but it wouldn’t migrate; it just sat at 8%, preparing to migrate. I tried other VMs with the same result. I began to wonder why HA hadn’t kicked in and why I couldn’t migrate anything. Then the host started disconnecting from the cluster. I could still ping the host, but vSphere showed it as disconnected. I couldn’t move any of my VMs, and I couldn’t get at the host through vCenter, through the vSphere Client pointed directly at the host, or through the DCUI.

 

So I called VMware support and got an engineer on the line with me, and it became evident that we were going to have to power cycle that host and crash all the VMs running on it. This wasn’t a very pleasant answer for me, because that Progress database is our main production system and I was afraid of corruption. We had no choice, though, so we did it. When the host came back up, the VMs luckily came up okay. The VMware engineer dug through the logs and said the issue was a particular NIC driver with known issues. He showed me the KB article on it, and it did seem to be a known problem. We updated those drivers on all the hosts, and that was it.

 

My concern is this: how come the cluster ran fine for 6 months without a problem, and how come the redundant path to the SAN didn’t keep the connection active when the path with the bad NIC driver failed? I have 2 different paths to the iSCSI SAN, through 2 different NICs. The other card had no known driver problems. Why did both paths fail just because of a known driver problem on one of the cards? Even more concerning, how come I couldn’t migrate anything, and why didn’t HA kick in?

 

Sorry for the novel here, but without all the details it doesn’t make much of a story. My biggest concern is why I couldn’t migrate anything. In the event of a host failure, what are you supposed to do to migrate machines if they won’t migrate through the vSphere Client? We were down for about 2.5 hours, and upper management has thrown a lot of questions at me about why my “highly available, redundant system” took hours to recover…

 

Any ideas here, guys? Thoughts on how I should have handled it differently, or reasons why I should be confident everything is fine now?

 

Thanks for your time

Kevin

