Hi all,
On our freshly installed vSphere 5 Update 1 environment, we are seeing latency warnings.
First of all, a short overview:
2 x HP DL380 G7 (each host has 2 x QLE2560 FC Single Port HBAs)
1 x Hitachi Unified Storage 110
Everything is running at 8Gb/s over FC. The SAN is configured as follows:
- Disk0 to Disk6 = RAID Group 000 / RAID Level: RAID5(6D+1P) = VOLUME 0200 = 3.1TB
- Disk7 to Disk13 = RAID Group 001 / RAID Level: RAID5(6D+1P) = VOLUME 0201 = 3.1TB
- Disk23 = SpareDisk
In our vSphere environment we have two datastores, "DataStore_Prod01" and "DataStore_Test01". Each datastore is 3.1TB, is formatted as VMFS5, and sits on its own RAID group. The storage is directly connected to the hosts over FC; there is no FC switch. At the moment I have 30 VMs running (15 VMs on "DataStore_Prod01" and 15 VMs on "DataStore_Test01"), and those VMs are idle most of the time.
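For reference, the NAA device IDs quoted in the warnings below map to these two datastores. I verified the mapping from the ESXi shell (these are just the verification commands as I understand them on ESXi 5.x, not a fix; output omitted):

esxcli storage vmfs extent list
esxcli storage core device list -d naa.60060e80105395f0056fc1ef000000c8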
Then suddenly I started seeing latency warnings on both hosts for both datastores. They look like this:
Device naa.60060e80105395f0056fc1ef000000c8
performance has deteriorated. I/O latency
increased from average value of 3298
microseconds to 197574 microseconds.
warning
10.07.2012 21:16:37
esx001.xxxx.xxxx
Device naa.60060e80105395f0056fc1ef000000c9
performance has deteriorated. I/O latency
increased from average value of 2182
microseconds to 277405 microseconds.
warning
10.07.2012 21:16:06
esx002.xxxx.xxxx
All our HBAs have the following firmware and driver versions:
BIOS version 3.00
FCODE version 3.15
EFI version 2.21
Flash FW version 5.04.01
Driver version 901.k1.1-14vmw
As recommended by Hitachi Data Systems (HDS) for SAS drives, I changed the LU queue depth to 32. The value of Disk.SchedNumReqOutstanding was already at 32.
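For completeness, this is roughly how I applied the change on each host (the qla2xxx module parameter and the advanced setting below are my understanding of how this is done on ESXi 5.x; the module parameter only takes effect after a host reboot):

# set the per-LUN queue depth of the QLogic driver to 32 (reboot required)
esxcli system module parameters set -m qla2xxx -p ql2xmaxqdepth=32
# verify the parameter afterwards
esxcli system module parameters list -m qla2xxx
# check the global outstanding-requests setting (was already 32 in our case)
esxcfg-advcfg -g /Disk/SchedNumReqOutstanding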
This is what VMware Support has found out:
hostname: ESX001
VMware ESXi 5.0.0 build-623860
# We have 4 vmkernel files
# Time range: 2012-05-25 06:16:54 - 2012-06-20 14:02:51, unique log entries for 22 different days
# The error: "Cmd xxx to dev xxx failed" has been reported 60 times in that period
# The sum of all SCSI error codes in all 4 vmkernel log files (possible/valid sense data, no mpx devices)
H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0 # FREQUENCY: 1
H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0 # FREQUENCY: 2
H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 # FREQUENCY: 6
H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x25 0x0 # FREQUENCY: 6
H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0 # FREQUENCY: 9
H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0 # FREQUENCY: 12
H:0x0 D:0x2 P:0x5 Possible sense data: 0x0 0x0 0x0 # FREQUENCY: 12
# The following LUNs were reported in combination with above SCSI error codes
naa.60060e80105395f0056fc1ef000000c8 # FREQUENCY: 20
naa.60060e80105395f0056fc1ef000000c9 # FREQUENCY: 28
# That translates into vmfs datastore names (only vmfs, not NFS or RDM)
naa.60060e80105395f0056fc1ef000000c8:1 DataStore_Prod01
naa.60060e80105395f0056fc1ef000000c9:1 DataStore_Test01
# The SCSI error codes in the vmkernel logs have been observed during 3 different days
2012-05-25 # FREQUENCY: 7
2012-06-13 # FREQUENCY: 15
2012-06-20 # FREQUENCY: 26
Column legend:
- Host status: these codes potentially come from the firmware on the host adapter
- Device status: SCSI status codes that appear in the status byte returned when processing of a command completes
- Sense key: SCSI sense keys appear in the sense data available when a command completes with a CHECK CONDITION status
- ASC + ASCQ: Additional Sense Code + Additional Sense Code Qualifier
naa.60060e80105395f0056fc1ef000000c8
Host: 0x00 DID_OK Device: 02h CHECK CONDITION Sense: 0h NO SENSE ASC: 00h NO ADDITIONAL SENSE INFORMATION
Host: 0x05 DID_ABORT Device: 00h GOOD Sense: 0h NO SENSE ASC: 00h NO ADDITIONAL SENSE INFORMATION
Host: 0x05 DID_ABORT Device: 00h GOOD Sense: 2h NOT READY ASC: 3Ah MEDIUM NOT PRESENT
Host: 0x05 DID_ABORT Device: 00h GOOD Sense: 5h ILLEGAL REQUEST ASC: 25h LOGICAL UNIT NOT SUPPORTED
naa.60060e80105395f0056fc1ef000000c9
Host: 0x00 DID_OK Device: 02h CHECK CONDITION Sense: 0h NO SENSE ASC: 00h NO ADDITIONAL SENSE INFORMATION
Host: 0x05 DID_ABORT Device: 00h GOOD Sense: 0h NO SENSE ASC: 00h NO ADDITIONAL SENSE INFORMATION
Host: 0x05 DID_ABORT Device: 00h GOOD Sense: 5h ILLEGAL REQUEST ASC: 24h INVALID FIELD IN CDB
Host: 0x05 DID_ABORT Device: 00h GOOD Sense: 5h ILLEGAL REQUEST ASC: 25h LOGICAL UNIT NOT SUPPORTED
----------------
hostname: ESX002
VMware ESXi 5.0.0 build-623860
# We have 4 vmkernel files
# Time range: 2012-05-25 05:49:08 - 2012-06-20 14:05:17, unique log entries for 21 different days
# The error: "Cmd xxx to dev xxx failed" has been reported 67 times in that period
# The sum of all SCSI error codes in all 4 vmkernel log files (possible/valid sense data, no mpx devices)
H:0x0 D:0x2 P:0x5 Possible sense data: 0x2 0x3a 0x1 # FREQUENCY: 2
H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0 # FREQUENCY: 2
H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x1 # FREQUENCY: 3
H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x25 0x0 # FREQUENCY: 5
H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0 # FREQUENCY: 6
H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0 # FREQUENCY: 7
H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0 # FREQUENCY: 17
H:0x0 D:0x2 P:0x5 Possible sense data: 0x0 0x0 0x0 # FREQUENCY: 18
# The following LUNs were reported in combination with above SCSI error codes
naa.60060e80105395f0056fc1ef000000c8 # FREQUENCY: 18
naa.60060e80105395f0056fc1ef000000c9 # FREQUENCY: 42
# That translates into vmfs datastore names (only vmfs, not NFS or RDM)
naa.60060e80105395f0056fc1ef000000c8:1 DataStore_Prod01
naa.60060e80105395f0056fc1ef000000c9:1 DataStore_Test01
# The SCSI error codes in the vmkernel logs have been observed during 6 different days
2012-05-25 # FREQUENCY: 4
2012-05-29 # FREQUENCY: 4
2012-06-06 # FREQUENCY: 4
2012-06-14 # FREQUENCY: 7
2012-06-13 # FREQUENCY: 13
2012-06-20 # FREQUENCY: 28
# The following SCSI commands were reported to have failed
6 times 0x4d = LOG SENSE
17 times 0xc0
===> Per Lun/Per Error:
Column legend:
- Host status: these codes potentially come from the firmware on the host adapter
- Device status: SCSI status codes that appear in the status byte returned when processing of a command completes
- Sense key: SCSI sense keys appear in the sense data available when a command completes with a CHECK CONDITION status
- ASC + ASCQ: Additional Sense Code + Additional Sense Code Qualifier
naa.60060e80105395f0056fc1ef000000c8
Host: 0x00 DID_OK Device: 02h CHECK CONDITION Sense: 0h NO SENSE ASC: 00h NO ADDITIONAL SENSE INFORMATION
Host: 0x00 DID_OK Device: 02h CHECK CONDITION Sense: 2h NOT READY ASC: 3Ah MEDIUM NOT PRESENT - TRAY CLOSED
Host: 0x05 DID_ABORT Device: 00h GOOD Sense: 0h NO SENSE ASC: 00h NO ADDITIONAL SENSE INFORMATION
Host: 0x05 DID_ABORT Device: 00h GOOD Sense: 2h NOT READY ASC: 3Ah MEDIUM NOT PRESENT - TRAY CLOSED
Host: 0x05 DID_ABORT Device: 00h GOOD Sense: 5h ILLEGAL REQUEST ASC: 25h LOGICAL UNIT NOT SUPPORTED
naa.60060e80105395f0056fc1ef000000c9
Host: 0x00 DID_OK Device: 02h CHECK CONDITION Sense: 0h NO SENSE ASC: 00h NO ADDITIONAL SENSE INFORMATION
Host: 0x00 DID_OK Device: 02h CHECK CONDITION Sense: 2h NOT READY ASC: 3Ah MEDIUM NOT PRESENT - TRAY CLOSED
Host: 0x05 DID_ABORT Device: 00h GOOD Sense: 0h NO SENSE ASC: 00h NO ADDITIONAL SENSE INFORMATION
Host: 0x05 DID_ABORT Device: 00h GOOD Sense: 5h ILLEGAL REQUEST ASC: 24h INVALID FIELD IN CDB
Host: 0x05 DID_ABORT Device: 00h GOOD Sense: 5h ILLEGAL REQUEST ASC: 25h LOGICAL UNIT NOT SUPPORTED
Conclusion by VMware: it appears to be an issue with the storage, as both hosts are receiving check conditions and aborts from the device (the storage array).
Also, after changing the queue depth to 32, I am still getting latency warnings. What really drives me crazy is that I can't reproduce them: even under high I/O I never get any warnings. They also always seem to appear between xx:15 and xx:16, but there is no scheduled job, neither on the storage array nor in the vSphere environment, that could cause that kind of impact.
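My next idea for catching whatever happens around xx:15 is to let esxtop run in batch mode across that window and look at the device latencies (DAVG/KAVG/GAVG per cmd) afterwards. A rough sketch; the interval, iteration count and output path are just an example covering one hour:

# capture one sample every 15 seconds for one hour into a CSV for later analysis
esxtop -b -d 15 -n 240 > /vmfs/volumes/DataStore_Test01/esxtop-esx001.csv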
HDS Support told me that there is no problem with the storage array. We did find a wrongly configured cache size on the SAN, but even after correcting it we are still getting latency warnings, not all the time, but sporadically (once a day or so).
Any suggestion on how to solve this problem is much appreciated.
If you have any questions regarding my configuration, don't hesitate to contact me.
Best regards,
Marc