I/O latency warnings / DL380 G7 / HUS110

Hi all,

 

On our freshly installed vSphere 5 Update 1 environment, we are seeing latency warnings.

 

First of all, a short overview:

2 x HP DL380 G7 (each host has 2 x QLE2560 FC Single Port HBAs)

1 x Hitachi Unified Storage 110

 

Everything is running at 8Gb/s over FC. The SAN is configured as follows:

- Disk0 to Disk6 = RAID Group 000 / RAID Level: RAID5(6D+1P) = VOLUME 0200 = 3.1TB

- Disk7 to Disk13 = RAID Group 001 / RAID Level: RAID5(6D+1P) = VOLUME 0201 = 3.1TB

- Disk23 = Spare Disk

 

So in our vSphere environment we have two datastores, called "DataStore_Prod01" and "DataStore_Test01". Each datastore is 3.1TB, has been formatted as VMFS5, and sits on its own RAID group. The storage is directly connected to the hosts over FC; there is no FC switch. At the moment I have 30 VMs running (15 on "DataStore_Prod01" and 15 on "DataStore_Test01"), and those VMs are idle most of the time.
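
For reference, this is how I verify which naa device backs which datastore from the ESXi shell (the same naa IDs show up in the warnings further down):

# list the VMFS extents and their backing devices
esxcli storage vmfs extent list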

 

Then, out of nowhere, I started seeing latency warnings on all hosts and for all datastores. They look like this:

 

Device naa.60060e80105395f0056fc1ef000000c8
performance has deteriorated. I/O latency
increased from average value of 3298
microseconds to 197574 microseconds.
warning
10.07.2012 21:16:37
esx001.xxxx.xxxx

 

Device naa.60060e80105395f0056fc1ef000000c9
performance has deteriorated. I/O latency
increased from average value of 2182
microseconds to 277405 microseconds.
warning
10.07.2012 21:16:06
esx002.xxxx.xxxx
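
The same messages also land in the host logs; I pull them out with something like this (log path as on ESXi 5.0):

# on ESXi 5.0 the warning is logged by ScsiDeviceIO in vmkernel.log
grep -i "performance has deteriorated" /var/log/vmkernel.log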

All our HBAs have the following firmware and driver versions:

BIOS version 3.00
FCODE version 3.15
EFI version 2.21
Flash FW version 5.04.01

Driver version 901.k1.1-14vmw

As recommended by HDS, I changed the LU queue depth on the HBAs to 32 (Hitachi Data Systems recommends an LU queue depth of 32 for SAS drives). The value Disk.SchedNumReqOutstanding was already at 32.
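
For reference, this is roughly how I checked and changed it (a sketch for the qla2xxx driver that ships with ESXi 5.0; the module and parameter names may differ with other driver versions, and the setting only takes effect after a reboot):

# show the current QLogic module options
esxcli system module parameters list -m qla2xxx | grep ql2xmaxqdepth

# set the LU queue depth to 32 (reboot required)
esxcli system module parameters set -m qla2xxx -p ql2xmaxqdepth=32

# Disk.SchedNumReqOutstanding is a global advanced setting on ESXi 5.0
esxcli system settings advanced list -o /Disk/SchedNumReqOutstanding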

 

This is what VMware Support has found out:

 

hostname: ESX001

VMware ESXi 5.0.0 build-623860

# We have 4 vmkernel files

                # Time range: 2012-05-25 06:16:54 - 2012-06-20 14:02:51, unique log entries for 22 different days

                # The error: "Cmd xxx to dev xxx failed" has been reported 60 times in that period

                # The sum of all SCSI error codes in all 4 vmkernel log files (possible/valid sense data, no mpx devices)

                H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0                        # FREQUENCY: 1

                H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0                        # FREQUENCY: 2

                H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0                               # FREQUENCY: 6

                H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x25 0x0                        # FREQUENCY: 6

                H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0                           # FREQUENCY: 9

                H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0                              # FREQUENCY: 12

                H:0x0 D:0x2 P:0x5 Possible sense data: 0x0 0x0 0x0                           # FREQUENCY: 12

 

                # The following LUNs were reported in combination with above SCSI error codes

                naa.60060e80105395f0056fc1ef000000c8                            # FREQUENCY: 20

                naa.60060e80105395f0056fc1ef000000c9                            # FREQUENCY: 28

                               # That translates into vmfs datastore names (only vmfs, not NFS or RDM)

                               naa.60060e80105395f0056fc1ef000000c8:1 DataStore_Prod01

                               naa.60060e80105395f0056fc1ef000000c9:1 DataStore_Test01

 

                # The SCSI error codes in the vmkernel logs have been observed during 3 different days

                2012-05-25         # FREQUENCY: 7

                2012-06-13         # FREQUENCY: 15

                2012-06-20         # FREQUENCY: 26

 

Host status   = these codes potentially come from the firmware on the host adapter
Device status = the SCSI status code in the status byte returned when processing of a command completes
Sense key     = the SCSI sense key in the sense data, available when a command completes with a CHECK CONDITION status
ASC + ASCQ    = Additional Sense Code + Additional Sense Code Qualifier

===============================       ===================================  ================================            ================================

naa.60060e80105395f0056fc1ef000000c8

Host: 0x00 DID_OK                      Device: 02h CHECK CONDITION          Sense: 0h NO SENSE                          ASC: 00h NO ADDITIONAL SENSE INFORMATION

Host: 0x05 DID_ABORT                   Device: 00h GOOD                     Sense: 0h NO SENSE                          ASC: 00h NO ADDITIONAL SENSE INFORMATION

Host: 0x05 DID_ABORT                   Device: 00h GOOD                     Sense: 2h NOT READY                         ASC: 3Ah MEDIUM NOT PRESENT

Host: 0x05 DID_ABORT                   Device: 00h GOOD                     Sense: 5h ILLEGAL REQUEST                   ASC: 25h LOGICAL UNIT NOT SUPPORTED

naa.60060e80105395f0056fc1ef000000c9

Host: 0x00 DID_OK                      Device: 02h CHECK CONDITION          Sense: 0h NO SENSE                          ASC: 00h NO ADDITIONAL SENSE INFORMATION

Host: 0x05 DID_ABORT                   Device: 00h GOOD                     Sense: 0h NO SENSE                          ASC: 00h NO ADDITIONAL SENSE INFORMATION

Host: 0x05 DID_ABORT                   Device: 00h GOOD                     Sense: 5h ILLEGAL REQUEST                   ASC: 24h INVALID FIELD IN CDB

Host: 0x05 DID_ABORT                   Device: 00h GOOD                     Sense: 5h ILLEGAL REQUEST                   ASC: 25h LOGICAL UNIT NOT SUPPORTED

 

 

----------------

 

hostname: ESX002

VMware ESXi 5.0.0 build-623860

 

# We have 4 vmkernel files

                # Time range: 2012-05-25 05:49:08 - 2012-06-20 14:05:17, unique log entries for 21 different days

                # The error: "Cmd xxx to dev xxx failed" has been reported 67 times in that period

                # The sum of all SCSI error codes in all 4 vmkernel log files (possible/valid sense data, no mpx devices)

                H:0x0 D:0x2 P:0x5 Possible sense data: 0x2 0x3a 0x1                        # FREQUENCY: 2

                H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0                        # FREQUENCY: 2

                H:0x5 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x1                        # FREQUENCY: 3

                H:0x5 D:0x0 P:0x0 Possible sense data: 0x5 0x25 0x0                        # FREQUENCY: 5

                H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0                               # FREQUENCY: 6

                H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0                           # FREQUENCY: 7

                H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0                              # FREQUENCY: 17

                H:0x0 D:0x2 P:0x5 Possible sense data: 0x0 0x0 0x0                           # FREQUENCY: 18

 

                # The following LUNs were reported in combination with above SCSI error codes

                naa.60060e80105395f0056fc1ef000000c8                            # FREQUENCY: 18

                naa.60060e80105395f0056fc1ef000000c9                            # FREQUENCY: 42

                               # That translates into vmfs datastore names (only vmfs, not NFS or RDM)

                               naa.60060e80105395f0056fc1ef000000c8:1 DataStore_Prod01

                               naa.60060e80105395f0056fc1ef000000c9:1 DataStore_Test01

 

                # The SCSI error codes in the vmkernel logs have been observed during 6 different days

                2012-05-25         # FREQUENCY: 4

                2012-05-29         # FREQUENCY: 4

                2012-06-06         # FREQUENCY: 4

                2012-06-14         # FREQUENCY: 7

                2012-06-13         # FREQUENCY: 13

                2012-06-20         # FREQUENCY: 28

 

                # The following SCSI commands were reported to have failed

                6 times 0x4d = LOG SENSE

                17 times 0xc0

 

 

 

===> Per Lun/Per Error:

 

Host status   = these codes potentially come from the firmware on the host adapter
Device status = the SCSI status code in the status byte returned when processing of a command completes
Sense key     = the SCSI sense key in the sense data, available when a command completes with a CHECK CONDITION status
ASC + ASCQ    = Additional Sense Code + Additional Sense Code Qualifier

===============================       ===================================  ================================            ================================

naa.60060e80105395f0056fc1ef000000c8

Host: 0x00 DID_OK                      Device: 02h CHECK CONDITION          Sense: 0h NO SENSE                          ASC: 00h NO ADDITIONAL SENSE INFORMATION

Host: 0x00 DID_OK                      Device: 02h CHECK CONDITION          Sense: 2h NOT READY                         ASC: 3Ah MEDIUM NOT PRESENT - TRAY CLOSED

Host: 0x05 DID_ABORT                   Device: 00h GOOD                     Sense: 0h NO SENSE                          ASC: 00h NO ADDITIONAL SENSE INFORMATION

Host: 0x05 DID_ABORT                   Device: 00h GOOD                     Sense: 2h NOT READY                         ASC: 3Ah MEDIUM NOT PRESENT - TRAY CLOSED

Host: 0x05 DID_ABORT                   Device: 00h GOOD                     Sense: 5h ILLEGAL REQUEST                   ASC: 25h LOGICAL UNIT NOT SUPPORTED

naa.60060e80105395f0056fc1ef000000c9

Host: 0x00 DID_OK                      Device: 02h CHECK CONDITION          Sense: 0h NO SENSE                          ASC: 00h NO ADDITIONAL SENSE INFORMATION

Host: 0x00 DID_OK                      Device: 02h CHECK CONDITION          Sense: 2h NOT READY                         ASC: 3Ah MEDIUM NOT PRESENT - TRAY CLOSED

Host: 0x05 DID_ABORT                   Device: 00h GOOD                     Sense: 0h NO SENSE                          ASC: 00h NO ADDITIONAL SENSE INFORMATION

Host: 0x05 DID_ABORT                   Device: 00h GOOD                     Sense: 5h ILLEGAL REQUEST                   ASC: 24h INVALID FIELD IN CDB

Host: 0x05 DID_ABORT                   Device: 00h GOOD                     Sense: 5h ILLEGAL REQUEST                   ASC: 25h LOGICAL UNIT NOT SUPPORTED

 

Conclusion by VMware: it appears to be an issue with the storage, as both hosts are receiving CHECK CONDITION statuses and aborts from the device (the storage array).

 

Also, after changing the queue depth to 32, I am still getting latency warnings. What really drives me crazy is that I can't reproduce them: even under high I/O I never get a warning. It also looks as if they always appear between xx:15 and xx:16, but there is no scheduled job, neither on the storage array nor in the vSphere environment, that could cause such an impact.
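
If I want to catch the spike, I will probably run an esxtop batch capture across the xx:15 window and check DAVG/KAVG for the two naa devices afterwards; something like this (interval, sample count and output path are just example values):

# capture all counters every 10 seconds for roughly 20 minutes around the suspect window
esxtop -b -a -d 10 -n 120 > /vmfs/volumes/DataStore_Prod01/esxtop-esx001.csv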

 

HDS Support told me that there is no problem with the storage array. We did find an incorrectly configured cache size on the SAN, but even after correcting it we are still getting latency warnings, not all the time, but sporadically (once a day or so).

 

Any suggestion on how to solve this problem is very much appreciated.

 

If you have any questions regarding my configuration, don't hesitate to contact me.

 

Best regards,

 

Marc

