This is one of the most commonly asked questions via blog comments and emails sent across to me. Here is the sample error message which has been reported in SQL Server ERRORLOG.
SQL Server has encountered 201 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [T:\MSSQL\DATA\tempdb.mdf] in database id tempdb . The OS file handle is 0x0000000000000770. The offset of the latest long I/O is: 0x0000000008c000
If you are not a SQL Expert and don’t know what ERRORLOG is, no need to worry. Here is the blog to check location of SQL Server ERRORLOG
Coming back to the error message, the text which is highlighted in bold would vary based on the problem. The sample error message is for tempdb database which has database ID = 2. Path of the MDF file is T drive in the sample message. Some DBA also refer this as a stalled IO. In the sample test we are seeing 201 occurrences but I have seen value going till 5 digits also.
So, what’s the cause of the error message? Is there any problem with SQL Server? Most of the time answer is NO because SQL is reporting warning in Errorlog when it is not able to complete the IO in even in 15 seconds. If you have read any blog or book covering internals of SQL Server, you might know that SQL Server does asynchronous IO, using the win32 APIs like ReadFile, WriteFile, ReadFileScatter and WriteFileGather. One IO is posted, the thread would return and do meaningful work rather than waiting for IO to complete. The threads would keep checking the pending IO and report if they have taken longer.
Here are the thing which a DBA should do before going to hardware team.
1. Note the pattern of the message. If they are logged at any specific time? Check if there is any scheduled maintenance job is running at the same time? Rebuild index and other IO intensive tasks would perform a lot of IO and disk may report slowness at that time.
2. Check Event log (system and application) and check if there are any disk related error or warning messages.
3. Keep monitoring sysprocesses and check for waits which are happening for SPIDs. If the disk is slow all the times then you should see waits of PAGEIOLATCH, WRITELOG or LOGBUFFER with high wait time. Refer my earlier blogs.
4. Capture disk related data using peformance monitor (start > run > perfmon) for extended amount of time and have a look at counters. You should look at the performance counters related to storage performance to evaluate whether the numbers you are seeing are expected and within reasonable bounds.
Avg Disk Sec/Transfer: This counter is available under Logical Disk and Physical Disk object. The value of this counter should be less than 0.020 (=20 ms). Same counter is available for Read and Write also and can be used to find performance difference between reads and writes.
Disk Bytes/sec: SAN’s generally start from 200-250 MB/s these days (note that the counter value is in bytes so we need to divide that by 1024). Same counter is available for read and write also.
Disk Transfers/sec: this counter represents Number of read and write performance also known as IOPS (I/O per second). This can be used to compare against capacity of storage subsystem. Read and write counter also available.
5. Make sure latest Windows patch has been applied to avoid any known issues with windows itself.
Once basics have been verified and we have confirmed that we are seeing symptoms of slow response from disk, hardware team need to be engaged for further investigation.
Reference: Pinal Dave (http://blog.sqlauthority.com)