It is very easy to say, “Replace your hardware, it is not up to the mark.” In reality, it is very difficult to implement. It is really hard to convince an infrastructure team to replace any hardware on the grounds that it is not performing at its best.
I once had a nightmare dealing with an infrastructure team when I suggested that they replace their faulty hardware, because initially they would not accept that the hardware was at fault. It is really easy to say “Trust me, I am correct”, but it is equally important to back that statement up with some logical reasoning.
PAGEIOLATCH_XX is one of those wait stats that we would like to blame directly on the underlying I/O subsystem. Of course, most of the time that is correct – the underlying subsystem is usually the problem.
From Books Online:
PAGEIOLATCH_DT
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Destroy mode. Long waits may indicate problems with the disk subsystem.
PAGEIOLATCH_EX
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Exclusive mode. Long waits may indicate problems with the disk subsystem.
PAGEIOLATCH_KP
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Keep mode. Long waits may indicate problems with the disk subsystem.
PAGEIOLATCH_SH
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Shared mode. Long waits may indicate problems with the disk subsystem.
PAGEIOLATCH_UP
Occurs when a task is waiting on a latch for a buffer that is in an I/O request. The latch request is in Update mode. Long waits may indicate problems with the disk subsystem.
PAGEIOLATCH_XX Explanation:
Simply put, this particular wait type occurs when a task is waiting for a data page to be read from disk into the buffer cache.
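Before changing anything, it helps to confirm how much of the server’s accumulated wait time is actually spent on these latches. The query below is a minimal sketch against sys.dm_os_wait_stats; the numbers are cumulative since the last restart (or since the wait statistics were cleared), and any threshold you apply to them is your own benchmark, not a hard rule.

```sql
-- Cumulative PAGEIOLATCH waits since the last restart (or since the
-- wait statistics were last cleared). A large resource_wait_ms relative
-- to the rest of the server's waits points at the I/O path.
SELECT  wait_type,
        waiting_tasks_count,
        wait_time_ms,
        wait_time_ms - signal_wait_time_ms AS resource_wait_ms,
        max_wait_time_ms
FROM    sys.dm_os_wait_stats
WHERE   wait_type LIKE 'PAGEIOLATCH%'
ORDER BY wait_time_ms DESC;
```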
Reducing PAGEIOLATCH_XX wait:
Just like any other wait type, this is a very challenging and interesting subject to resolve. Here are a few things you can experiment with:
- Improve your IO subsystem speed (read the first paragraph of this article if you have not already; I repeat, it is far easier to suggest a step like this than to actually implement it).
- This wait type can also occur due to memory pressure or other memory issues. Putting aside the possibility of a faulty IO subsystem, this wait type warrants a proper analysis of the memory counters: if for any reason memory is under pressure and unable to receive the IO data, that situation can produce this kind of wait.
- Proper placement of files is very important. We should check the file system for the proper placement of files – LDF and MDF on separate drives, TempDB on a separate drive, hot-spot tables in a separate filegroup (and on a separate disk), etc.
- Check the file statistics and see if there are high IO read and IO write stalls (see SQL SERVER – Get File Statistics Using fn_virtualfilestats); a query sketch for this appears after this list.
- It is very possible that there are no proper indexes on the system and there are lots of table scans and heap scans. Creating proper indexes can reduce the IO bandwidth considerably. If SQL Server can use an appropriate covering index instead of the clustered index, it can significantly reduce CPU, memory, and IO usage (considering the covering index has far fewer columns than the clustered table, along with all the other “it depends” conditions). You can refer to the two article links below, previously written by me, that talk about how to optimize indexes; a missing-index query sketch also appears after this list.
- Updating statistics can help the Query Optimizer render an optimal plan, either directly or indirectly. I have seen that updating statistics with a full scan (again, if your database is huge and you cannot do this – never mind!) can provide optimal information to the SQL Server optimizer, leading to an efficient plan. A small example appears after this list.
- Checking Memory Related Perfmon Counters (a DMV-based way to read some of these counters is sketched after this list)
- SQLServer: Memory Manager\Memory Grants Pending (Consistent higher value than 0-2)
- SQLServer: Memory Manager\Memory Grants Outstanding (Consistent higher value, Benchmark)
- SQLServer: Buffer Manager\Buffer Cache Hit Ratio (Higher is better, greater than 90% for a usually smooth-running system)
- SQLServer: Buffer Manager\Page Life Expectancy (Consistent lower value than 300 seconds)
- Memory: Available Mbytes (Information only)
- Memory: Page Faults/sec (Benchmark only)
- Memory: Pages/sec (Benchmark only)
- Checking Disk Related Perfmon Counters
- Average Disk sec/Read (Consistent higher value than 4-8 millisecond is not good)
- Average Disk sec/Write (Consistent higher value than 4-8 millisecond is not good)
- Average Disk Read/Write Queue Length (Consistent higher value than benchmark is not good)
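Expanding on the file statistics point above, here is a minimal sketch of an I/O stall query using sys.dm_io_virtual_file_stats (the DMV counterpart of fn_virtualfilestats). The averages are cumulative since the instance started, so a short burst of slow I/O can hide inside a long-running average; treat the output as a starting point, not a verdict.

```sql
-- Average I/O stall per read and per write for every database file.
SELECT  DB_NAME(vfs.database_id)                              AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.num_of_writes,
        vfs.io_stall_read_ms,
        vfs.io_stall_write_ms,
        vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)   AS avg_read_stall_ms,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)  AS avg_write_stall_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
ORDER BY avg_read_stall_ms DESC;
```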
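Expanding on the indexing point above, the missing-index DMVs give a rough list of indexes the optimizer has asked for while compiling plans. This is only a sketch; the ranking expression is an informal heuristic, and every suggestion still needs human review (existing indexes, column order, write overhead) before you create anything.

```sql
-- Rough list of "missing" indexes requested by the optimizer,
-- ranked by an informal estimate of how much work they might save.
SELECT TOP (20)
        mid.statement                        AS full_object_name,
        mid.equality_columns,
        mid.inequality_columns,
        mid.included_columns,
        migs.user_seeks,
        migs.avg_total_user_cost * migs.avg_user_impact
            * (migs.user_seeks + migs.user_scans) AS rough_benefit
FROM    sys.dm_db_missing_index_details     AS mid
JOIN    sys.dm_db_missing_index_groups      AS mig
        ON mig.index_handle = mid.index_handle
JOIN    sys.dm_db_missing_index_group_stats AS migs
        ON migs.group_handle = mig.index_group_handle
ORDER BY rough_benefit DESC;
```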
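Expanding on the statistics point above, the commands themselves are simple; the cost is the scan. A minimal sketch, where dbo.Orders is a hypothetical table name used only for illustration:

```sql
-- Update statistics for one table with a full scan (most accurate,
-- most expensive). On very large tables a sampled update may be the
-- only practical option.
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;

-- Or refresh statistics database-wide using the default sampling.
EXEC sp_updatestats;
```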
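Expanding on the memory counter point above, several of those counters are also exposed through sys.dm_os_performance_counters, so you can spot-check them without opening Performance Monitor. A minimal sketch; on a named instance the object names start with MSSQL$InstanceName rather than SQLServer, so adjust the filter for your environment.

```sql
-- Spot-check a few memory-related counters from inside SQL Server.
-- Note: Buffer cache hit ratio must be divided by its "base" counter
-- to produce a percentage.
SELECT  object_name,
        counter_name,
        cntr_value
FROM    sys.dm_os_performance_counters
WHERE   counter_name IN (N'Page life expectancy',
                         N'Memory Grants Pending',
                         N'Memory Grants Outstanding',
                         N'Buffer cache hit ratio',
                         N'Buffer cache hit ratio base')
ORDER BY counter_name;
```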
Read all the posts in the Wait Types and Queues series.
Note: The information presented here is based on my experience, and I make no claim that it is completely accurate. I suggest reading Books Online for further clarification. All of the discussion of wait stats in this blog is generic and varies from system to system. It is recommended that you test this on a development server before implementing it on a production server.
Reference: Pinal Dave (https://blog.sqlauthority.com)
10 Comments.
I’ve found this article good when trying to determine if there are I/O problems with the server, be it SQL Server or any other server.
https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2005/administrator/cc966540(v=technet.10)
If you scroll down to where it says I/O Bottlenecks you’ll see a list of 6 items. I think that’s a good starting point. Fire up Performance Monitor and monitor your disks while SQL Server is under load.
“If your disk queue length frequently exceeds a value of 2 during peak usage of SQL Server, then you might have an I/O bottleneck.”
One of our servers had a queue length of 10-18 during peak usage. That was enough for me to announce that we are having I/O problems ;)
-Marko
I would add, to the point about proper placement of files, the factor of file and page contention. In my practice I have seen, for example, tempdb under heavy file contention making SQL Server unresponsive and slow. So I would extend that point: also consider splitting a database into multiple files (this is especially valid for tempdb, where there is an official recommendation from Microsoft to do so).
How would you split MDF, LDF, or tables onto separate disk(s) in a SAN environment?
Tip: Disk I/O and memory are the key indicators behind these performance issues, and the wait types will highlight the cause, so keeping a close watch on these factors will help reduce the problem before you jump into query tuning.
I was having problems and this article was very helpful.
What do you mean by “Consistent higher value, Benchmark” under your second point under the memory-related counters? How do I know what the benchmark is?
SQLServer: Memory Manager\Memory Grants Outstanding (Consistent higher value, Benchmark)
Thank you!
TimB
Hi Pinal,
I use a third-party backup tool to take compressed backups, and the backups are transferred to a backup server through FTP tasks scheduled in Windows Task Scheduler. When monitoring Memory: Page Faults/sec in perfmon, I have observed that the counter value is near 80-100 when a backup occurs and is consistently around 100 during the transfer. Normally the value fluctuates between 40-100.
My server is configured with a 49136 MB page file, 32 GB RAM, and 8 logical CPUs.
CPU utilization is near 75-95%. Any suggestions to reduce the Page Faults/sec count would be very helpful.
I have also found the PAGEIOLATCH_SH wait type in my long-running queries. I checked perfmon and found that the Avg. Disk Queue Length counter’s average value is 10-14. Is that because of I/O bottlenecks? If yes, then which hardware needs to be upgraded?
What I liked most about this post is the start – it’s the absolute truth, and nothing but the truth!
We use virtualization in our development & QA environments. Once, we restored a database of around 50 GB, and our servers simply stopped processing our queries. The nightly jobs ran for 3 days and the application ended up in timeouts. We looked at the wait stats and came across not one, but two PAGEIOLATCH stats – PAGEIOLATCH_EX and PAGEIOLATCH_SH.
We informed our IT, and instead of looking at our data, they increased the processor and memory allocations. Their reasoning – they have the best-of-the-line hardware, and no application should need more!
When that did not fix it and we escalated matters, they deployed complex monitoring, and guess what? – we got a brand new IO subsystem for our Virtual Machines within 2 weeks!
Dave, it seems almost every SQL search I type into Google brings me to you. Thank you for all your sharing. It has helped me countless times.
I thought >15-20 millisecond is not good!!!!!!!!!!!!
Checking Disk Related Perfmon Counters
Average Disk sec/Read (Consistent higher value than 4-8 millisecond is not good)
Average Disk sec/Write (Consistent higher value than 4-8 millisecond is not good)