SQL SERVER – 2011 – Wait Type – Day 25 of 28

Since the beginning of the series, I have been getting the following question again and again:

“What are the changes in SQL Server 2011 – Denali with respect to Wait Types?”

SQL Server 2011 – Denali is yet to be released, and making statements on the subject will be inappropriate. Denali CTP1 has been released so I suggest that all of you download the same and experiment on it. I quickly compared the wait stats of SQL Server 2008 R2 and Denali (CTP1) and found the following changes:

Wait Types Exists in SQL Server 2008 R2 and Not Exists in SQL Server 2011 “Denali”


Wait Types Exists in SQL Server 2011 and Not Exists in SQL Server 2008


We already know that Wait Types and Wait Stats are going to be the next big thing in the next version of SQL Server. So now I am eagerly waiting to dig deeper in the wait stats.

Read all the post in the Wait Types and Queue series.

Note: The information presented here is from my experience and there is no way that I claim it to be accurate. I suggest reading Book OnLine for further clarification. All the discussion of Wait Stats in this blog is generic and varies from system to system. It is recommended that you test this on a development server before implementing it to a production server.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – 2000 – DBCC SQLPERF(waitstats) – Wait Type – Day 24 of 28

I have received many comments, email, suggestions and motivations for my current series of wait types and wait statistics. One of the questions which I keep on receiving almost every other day is whether all of the discussions I have presented so far are also applicable to SQL Server 2000. Additionally, I receive another question asking me if wait statistics matters in SQL Server 2000. If it is, then the asker wants to know how to measure wait types for SQL Server 2000.

In SQL Server, you can run the following command to get a list of all the wait types:

DBCC SQLPERF(waitstats)

The query above will work in SQL Server 2005/2008/R2  because of backup compatibility. As you might have noticed, I have been discussing everything keeping SQL Server 2005+ in mind, but I have given little consideration on SQL Server 2000. However, I am pretty sure that most of the suggestions I have provided are applicable to SQL Server 2000. The wait types I have been discussing mostly exist in SQL Server 2000 as well. But the difference of the 2000 version is that it gets late recent releases, but it is worth it.

Wait types are very essential to measure performance bottleneck. Because of this, I do not have to state that I am big fan of them just so I could identify performance bottleneck.

Please read all the post in the Wait Types and Queue series.

Note: The information presented here is from my experience and there is no way that I claim it to be accurate. I suggest reading Book OnLine for further clarification. All the discussion of Wait Stats in this blog is generic and varies from system to system. It is recommended that you test this on a development server before implementing it to a production server.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – OLEDB – Link Server – Wait Type – Day 23 of 28

When I decided to start writing about this wait type, the very first question that came to my mind was, “What does ‘OLEDB’ stand for?” A quick search on Wikipedia tells me that OLEDB means Object Linking and Embedding Database. (How many of you knew this?) Anyway, I found it very interesting that this wait type was in one of the top 10 wait types in many of the systems I have come across in my performance tuning experience.

Books On-Line:

OLEDB occurs when SQL Server calls the SQL Server Native Client OLE DB Provider. This wait type is not used for synchronization. Instead, it indicates the duration of calls to the OLE DB provider.

OLEDB Explanation:

This wait type primarily happens when Link Server or Remove Query has been executed. The most common case wherein this wait type is visible is during the execution of Linked Server. When SQL Server is retrieving data from the remote server, it uses OLEDB API to retrieve the data. It is possible that the remote system is not quick enough or the connection between them is not fast enough, leading SQL Server to wait for the result’s return from the remote (or external) server. This is the time OLEDB wait type occurs.

Reducing OLEDB wait:

  • Check the Link Server configuration.
  • Checking Disk-Related Perfmon Counters
    • Average Disk sec/Read (Consistent higher value than 4-8 millisecond is not good)
    • Average Disk sec/Write (Consistent higher value than 4-8 millisecond is not good)
    • Average Disk Read/Write Queue Length (Consistent higher value than benchmark is not good)

At this point in time, I am not able to think of any more ways on reducing this wait type. Do you have any opinion about this subject? Please share it here and I will share your comment with the rest of the Community, and of course, with due credit unto you.

Please read all the post in the Wait Types and Queue series.

Note: The information presented here is from my experience and there is no way that I claim it to be accurate. I suggest reading Book OnLine for further clarification. All the discussion of Wait Stats in this blog is generic and varies from system to system. It is recommended that you test this on a development server before implementing it to a production server.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – Guest Post – Jacob Sebastian – Filestream – Wait Types – Wait Queues – Day 22 of 28

Jacob Sebastian is a SQL Server MVP, Author, Speaker and Trainer. Jacob is one of the top rated expert community. Jacob wrote the book The Art of XSD – SQL Server XML Schema Collections and wrote the XML Chapter in SQL Server 2008 Bible. See his Blog | Profile. He is currently researching on the subject of Filestream and have submitted this interesting article on the very subject.


FILESTREAM is a new feature introduced in SQL Server 2008 which provides an efficient storage and management option for BLOB data.

Many applications that deal with BLOB data today stores them in the file system and stores the path to the file in the relational tables. Storing BLOB data in the file system is more efficient that storing them in the database. However, this brings up a few disadvantages as well. When the BLOB data is stored in the file system, it is hard to ensure transactional consistency between the file system data and relational data.

Some applications store the BLOB data within the database to overcome the limitations mentioned earlier. This approach ensures transactional consistency between the relational data and BLOB data, but is very bad in terms of performance.

FILESTREAM combines the benefits of both approaches mentioned above without the disadvantages we examined. FILESTREAM stores the BLOB data in the file system (thus takes advantage of the IO Streaming capabilities of NTFS) and ensures transactional consistency between the BLOB data in the file system and the relational data in the database.

For more information on the FILESTREAM feature, visit: http://beyondrelational.com/filestream/default.aspx


Since this series is on the different SQL Server wait types, let us take a look at the various wait types that are related to the FILESTREAM feature.


This wait type is generated by FILESTREAM Garbage Collector. This occurs when Garbage collection is disabled prior to a backup/restore operation or when a garbage collection cycle is being executed.


This wait type occurs when during the cleanup process of a garbage collection cycle. It indicates that that garbage collector is waiting for the cleanup tasks to be completed.


This wait type indicates that the process is waiting for obtaining access to the FILESTREAM header file for read or write operation. The FILESTREAM header is a disk file located in the FILESTREAM data container and is named “filestream.hdr”.


This wait type indicates that the process is trying to perform a FILESTREAM log truncation related operation. It can be either a log truncate operation or to disable log truncation prior to a backup or restore operation.


This wait type occurs when a FILESTREAM file I/O operation needs to bind to the associated transaction, but the transaction is currently owned by another session.


This wait type occurs when a FILESTREAM file I/O operation is waiting for a FILESTREAM agent resource that is being used by another file I/O operation.


This wait type occurs when there is a wait for another FILESTREAM feature reconfiguration to be completed.


This wait type occurs when there is a wait to serialize access to the FILESTREAM configuration parameters.

Waits and Performance

System waits has got a direct relationship with the overall performance. In most cases, when waits increase the performance degrades. SQL Server documentation does not say much about how we can reduce these waits. However, following the FILESTREAM best practices will help you to improve the overall performance and reduce the wait types to a good extend.

Read all the post in the Wait Types and Queue series.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – Guest Posts – Feodor Georgiev – The Context of Our Database Environment – Going Beyond the Internal SQL Server Waits – Wait Type – Day 21 of 28

This guest post is submitted by Feodor. Feodor Georgiev is a SQL Server database specialist with extensive experience of thinking both within and outside the box. He has wide experience of different systems and solutions in the fields of architecture, scalability, performance, etc. Feodor has experience with SQL Server 2000 and later versions, and is certified in SQL Server 2008.

In this article Feodor explains the server-client-server process, and concentrated on the mutual waits between client and SQL Server. This is essential in grasping the concept of waits in a ‘global’ application plan.

Recently I was asked to write a blog post about the wait statistics in SQL Server and since I had been thinking about writing it for quite some time now, here it is.

It is a wide-spread idea that the wait statistics in SQL Server will tell you everything about your performance. Well, almost. Or should I say – barely. The reason for this is that SQL Server is always a part of a bigger system – there are always other players in the game: whether it is a client application, web service, any other kind of data import/export process and so on.

In short, the SQL Server surroundings look like this:

This means that SQL Server, aside from its internal waits, also depends on external waits and settings.

As we can see in the picture above, SQL Server needs to have an interface in order to communicate with the surrounding clients over the network. For this communication, SQL Server uses protocol interfaces. I will not go into detail about which protocols are best, but you can read this article. Also, review the information about the TDS (Tabular data stream).

As we all know, our system is only as fast as its slowest component. This means that when we look at our environment as a whole, the SQL Server might be a victim of external pressure, no matter how well we have tuned our database server performance.

Let’s dive into an example: let’s say that we have a web server, hosting a web application which is using data from our SQL Server, hosted on another server. The network card of the web server for some reason is malfunctioning (think of a hardware failure, driver failure, or just improper setup) and does not send/receive data faster than 10Mbs. On the other end, our SQL Server will not be able to send/receive data at a faster rate either. This means that the application users will notify the support team and will say: “My data is coming very slow.”

Now, let’s move on to a bit more exciting example: imagine that there is a similar setup as the example above – one web server and one database server, and the application is not using any stored procedure calls, but instead for every user request the application is sending 80kb query over the network to the SQL Server. (I really thought this does not happen in real life until I saw it one day.) So, what happens in this case? To make things worse, let’s say that the 80kb query text is submitted from the application to the SQL Server at least 100 times per minute, and as often as 300 times per minute in peak times.

Here is what happens: in order for this query to reach the SQL Server, it will have to be broken into a of number network packets (according to the packet size settings) – and will travel over the network. On the other side, our SQL Server network card will receive the packets, will pass them to our network layer, the packets will get assembled, and eventually SQL Server will start processing the query – parsing, allegorizing, generating the query execution plan and so on. So far, we have already had a serious network overhead by waiting for the packets to reach our Database Engine. There will certainly be some processing overhead – until the database engine deals with the 80kb query and its 20 subqueries. The waits you see in the DMVs are actually collected from the point the query reaches the SQL Server and the packets are assembled.

Let’s say that our query is processed and it finally returns 15000 rows. These rows have a certain size as well, depending on the data types returned. This means that the data will have converted to packages (depending on the network size package settings) and will have to reach the application server. There will also be waits, however, this time you will be able to see a wait type in the DMVs called ASYNC_NETWORK_IO. What this wait type indicates is that the client is not consuming the data fast enough and the network buffers are filling up.

Recently Pinal Dave posted a blog on Client Statistics. What Client Statistics does is captures the physical flow characteristics of the query between the client(Management Studio, in this case) and the server and back to the client.

As you see in the image, there are three categories: Query Profile Statistics, Network Statistics and Time Statistics.

Number of server roundtrips–a roundtrip consists of a request sent to the server and a reply from the server to the client. For example, if your query has three select statements, and they are separated by ‘GO’ command, then there will be three different roundtrips.

TDS Packets sent from the client – TDS (tabular data stream) is the language which SQL Server speaks, and in order for applications to communicate with SQL Server, they need to pack the requests in TDS packets. TDS Packets sent from the client is the number of packets sent from the client; in case the request is large, then it may need more buffers, and eventually might even need more server roundtrips.

TDS packets received from server –is the TDS packets sent by the server to the client during the query execution.

Bytes sent from client – is the volume of the data set to our SQL Server, measured in bytes; i.e. how big of a query we have sent to the SQL Server. This is why it is best to use stored procedures, since the reusable code (which already exists as an object in the SQL Server) will only be called as a name of procedure + parameters, and this will minimize the network pressure.

Bytes received from server – is the amount of data the SQL Server has sent to the client, measured in bytes. Depending on the number of rows and the datatypes involved, this number will vary. But still, think about the network load when you request data from SQL Server.

Client processing time – is the amount of time spent in milliseconds between the first received response packet and the last received response packet by the client.

Wait time on server replies – is the time in milliseconds between the last request packet which left the client and the first response packet which came back from the server to the client.

Total execution time – is the sum of client processing time and wait time on server replies (the SQL Server internal processing time)

Here is an illustration of the Client-server communication model which should help you understand the mutual waits in a client-server environment.

Keep in mind that a query with a large ‘wait time on server replies’ means the server took a long time to produce the very first row. This is usual on queries that have operators that need the entire sub-query to evaluate before they proceed (for example, sort and top operators).

However, a query with a very short ‘wait time on server replies’ means that the query was able to return the first row fast. However a long ‘client processing time’ does not necessarily imply the client spent a lot of time processing and the server was blocked waiting on the client. It can simply mean that the server continued to return rows from the result and this is how long it took until the very last row was returned.

The bottom line is that developers and DBAs should work together and think carefully of the resource utilization in the client-server environment. From experience I can say that so far I have seen only cases when the application developers and the Database developers are on their own and do not ask questions about the other party’s world.

I would recommend using the Client Statistics tool during new development to track the performance of the queries, and also to find a synchronous way of utilizing resources between the client – server – client.

Here is another example: think about similar setup as above, but add another server to the game. Let’s say that we keep our media on a separate server, and together with the data from our SQL Server we need to display some images on the webpage requested by our user. No matter how simple or complicated the logic to get the images is, if the images are 500kb each our users will get the page slowly and they will still think that there is something wrong with our data.

Anyway, I don’t mean to get carried away too far from SQL Server. Instead, what I would like to say is that DBAs should also be aware of ‘the big picture’. I wrote a blog post a while back on this topic, and if you are interested, you can read it here about the big picture.

And finally, here are some guidelines for monitoring the network performance and improving it:

  • Run a trace and outline all queries that return more than 1000 rows (in Profiler you can actually filter and sort the captured trace by number of returned rows). This is not a set number; it is more of a guideline. The general thought is that no application user can consume that many rows at once. Ask yourself and your fellow-developers: ‘why?’.
  • Monitor your network counters in Perfmon: Network Interface:Output queue length, Redirector:Network errors/sec, TCPv4: Segments retransmitted/sec and so on.
  • Make sure to establish a good friendship with your network administrator (buy them coffee, for example J ) and get into a conversation about the network settings. Have them explain to you how the network cards are setup – are they standalone, are they ‘teamed’, what are the settings – full duplex and so on.
  • Find some time to read a bit about networking.

In this short blog post I hope I have turned your attention to ‘the big picture’ and the fact that there are other factors affecting our SQL Server, aside from its internal workings. As a further reading I would still highly recommend the Wait Stats series on this blog, also I would recommend you have the coffee break conversation with your network admin as soon as possible.

This guest post is written by Feodor Georgiev.

Read all the post in the Wait Types and Queue series.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – MSQL_XP – Wait Type – Day 20 of 28

In this blog post, I am going to discuss something from my field experience. While consultation, I have seen various wait typed, but one of my customers who has been using SQL Server for all his operations had an interesting issue with a particular wait type. Our customer had more than 100+ SQL Server instances running and the whole server had MSSQL_XP wait type as the most number of wait types. While running sp_who2 and other diagnosis queries, I could not immediately figure out what the issue was because the query with that kind of wait type was nowhere to be found. After a day of research, I was relieved that the solution was very easy to figure out. Let us continue discussing this wait type.

From Book On-Line:

MSQL_XP occurs when a task is waiting for an extended stored procedure to end. SQL Server uses this wait state to detect potential MARS application deadlocks. The wait stops when the extended stored procedure call ends.

MSQL_XP Explanation:

This wait type is created because of the extended stored procedure. Extended Stored Procedures are executed within SQL Server; however, SQL Server has no control over them. Unless you know what the code for the extended stored procedure is and what it is doing, it is impossible to understand why this wait type is coming up.

Reducing MSQL_XP wait:

As discussed, it is hard to understand the Extended Stored Procedure if the code for it is not available. In the scenario described at the beginning of this post, our client was using third-party backup tool. The third-party backup tool was using Extended Stored Procedure. After we learned that this wait type was coming from the extended stored procedure of the backup tool they were using, we contacted the tech team of its vendor. The vendor admitted that the code was not optimal at some places, and within that day they had provided the patch. Once the updated version was installed, the issue on this wait type disappeared. As viewed in the wait statistics of all the 100+ SQL Server, there was no more MSSQL_XP wait type found.

In simpler terms, you must first identify which Extended Stored Procedure is creating the wait type of MSSQL_XP and see if you can get in touch with the creator of the SP so you can help them optimize the code.

If you have encountered this MSSQL_XP wait type, I encourage all of you to write how you managed it. Please do not mention the name of the vendor in your comment as I will not approve it. The focus of this blog post is to understand the wait types; not talk about others.

Read all the post in the Wait Types and Queue series.

Note: The information presented here is from my experience and there is no way that I claim it to be accurate. I suggest reading Book OnLine for further clarification. All the discussion of Wait Stats in this blog is generic and varies from system to system. It is recommended that you test this on a development server before implementing it to a production server.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – PREEMPTIVE and Non-PREEMPTIVE – Wait Type – Day 19 of 28

In this blog post, we are going to talk about a very interesting subject. I often get questions related to SQL Server 2008 Book-Online about various Preemptive wait types. I got a few questions asking what these wait types are and how they could be interpreted. To get current wait types of the system, you can read this article and run the script: SQL SERVER – DMV – sys.dm_os_waiting_tasks and sys.dm_exec_requests – Wait Type – Day 4 of 28.

Before we continue understanding them, let us study first what PREEMPTIVE and Non-PREEMPTIVE waits in SQL Server mean.

PREEMPTIVE: Simply put, this wait means non-cooperative. While SQL Server is executing a task, the Operating System (OS) interrupts it. This leads to SQL Server to involuntarily give up the execution for other higher priority tasks. This is not good for SQL Server as it is a particular external process which makes SQL Server to yield. This kind of wait can reduce the performance drastically and needs to be investigated properly.

Non-PREEMPTIVE: In simple terms, this wait means cooperative. SQL Server manages the scheduling of the threads. When SQL Server manages the scheduling instead of the OS, it makes sure its own priority. In this case, SQL Server decides the priority and one thread yields to another thread voluntarily.

In the earlier version of SQL Server, there was no preemptive wait types mentioned and the associated task status with them was marked as suspended. In SQL Server 2005, preemptive wait types were not listed as well, but their associated task status was marked as running. In SQL Server 2008, preemptive wait types are properly listed and their associated task status is also marked as running.

Now, SQL Server is in Non-Preemptive mode by default and it works fine. When CLR, extended Stored Procedures and other external components run, they run in Preemptive mode, leading to the creation of these wait types.

There are a wide variety of preemptive wait types. If you see consistent high value in the Preemptive wait types, I strongly suggest that you look into the wait type and try to know the root cause.

If you are still not sure, you can send me an email or leave a comment about it and I will do my best to help you reduce this wait type.

Read all the post in the Wait Types and Queue series.

Reference: Pinal Dave (http://blog.sqlauthority.com)

SQL SERVER – LOGBUFFER – Wait Type – Day 18 of 28

At first, I was not planning to write about this wait type. The reason was simple- I have faced this only once in my lifetime so far maybe because it is one of the top 5 wait types. I am not sure if it is a common wait type or not, but in the samples I had it really looks rare to me.

From Book On-Line:

Occurs when a task is waiting for space in the log buffer to store a log record. Consistently high values may indicate that the log devices cannot keep up with the amount of log being generated by the server.

LOGBUFFER Explanation:

The book online definition of the LOGBUFFER seems to be very accurate. On the system where I faced this wait type, the log file (LDF) was put on the local disk, and the data files (MDF, NDF) were put on SanDrives. My client then was not familiar about how the file distribution was supposed to be. Once we moved the LDF to a faster drive, this wait type disappeared.

Reducing LOGBUFFER wait:

There are several suggestions to reduce this wait stats:

  • Move Transaction Log to Separate Disk from mdf and other files. (Make sure your drive where your LDF is has no IO bottleneck issues).
  • Avoid cursor-like coding methodology and frequent commit statements.
  • Find the most-active file based on IO stall time, as shown in the script written over here.
  • You can also use fn_virtualfilestats to find IO-related issues using the script mentioned over here.
  • Check the IO-related counters (PhysicalDisk:Avg.Disk Queue Length, PhysicalDisk:Disk Read Bytes/sec and PhysicalDisk :Disk Write Bytes/sec) for additional details. Read about them over here.

If you have noticed, my suggestions for reducing the LOGBUFFER is very similar to WRITELOG. Although the procedures on reducing them are alike, I am not suggesting that LOGBUFFER and WRITELOG are same wait types. From the definition of the two, you will find their difference. However, they are both related to LOG and both of them can severely degrade the performance.

Note: The information presented here is from my experience and there is no way that I claim it to be accurate. I suggest reading Book OnLine for further clarification. All the discussion of Wait Stats in this blog is generic and varies from system to system. It is recommended that you test this on a development server before implementing it to a production server.

Reference: Pinal Dave (http://blog.SQLAuthority.com)


SQL SERVER – WRITELOG – Wait Type – Day 17 of 28

WRITELOG is one of the most interesting wait types. So far we have seen a lot of different wait types, but this log type is associated with log file which makes it interesting to deal with.

From Book On-Line:

Occurs while waiting for a log flush to complete. Common operations that cause log flushes are checkpoints and transaction commits.

WRITELOG Explanation:

This wait type is usually seen in the heavy transactional database. When data is modified, it is written both on the log cache and buffer cache. This wait type occurs when data in the log cache is flushing to the disk. During this time, the session has to wait due to WRITELOG.

I have recently seen this wait type’s persistence at my client’s place, where one of the long-running transactions was stopped by the user causing it to roll back. In the future, I will see if I could re-create this situation once again on my machine to validate the relation.

Reducing WRITELOG wait:

There are several suggestions to reduce this wait stats:

  • Move Transaction Log to Separate Disk from mdf and other files.
  • Avoid cursor-like coding methodology and frequent committing of statements.
  • Find the most active file based on IO stall time based on the script written over here.
  • You can also use fn_virtualfilestats to find IO-related issues using the script mentioned over here.
  • Check the IO-related counters (PhysicalDisk:Avg.Disk Queue Length, PhysicalDisk:Disk Read Bytes/sec and PhysicalDisk :Disk Write Bytes/sec) for additional details. Read about them over here.

There are two excellent resources by Paul Randal, I suggest you understand the subject from those videos. The links to videos are here and here.

Read all the post in the Wait Types and Queue series.

Note: The information presented here is from my experience and there is no way that I claim it to be accurate. I suggest reading Book OnLine for further clarification. All the discussion of Wait Stats in this blog is generic and varies from system to system. It is recommended that you test this on a development server before implementing it to a production server.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – Guest Post – Jonathan Kehayias – Wait Type – Day 16 of 28

Jonathan Kehayias (Blog | Twitter) is a MCITP Database Administrator and Developer, who got started in SQL Server in 2004 as a database developer and report writer in the natural gas industry. After spending two and a half years working in TSQL, in late 2006, he transitioned to the role of SQL Database Administrator. His primary passion is performance tuning, where he frequently rewrites queries for better performance and performs in depth analysis of index implementation and usage. Jonathan blogs regularly on SQLBlog, and was a coauthor of Professional SQL Server 2008 Internals and Troubleshooting.

On a personal note, I think Jonathan is extremely positive person. In every conversation with him I have found that he is always eager to help and encourage. Every time he finds something needs to be approved, he has contacted me without hesitation and guided me to improve, change and learn. During all the time, he has not lost his focus to help larger community. I am honored that he has accepted to provide his views on complex subject of Wait Types and Queues. Currently I am reading his series on Extended Events.

Here is the guest blog post by Jonathan:

SQL Server troubleshooting is all about correlating related pieces of information together to indentify where exactly the root cause of a problem lies. In my daily work as a DBA, I generally get phone calls like, “So and so application is slow, what’s wrong with the SQL Server.” One of the funny things about the letters DBA is that they go so well with Default Blame Acceptor, and I really wish that I knew exactly who the first person was that pointed that out to me, because it really fits at times. A lot of times when I get this call, the problem isn’t related to SQL Server at all, but every now and then in my initial quick checks, something pops up that makes me start looking at things further.

The SQL Server is slow, we see a number of tasks waiting on ASYNC_IO_COMPLETION, IO_COMPLETION, or PAGEIOLATCH_* waits in sys.dm_exec_requests and sys.dm_exec_waiting_tasks. These are also some of the highest wait types in sys.dm_os_wait_stats for the server, so it would appear that we have a disk I/O bottleneck on the machine. A quick check of sys.dm_io_virtual_file_stats() and tempdb shows a high write stall rate, while our user databases show high read stall rates on the data files. A quick check of some performance counters and Page Life Expectancy on the server is bouncing up and down in the 50-150 range, the Free Page counter consistently hits zero, and the Free List Stalls/sec counter keeps jumping over 10, but Buffer Cache Hit Ratio is 98-99%. Where exactly is the problem?

In this case, which happens to be based on a real scenario I faced a few years back, the problem may not be a disk bottleneck at all; it may very well be a memory pressure issue on the server. A quick check of the system spec’s and it is a dual duo core server with 8GB RAM running SQL Server 2005 SP1 x64 on Windows Server 2003 R2 x64. Max Server memory is configured at 6GB and we think that this should be enough to handle the workload; or is it? This is a unique scenario because there are a couple of things happening inside of this system, and they all relate to what the root cause of the performance problem is on the system.

If we were to query sys.dm_exec_query_stats for the TOP 10 queries, by max_physical_reads, max_logical_reads, and max_worker_time, we may be able to find some queries that were using excessive I/O and possibly CPU against the system in their worst single execution. We can also CROSS APPLY to sys.dm_exec_sql_text() and see the statement text, and also CROSS APPLY sys.dm_exec_query_plan() to get the execution plan stored in cache. Ok, quick check, the plans are pretty big, I see some large index seeks, that estimate 2.8GB of data movement between operators, but everything looks like it is optimized the best it can be. Nothing really stands out in the code, and the indexing looks correct, and I should have enough memory to handle this in cache, so it must be a disk I/O problem right?

Not exactly!

If we were to look at how much memory the plan cache is taking by querying sys.dm_os_memory_clerks for the CACHESTORE_SQLCP and CACHESTORE_OBJCP clerks we might be surprised at what we find. In SQL Server 2005 RTM and SP1, the plan cache was allowed to take up to 75% of the memory under 8GB. I’ll give you a second to go back and read that again. Yes, you read it correctly, it says 75% of the memory under 8GB, but you don’t have to take my word for it, you can validate this by reading Changes in Caching Behavior between SQL Server 2000, SQL Server 2005 RTM and SQL Server 2005 SP2. In this scenario the application uses an entirely adhoc workload against SQL Server and this leads to plan cache bloat, and up to 4.5GB of our 6GB of memory for SQL can be consumed by the plan cache in SQL Server 2005 SP1. This in turn reduces the size of the buffer cache to just 1.5GB, causing our 2.8GB of data movement in this expensive plan to cause complete flushing of the buffer cache, not just once initially, but then another time during the queries execution, resulting in excessive physical I/O from disk. Keep in mind that this is not the only query executing at the time this occurs. Remember the output of sys.dm_io_virtual_file_stats() showed high read stalls on the data files for our user databases versus higher write stalls for tempdb? The memory pressure is also forcing heavier use of tempdb to handle sorting and hashing in the environment as well.

The real clue here is the Memory counters for the instance; Page Life Expectancy, Free List Pages, and Free List Stalls/sec. The fact that Page Life Expectancy is fluctuating between 50 and 150 constantly is a sign that the buffer cache is experiencing constant churn of data, once every minute to two and a half minutes. If you add to the Page Life Expectancy counter, the consistent bottoming out of Free List Pages along with Free List Stalls/sec consistently spiking over 10, and you have the perfect memory pressure scenario. All of sudden it may not be that our disk subsystem is the problem, but is instead an innocent bystander and victim.

Side Note: The Page Life Expectancy counter dropping briefly and then returning to normal operating values intermittently is not necessarily a sign that the server is under memory pressure. The Books Online and a number of other references will tell you that this counter should remain on average above 300 which is the time in seconds a page will remain in cache before being flushed or aged out. This number, which equates to just five minutes, is incredibly low for modern systems and most published documents pre-date the predominance of 64 bit computing and easy availability to larger amounts of memory in SQL Servers. As food for thought, consider that my personal laptop has more memory in it than most SQL Servers did at the time those numbers were posted. I would argue that today, a system churning the buffer cache every five minutes is in need of some serious tuning or a hardware upgrade.

Back to our problem and its investigation:

There are two things really wrong with this server; first the plan cache is excessively consuming memory and bloated in size and we need to look at that and second we need to evaluate upgrading the memory to accommodate the workload being performed. In the case of the server I was working on there were a lot of single use plans found in sys.dm_exec_cached_plans (where usecounts=1). Single use plans waste space in the plan cache, especially when they are adhoc plans for statements that had concatenated filter criteria that is not likely to reoccur with any frequency.  SQL Server 2005 doesn’t natively have a way to evict a single plan from cache like SQL Server 2008 does, but MVP Kalen Delaney, showed a hack to evict a single plan by creating a plan guide for the statement and then dropping that plan guide in her blog post Geek City: Clearing a Single Plan from Cache. We could put that hack in place in a job to automate cleaning out all the single use plans periodically, minimizing the size of the plan cache, but a better solution would be to fix the application so that it uses proper parameterized calls to the database. You didn’t write the app, and you can’t change its design? Ok, well you could try to force parameterization to occur by creating and keeping plan guides in place, or we can try forcing parameterization at the database level by using ALTER DATABASE <dbname> SET PARAMETERIZATION FORCED and that might help. If neither of these help, we could periodically dump the plan cache for that database, as discussed as being a problem in Kalen’s blog post referenced above; not an ideal scenario.

The other option is to increase the memory on the server to 16GB or 32GB, if the hardware allows it, which will increase the size of the plan cache as well as the buffer cache. In SQL Server 2005 SP1, on a system with 16GB of memory, if we set max server memory to 14GB the plan cache could use at most 9GB  [(8GB*.75)+(6GB*.5)=(6+3)=9GB], leaving 5GB for the buffer cache.  If we went to 32GB of memory and set max server memory to 28GB, the plan cache could use at most 16GB [(8*.75)+(20*.5)=(6+10)=16GB], leaving 12GB for the buffer cache. Thankfully we have SQL Server 2005 Service Pack 2, 3, and 4 these days which include the changes in plan cache sizing discussed in the Changes to Caching Behavior between SQL Server 2000, SQL Server 2005 RTM and SQL Server 2005 SP2 blog post. In real life, when I was troubleshooting this problem, I spent a week trying to chase down the cause of the disk I/O bottleneck with our Server Admin and SAN Admin, and there wasn’t much that could be done immediately there, so I finally asked if we could increase the memory on the server to 16GB, which did fix the problem. It wasn’t until I had this same problem occur on another system that I actually figured out how to really troubleshoot this down to the root cause.  I couldn’t believe the size of the plan cache on the server with 16GB of memory when I actually learned about this and went back to look at it.

SQL Server is constantly telling a story to anyone that will listen. As the DBA, you have to sit back and listen to all that it’s telling you and then evaluate the big picture and how all the data you can gather from SQL about performance relate to each other. One of the greatest tools out there is actually a free in the form of Diagnostic Scripts for SQL Server 2005 and 2008, created by MVP Glenn Alan Berry. Glenn’s scripts collect a majority of the information that SQL has to offer for rapid troubleshooting of problems, and he includes a lot of notes about what the outputs of each individual query might be telling you.

When I read Pinal’s blog post SQL SERVER – ASYNC_IO_COMPLETION – Wait Type – Day 11 of 28, I noticed that he referenced Checking Memory Related Performance Counters in his post, but there was no real explanation about why checking memory counters is so important when looking at an I/O related wait type. I thought I’d chat with him briefly on Google Talk/Twitter DM and point this out, and offer a couple of other points I noted, so that he could add the information to his blog post if he found it useful.  Instead he asked that I write a guest blog for this. I am honored to be a guest blogger, and to be able to share this kind of information with the community. The information contained in this blog post is a glimpse at how I do troubleshooting almost every day of the week in my own environment. SQL Server provides us with a lot of information about how it is running, and where it may be having problems, it is up to us to play detective and find out how all that information comes together to tell us what’s really the problem.

This blog post is written by Jonathan Kehayias (Blog | Twitter).

Read all the post in the Wait Types and Queue series.

Reference: Pinal Dave (http://blog.SQLAuthority.com)