Over the past few days, I have been contacted by clients about AlwaysOn related issues, and I have been writing blogs about them. In this blog, we will learn how to fix the wait type HADR_AR_CRITICAL_SECTION_ENTRY.
THE SITUATION
My client was running SQL Server in a virtual environment. Due to some instability in their network infrastructure, the Windows cluster lost quorum for a few minutes and then came back. As you might know, an AlwaysOn availability group is tightly coupled with the Windows Server Failover Cluster, so anything that happens in the cluster can also impact the availability group. That is precisely what happened here.
As usual, they sent me an email, I responded with GoToMeeting details, and we were talking to each other within a few minutes. When I joined the call with them, here is what we found:
- All of our AG modification queries (removing availability database, removing availability replica) were stuck waiting on HADR_AR_CRITICAL_SECTION_ENTRY.
- We were unable to make modifications to the AG, as it was in an inconsistent state, pending an update to the state of the replica.
- Per the Microsoft docs, this wait "Occurs when an Always On DDL statement or Windows Server Failover Clustering command is waiting for exclusive read/write access to the runtime state of the local replica of the associated availability group." A query to spot sessions stuck on this wait is sketched right after this list.
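If you want to confirm which sessions are stuck on this wait, here is a rough sketch using the standard DMVs (nothing here is specific to my client's environment). It lists active requests waiting on HADR_AR_CRITICAL_SECTION_ENTRY along with the text of the waiting statement:

-- List requests currently waiting on HADR_AR_CRITICAL_SECTION_ENTRY
SELECT r.session_id,
       r.status,
       r.command,
       r.wait_type,
       r.wait_time AS wait_time_ms,
       t.text      AS waiting_statement
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.wait_type = N'HADR_AR_CRITICAL_SECTION_ENTRY';

In our case, every AG DDL statement we tried showed up here with a steadily increasing wait time.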
SOLUTION/WORKAROUND
Based on my research on the internet, a restart of the SQL instance is the only way to come out of this.
We set the AG failover mode to manual and restarted both replicas; after doing so, our secondary replica became synchronized within a few minutes, and we were able to successfully remove databases from the AG. We tested failover back and forth, and everything was working as expected.
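If it helps, here is a rough T-SQL sketch of those steps. The AG, replica, and database names (MyAG, SQLNODE1, SQLNODE2, MyDB) are placeholders for your own environment, not from the actual client system:

-- Step 1: set both replicas to manual failover (run on the primary)
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON N'SQLNODE1' WITH (FAILOVER_MODE = MANUAL);
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON N'SQLNODE2' WITH (FAILOVER_MODE = MANUAL);

-- Step 2: restart the SQL Server service on both replicas (done outside T-SQL).

-- Step 3: once the secondary shows SYNCHRONIZED again, remove the database
-- from the availability group (run on the primary)
ALTER AVAILABILITY GROUP [MyAG] REMOVE DATABASE [MyDB];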
Have you seen this wait in your environment? It would be great if you could share the cause and how you came out of it in the comments.
Reference: Pinal Dave (https://blog.sqlauthority.com)
4 Comments
Was there any data loss during the downtime? We're also trying to set up the same in our environment.
We ran into this issue just today. Something as yet unknown caused quorum to be lost, and it eventually failed over. Except one of our DBs was stuck in "not synchronizing" on both the primary and the replica, and therefore not available to the application. We then saw the same wait type with any action on the AG itself. Our only solution was instance restarts followed by a full reboot of the current primary. Did you end up finding a root cause? SQL 2016 SP2-GDR
We had a similar issue in our prod environment. We have a 3-node cluster where 2 primaries share the same secondary. We had an issue with one of the sets while the other set kept working throughout. The issue started with a lease timeout expiration. The first time, the cluster was able to recover on its own, but the second time around synchronization was suspended and the secondary went into the resolving state. All databases on the secondary had a red cross. Similar to what you experienced, none of the AG modification queries worked and they kept waiting on "HADR_AR_CRITICAL_SECTION_ENTRY".
We are on SQL Server 2017.
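In case it is useful to anyone hitting the same symptoms, a rough way to check the database synchronization state and the replica role (PRIMARY / SECONDARY / RESOLVING) from T-SQL, using only the standard AlwaysOn DMVs, is:

-- Database-level synchronization state on the local replica
SELECT DB_NAME(drs.database_id)          AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;

-- Replica-level role and connection state
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.operational_state_desc,
       ars.connected_state_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = ars.replica_id;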
Bit of a late reply but more of a content add if anything.
I’ve experienced a few of these in my current DBA role, and they all come as a result of an unstable or saturated network between the nodes due to an ‘infrastructure event’ (read into that as you will, but in our last case a bunch of VMs went down and the network became saturated as a result).
I went as far as digging into the WFC logs to figure it out, and the best I came out with was delayed responses between secondaries, with no real fix other than what the blog post already suggests.
When encountering this SQL error, I would always keep in the back of my mind to check network-related events as the source and get that team involved. You can always rely on the WFC error logs to confirm node communication issues as well. The AG tech is so intertwined with WFC, and the whole thing is built on stable intra-cluster communication.
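To complement the WFC error logs, the cluster membership and quorum state are also exposed inside SQL Server. A quick sketch (standard DMVs only, nothing environment specific) for a first sanity check of node and vote state:

-- Cluster name and quorum configuration as seen by SQL Server
SELECT cluster_name, quorum_type_desc, quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Member nodes, their current state, and quorum votes
SELECT member_name, member_type_desc, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;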