SQL SERVER – AlwaysOn Availability Group Stuck in Resolving State For Long time

February 27, 2016

This is one of the most common issues I hear about from clients who use SQL Server AlwaysOn, and they often don't know why it happens.

What is resolving state in SQL Server AlwaysOn?

When an availability group is online in Failover Cluster Manager, each replica is in either the primary or the secondary role. Resolving is an intermediate state that a replica passes through while it transitions from primary to secondary or vice versa. If the transition does not succeed for some reason, the replica ends up in the "resolving" state. In this state, the database is not accessible.

What should you do to find the cause?

I always ask for the ERRORLOG in such situations to understand the sequence of events (see: Where is ERRORLOG? Various Ways to Find its Location).

2016-02-22 17:36:14.75 Server      ***Unable to get thread context for spid 0
2016-02-22 17:36:14.75 Server      * *******************************************************************************
2016-02-22 17:36:14.75 Server      *
2016-02-22 17:36:14.75 Server      * BEGIN STACK DUMP:
2016-02-22 17:36:14.75 Server      *   04/22/15 17:36:14 spid 7416
2016-02-22 17:36:14.75 Server      *
2016-02-22 17:36:14.75 Server      * Non-yielding Scheduler
2016-02-22 17:36:14.75 Server      *
2016-02-22 17:36:14.75 Server      * *******************************************************************************
2016-02-22 17:36:14.76 Server      Stack Signature for the dump is 0x000000000000025C
……
2016-02-22 17:36:26.19 Server      Error: 19407, Severity: 16, State: 1.
2016-02-22 17:36:26.19 Server      The lease between availability group ‘ProdAG’ and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
2016-02-22 17:36:26.20 Server      AlwaysOn: The local replica of availability group ‘ProdAG’ is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.

When I dug further, I found that the non-yielding scheduler was caused by a restore command. The restore added to the I/O load on the server, which caused a non-yielding scheduler condition inside SQL Server. That kept SQL Server from renewing its lease with the cluster, so the lease for the availability group timed out, which in turn put the availability group into the resolving state.
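
The sequence described above (a non-yielding scheduler dump followed shortly by error 19407) can be spotted mechanically when the ERRORLOG is long. Below is a minimal Python sketch, assuming the ERRORLOG has already been read into a string. The function name find_lease_expirations, the 60-second window, and the sample text are illustrative assumptions, not part of SQL Server or of the original investigation. It only flags candidates for a closer look; it does not prove cause and effect.

```python
import re
from datetime import datetime, timedelta

# ERRORLOG lines look like: "2016-02-22 17:36:26.19 Server      <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{2})\s+(\S+)\s+(.*)$"
)
# Accept straight quotes (real logs) and curly quotes (text copied from a web page).
LEASE_RE = re.compile(r"lease between availability group ['‘]([^'’]+)['’]")

# Shortened sample based on the excerpt in this post (hypothetical test input).
SAMPLE_LOG = """\
2016-02-22 17:36:14.75 Server      * BEGIN STACK DUMP:
2016-02-22 17:36:14.75 Server      * Non-yielding Scheduler
2016-02-22 17:36:26.19 Server      Error: 19407, Severity: 16, State: 1.
2016-02-22 17:36:26.19 Server      The lease between availability group 'ProdAG' and the Windows Server Failover Cluster has expired.
"""


def find_lease_expirations(log_text, window_seconds=60):
    """Return lease-expiry events and whether a non-yielding scheduler
    dump was logged within window_seconds before each one."""
    last_non_yielding = None
    findings = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip continuation lines and elisions
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        message = match.group(3)
        if "Non-yielding Scheduler" in message:
            last_non_yielding = stamp
        lease = LEASE_RE.search(message)
        if lease and "has expired" in message:
            gap = None
            if last_non_yielding is not None:
                gap = (stamp - last_non_yielding).total_seconds()
            findings.append({
                "timestamp": stamp,
                "availability_group": lease.group(1),
                "seconds_since_non_yielding": gap,
                "preceded_by_non_yielding": gap is not None
                and gap <= window_seconds,
            })
    return findings


if __name__ == "__main__":
    for finding in find_lease_expirations(SAMPLE_LOG):
        print(finding)
```

When reading a real ERRORLOG file, check its encoding first; the sketch deliberately takes text rather than a file path so that decoding stays the caller's decision.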

Along with the ERRORLOG, it is always good to collect the Windows event logs (Application and System) and the cluster log from all participating nodes.

Have you ever faced a situation where the availability group was stuck in the resolving state?

If you like this blog post, here is another blog post where we have discussed how you can add files to the database in the AlwaysOn Availability Group.

Reference: Pinal Dave (https://blog.sqlauthority.com)

20 Comments

Perry whittle
February 27, 2016 2:42 pm
There is no cluster log from Windows 2008 onwards; you can export cluster events via PowerShell to a log, and this covers all nodes.
If secondary databases are stuck in the resolving state, check the status of the AO cluster role and its resource.
Regards Perry
- Pinal Dave
  March 14, 2016 4:03 am
  Get-ClusterLog would do that.
Sharon Rimer
February 27, 2016 3:02 pm
Good post.
Yes, I did face such a situation.
I started the AG all over again.
- Pinal Dave
  March 14, 2016 4:04 am
  Thanks Sharon!
Jude
March 4, 2016 1:00 am
I would have thought the db needs to be taken out of the AG for the restore to initiate, and the resolving state would have occurred as a result on the rest of the nodes.
- Pinal Dave
  March 14, 2016 4:05 am
  In this case, a restart fixed it.
DigitalAge17
March 22, 2016 10:24 pm
Rebooting each node will fix this issue. It helped me to fix mine.
Kul
July 19, 2016 12:57 am
Hi Pinal, if this happens often, what would be a possible resolution?
Michele
March 27, 2017 8:36 pm
Hi Pinal,
While running a maintenance plan, DBCC CHECKDB found corruption in a database; it produced a STACK DUMP and caused my AG to fail over…
Is this the correct behaviour in the case of a maintenance plan?
- Pinal Dave
  March 29, 2017 7:51 am
  Having a large number of dumps can cause a failover. Fix the corruption first.
yogigollapudi
May 16, 2017 3:21 am
Hi Pinal, in my AG environment, the AG resource went offline because of least timeout and automatic failover did not happen. I did a manual failover after a force start of the cluster and the AG resource. Now the databases are online. How can I avoid this scenario in the future?
yogigollapudi
May 16, 2017 3:31 am
** Lease timeout not a least timeout
Vikash Singh
July 22, 2017 1:58 am
Hi Pinal,
I have come across a situation where a SELECT statement running on the primary goes to ROLLBACK when killed.
Could you please suggest a way to troubleshoot this?
Talwinder Singh
September 27, 2017 10:13 am
Hi Pinal,
I have a situation where the secondary goes into the resolving state during automatic failover. Automatic failover from node1 to node2 works fine, but from node2 to node1 it does not work and node1 stays in the resolving state forever. Manual failover works fine both ways. Here are the error messages I see in Event Viewer.
Event ID: 1069
Cluster resource ‘XEN_DESKTOP’ of type ‘SQL Server Availability Group’ in clustered role ‘XEN_DESKTOP’ failed.
Event ID: 1205
The Cluster service failed to bring clustered role ‘XEN_DESKTOP’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.
Event ID: 1254
Clustered role ‘XEN_DESKTOP’ has exceeded its failover threshold.
colleen
October 10, 2017 10:37 pm
Hello, I know how to look at the logs but I cannot fix the issue. You don’t leave us any resolution or suggestions so unfortunately I don’t find this helpful
Jorge Berrios
October 15, 2019 9:08 pm
I rebooted the secondary replica and then the AG started synchronizing again.
Abhishek Jain
June 19, 2020 12:12 am
This did the trick for me!!
Stop the cluster service, then bring it online and fix the condition with the witness:
net.exe stop clussvc
net.exe start clussvc /forcequorum
Force replica 01 to come online using the command:
ALTER AVAILABILITY GROUP AGNAME FORCE_FAILOVER_ALLOW_DATA_LOSS;
Bhargav
October 21, 2020 12:04 am
We have experienced SQL dumps for a couple of processes most of the time, after which the AG databases go into the RESOLVING state. It turned out to be an issue with how SQL Server 2019 performs an internal function on binary concatenation. The fix seems to be in CU8. Every time this issue arises, we have seen an EXCEPTION_ACCESS_VIOLATION SQL dump.
kumar sanku
October 23, 2020 11:55 pm
Restarting the node that was in the resolving state fixed the issue for me.
Gopinath M
April 19, 2022 11:51 pm
"When I dug further I found that non-yielding came due to a restore command." Can you please let me know how to find the restore command?