SQL SERVER – AlwaysOn Availability Group Stuck in Resolving State For Long time

This is one of the common issues which I heard from my clients using SQL Server AlwaysOn and they don’t know why it happens.

SQL SERVER - AlwaysOn Availability Group Stuck in Resolving State For Long time 108-1

What is resolving state in SQL Server AlwaysOn?

When there is an availability group, the replica would be either in primary state or secondary state – when its online in failover cluster manager. Resolving is an intermediate state when the transition is happening from primary to secondary or vice versa. If due to some reason the transition is not successful, it goes to “resolving” state. In this state, the database is not accessible.

Solarwinds

What you should do to find the cause?

I always ask for ERRORLOG in such situations to know the series of events. Where is ERRORLOG? Various Ways to Find its Location

2016-02-22 17:36:14.75 Server      ***Unable to get thread context for spid 0
2016-02-22 17:36:14.75 Server      * *******************************************************************************
2016-02-22 17:36:14.75 Server      *
2016-02-22 17:36:14.75 Server      * BEGIN STACK DUMP:
2016-02-22 17:36:14.75 Server      *   04/22/15 17:36:14 spid 7416
2016-02-22 17:36:14.75 Server      *
2016-02-22 17:36:14.75 Server      * Non-yielding Scheduler
2016-02-22 17:36:14.75 Server      *
2016-02-22 17:36:14.75 Server      * *******************************************************************************
2016-02-22 17:36:14.76 Server      Stack Signature for the dump is 0x000000000000025C
……
2016-02-22 17:36:26.19 Server      Error: 19407, Severity: 16, State: 1.
2016-02-22 17:36:26.19 Server      The lease between availability group ‘ProdAG’ and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
2016-02-22 17:36:26.20 Server      AlwaysOn: The local replica of availability group ‘ProdAG’ is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.

When I dug further I found that non-yielding came due to a restore command. The restore contributed to the IO load on the server, causing a non-yielding scheduler to occur within the SQL Server.  This kept the SQL Server from renewing the lease with the cluster, causing the lease timeout for the availability group, which then caused the availability to go into a resolving state

Along with Errorlog, it’s always good to have Windows event logs (Application and System) and Cluster log from all participating nodes.

Have you ever faced such situation where the availability group was in resolving error?

If you like this blog post, here is another blog post where we have discussed how you can add files to the database in the AlwaysOn Availability Group.

Reference: Pinal Dave (https://blog.sqlauthority.com)

Solarwinds
, ,
Previous Post
SQL SERVER – 2016 – Check Value as JSON With ISJSON()
Next Post
SQL SERVER – Unable to start SQL Server Agent – Failed to Initialize SQL Agent log

Related Posts

16 Comments. Leave new

  • There is no cluster log in Windows 2008 onwards, you can export cluster events via powershell to a log and this covers all nodes.
    If secondary databases are stuck in resolving state check the status of the AO cluster role and its resource
    Regards Perry

    Reply
  • Good lost.
    Yes I did faced such situation.
    I start all over again the AG.

    Reply
  • I would have thought the db neededs to be taken out of the AG for the restore to initiate and tge resolving state would have occurred as a result at the rest of the nodes.

    Reply
  • Rebooting each node will fix this issue. it helped me to fix mine.

    Reply
  • Hi Pinal, If this often what would be a possible resolution.

    Reply
  • Hi Pinal,
    While doing a maintenance plan, DBCC CHECKDB find a corrupion on a database it produce a STACK DUMP and cause my AG to failover…
    It is the correct behaviour in case of maintenace plan?

    Reply
  • yogigollapudi
    May 16, 2017 3:21 am

    Hi pinal, on my AG environent , Ag resource went to offline because of least timeout and automatic failover doesnot happened. i did manual failover after force start of cluster and Ag resource. now the databases are online. how to avoid this scenario in future?

    Reply
  • yogigollapudi
    May 16, 2017 3:31 am

    ** Lease timeout not a least timeout

    Reply
  • Vikash Singh
    July 22, 2017 1:58 am

    Hi Pinal,

    I have come across a situation where a SELECT statement running on Primary goes to ROLLBACK when killed.
    Could you please suggest a way to troubleshoot this ?

    Reply
  • Hi Pinal,

    I have a situation where secondary goes into resolving state during automatic failover. Node1 to node2 automatic failover works fine but node2 to node1 does not work and node1 goes into resolving state forever. Manual failover works fine both way. Here are the error messages I see in event viewer.

    Event ID: 1069
    Cluster resource ‘XEN_DESKTOP’ of type ‘SQL Server Availability Group’ in clustered role ‘XEN_DESKTOP’ failed.

    Event ID: 1205
    The Cluster service failed to bring clustered role ‘XEN_DESKTOP’ completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

    Event ID: 1254
    Clustered role ‘XEN_DESKTOP’ has exceeded its failover threshold.

    Reply
  • Hello, I know how to look at the logs but I cannot fix the issue. You don’t leave us any resolution or suggestions so unfortunately I don’t find this helpful

    Reply
  • I reboot the secondary replica and then AG start to synchronizing again

    Reply

Leave a Reply

Menu