SQL SERVER – SQL Clustered Resource in Online Pending State for Long Time Before Coming Online


While doing a Comprehensive Database Performance Health Check, I always ask my clients whether they have any pain points with the current state of their database or server. One client asked an interesting question, which I am going to answer in this blog post: why is my SQL clustered resource in the Online Pending state for a long time before coming online?

Before I show you how I found the cause, note that I have written a few earlier blog posts about a different situation, where SQL was not coming online at all.

When SQL is in the Online Pending state, the SQL Server service is either not yet ready to accept connections or the cluster is unable to connect to it. To investigate, we generated the cluster log (see SQL SERVER – Steps to Generate Windows Cluster Log?).


Here are a few important events:

  1. Here is the offline event of Node1

Log Name: Microsoft-Windows-FailoverClustering/Operational
Source: Microsoft-Windows-FailoverClustering
Date: 1/28/2018 1:49:29 PM
Event ID: 1204
Task Category: Resource Control Manager
Level: Information
User: SYSTEM
Computer: NODE1.domain.com
Description: The Cluster service successfully brought the clustered service or application ‘SQL Server (MSSQLSERVER)’ offline.

  2. Here is the online event on Node2

Log Name: Microsoft-Windows-FailoverClustering/Operational
Source: Microsoft-Windows-FailoverClustering
Date: 1/28/2018 1:57:11 PM
Event ID: 1201
Task Category: Resource Control Manager
Level: Information
User: SYSTEM
Computer: NODE2.domain.com
Description: The Cluster service successfully brought the clustered service or application ‘SQL Server (MSSQLSERVER)’ online.

If you observe closely, there is a gap of roughly 8 minutes (1:49:29 PM to 1:57:11 PM) between the two events above.
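The gap can be computed directly from the two Date fields. Here is a minimal Python sketch, assuming the timestamps are copied exactly as they appear in the events above; the function name `failover_gap` is only for illustration.

```python
from datetime import datetime

# Timestamp format used in the two events above (month/day/year, 12-hour clock).
EVENT_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def failover_gap(offline_time, online_time):
    """Return the time between the offline and online events as a timedelta."""
    offline = datetime.strptime(offline_time, EVENT_TIME_FORMAT)
    online = datetime.strptime(online_time, EVENT_TIME_FORMAT)
    return online - offline


# Event ID 1204 on NODE1 (offline) and Event ID 1201 on NODE2 (online).
gap = failover_gap("1/28/2018 1:49:29 PM", "1/28/2018 1:57:11 PM")
print(gap)  # 0:07:42
```

So the resource was unavailable for 7 minutes and 42 seconds during this failover.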

  3. When we looked at the cluster logs, we found the messages below.
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Service status checkpoint was changed from 0 to 1 (wait hint 20000). Pid is 2431
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Service status checkpoint was changed from 1 to 2 (wait hint 20000). Pid is 2431

The checkpoint number kept increasing: 2 to 3, 3 to 4, and so on, with a gap of 2 seconds between consecutive lines. Finally, after around 40 such messages, it came online.

  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Service is started. SQL Server pid is 2431
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Connect to SQL Server …
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] The connection was established successfully
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Diagnostics is started
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Online worker helper is started
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] SQL Server component ‘system’ health state has been changed from ‘’ to ‘clean’
  • SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] SQL Server resource state is changed from ‘ClusterResourceOnlinePending’ to ‘ClusterResourceOnline’
  • Resource SQL Server (MSSQLSERVER) has come online. RHS is about to report status change to RCM
  • HandleMonitorReply: ONLINERESOURCE for ‘SQL Server (MSSQLSERVER)’, gen(0) result 0.
  • TransitionToState(SQL Server (MSSQLSERVER)) OnlinePending–>Online.
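When the cluster log is large, counting the checkpoint messages by hand is tedious. Here is a rough Python sketch that counts them, assuming the message text matches the lines quoted above. A real cluster log line also carries a timestamp and thread prefix, which the pattern simply skips. The names `CHECKPOINT_PATTERN` and `summarize_checkpoints` are made up for this example.

```python
import re

# Matches the [sqsrvres] checkpoint message quoted in the cluster log above.
CHECKPOINT_PATTERN = re.compile(
    r"\[sqsrvres\] Service status checkpoint was changed from (\d+) to (\d+) "
    r"\(wait hint (\d+)\)"
)


def summarize_checkpoints(lines):
    """Return (number of checkpoint messages, last checkpoint value seen)."""
    transitions = 0
    last_checkpoint = None
    for line in lines:
        match = CHECKPOINT_PATTERN.search(line)
        if match:
            transitions += 1
            last_checkpoint = int(match.group(2))
    return transitions, last_checkpoint


sample = [
    "SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Service status "
    "checkpoint was changed from 0 to 1 (wait hint 20000). Pid is 2431",
    "SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Service status "
    "checkpoint was changed from 1 to 2 (wait hint 20000). Pid is 2431",
    "SQL Server <SQL Server (MSSQLSERVER)>: [sqsrvres] Service is started. "
    "SQL Server pid is 2431",
]
print(summarize_checkpoints(sample))  # (2, 2)
```

A long run of these messages before "Service is started" suggests the SQL Server service itself is slow to start, rather than the cluster being slow to detect it.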

WORKAROUND/SOLUTION

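When we looked at the ERRORLOG, we found database recovery messages spanning those 8 minutes. We also found a very large number of virtual log files (VLFs) in the transaction log, which seems to be the root cause of the issue: recovery has to process the VLFs at startup, so a very high count slows it down.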

After we reduced the VLF count by taking log backups and shrinking the log file, the issue was resolved and SQL failover became very quick.
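The steps above can be sketched as a small Python helper that only prints the T-SQL to review and run by hand. This is an illustration, not the exact commands used for this client. The database name, logical log file name, backup path and target size are all hypothetical placeholders. The VLF count query assumes a SQL Server version that has `sys.dm_db_log_info` (SQL Server 2016 SP2 or later).

```python
def quote_name(identifier):
    """Bracket-quote a T-SQL identifier, doubling any closing bracket."""
    return "[" + identifier.replace("]", "]]") + "]"


def quote_literal(value):
    """Build an N'...' string literal, doubling any single quote."""
    return "N'" + value.replace("'", "''") + "'"


def vlf_cleanup_script(database, log_file, backup_path, target_mb):
    """Return T-SQL statements: count VLFs, back up the log, shrink the log file."""
    return [
        # One row per VLF; assumes sys.dm_db_log_info is available.
        f"SELECT COUNT(*) AS vlf_count FROM sys.dm_db_log_info(DB_ID({quote_literal(database)}));",
        f"BACKUP LOG {quote_name(database)} TO DISK = {quote_literal(backup_path)};",
        f"USE {quote_name(database)};",
        f"DBCC SHRINKFILE ({quote_literal(log_file)}, {int(target_mb)});",
    ]


# All four arguments below are hypothetical examples.
script = vlf_cleanup_script("SalesDB", "SalesDB_log", r"D:\Backup\SalesDB_log.trn", 1024)
for statement in script:
    print(statement)
```

In practice, the log backup and shrink may need to be repeated a few times before the file shrinks. It is also worth growing the log back to a suitable size in larger increments afterward, so the VLF count does not climb again.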

Reference: Pinal Dave (https://blog.sqlauthority.com)
