We need your services again to help us understand the cause of a strange issue.
We had a network issue earlier today. After that we observed that there were two availability groups that were not only down, but also no longer even listed under the Availability Groups in SQL Server Management Studio. When we checked failover cluster manager, they do show up under Services and Applications in the Failover Cluster Manager but in failed states. It is interesting that on the node that has missing AGs, the databases show up as “Restoring”
Just to make sure it’s not a ghost effect, we have tried SQL Server service restarts and reboots but the fault is unchanged.
Any quick ideas?
Since this was an interesting issue, I asked John to share the SQL Server ERRORLOG SQL SERVER – Where is ERRORLOG? Various Ways to Find ERRORLOG Location
I looked into ERRORLOG and found below interesting message on the replica where AG was missing.
2017-02-26 13:26:43.08 spid38s Error: 19435, Severity: 16, State: 1.
2017-02-26 13:26:43.08 spid38s AlwaysOn: WSFC AG integrity check failed for AG ‘PROD_AG’ with error 41041, severity 16, state 0.
2017-02-26 13:26:43.08 spid38s AlwaysOn: The local replica of availability group ‘PROD_AG’ is being removed. The instance of SQL Server failed to validate the integrity of the availability group configuration in the Windows Server Failover Clustering (WSFC) store. This is expected if the availability group has been removed from another instance of SQL Server. This is an informational message only. No user action is required.
2017-02-26 13:26:43.08 spid38s The state of the local availability replica in availability group ‘PROD_AG’ has changed from ‘SECONDARY_NORMAL’ to ‘NOT_AVAILABLE’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.
From the above message, I think, SQL was not able to communicate with Windows cluster and as a safe guard the availability group was removed from SQL.
If this happens, there is no data loss. We just need to recover databases, if needed and drop AG completely. Once cleanup is successful, it should be easy to reconfigure availability group.
Reference: Pinal Dave (http://blog.SQLAuthority.com)