As a part of my passion, I try to help people in fixing simple issues and explain the behavior by looking at logs. Here is an email from my friend about Availability Group Missing or Deleted Automatically.
We need your services again to help us understand the cause of a strange issue.
We had a network issue earlier today. After that we observed that there were two availability groups that were not only down, but also no longer even listed under the Availability Groups in SQL Server Management Studio. When we checked failover cluster manager, they do show up under Services and Applications in the Failover Cluster Manager but in failed states. It is interesting that on the node that has missing AGs, the databases show up as “Restoring”
Just to make sure it’s not a ghost effect, we have tried SQL Server service restarts and reboots but the fault is unchanged.
Any quick ideas?
Since this was an interesting issue, I asked John to share the SQL Server ERRORLOG SQL SERVER – Where is ERRORLOG? Various Ways to Find ERRORLOG Location
I looked into ERRORLOG and found below interesting message on the replica where AG was missing.
2017-02-26 13:26:43.08 spid38s Error: 19435, Severity: 16, State: 1.
2017-02-26 13:26:43.08 spid38s AlwaysOn: WSFC AG integrity check failed for AG ‘PROD_AG’ with error 41041, severity 16, state 0.
2017-02-26 13:26:43.08 spid38s AlwaysOn: The local replica of availability group ‘PROD_AG’ is being removed. The instance of SQL Server failed to validate the integrity of the availability group configuration in the Windows Server Failover Clustering (WSFC) store. This is expected if the availability group has been removed from another instance of SQL Server. This is an informational message only. No user action is required.
2017-02-26 13:26:43.08 spid38s The state of the local availability replica in availability group ‘PROD_AG’ has changed from ‘SECONDARY_NORMAL’ to ‘NOT_AVAILABLE’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.
From the above message, I think, SQL was not able to communicate with Windows cluster and as a safe guard the availability group was removed from SQL.
If this happens, there is no data loss. We just need to recover databases, if needed and drop AG completely. Once cleanup is successful, it should be easy to reconfigure availability group.
Reference: Pinal Dave (http://blog.SQLAuthority.com)
Thanks for the post. I ended up here because I’m running into a similar problem. Except I feel you post only acknowledges the problem, without providing any insight as to what may have caused the issue.
In my case, I’m trying to setup an Availability Group on Azure VMs. The failure occurs upon deployment, when attempting to run the CREATE AVAILABILITY GROUP command, although my automated scripts have no issue whatsoever when ran in a local DEV environment.
I am currently working of workarounds in order to get things working in Azure. The main issue is with the way Azure handles IP addresses and routing. All IP addresses used in Azure VMs need to be defined in Azure, otherwise they are ignored. Therefore, the cluster IP address which moved around from one node to the other is simply not reachable. I’ve found posts about how to set up a Internal Load Balancer hack, which consists of setting up the cluster IP address as the FrontEnd address of the ILB, then setting up probes and rules for the ILB to be able to detect which of the nodes is currently the primary node, and route traffic accordingly.
I got this hack to work.. so far… I am now able to have node 2 join the cluster, set up CAU, etc… Now I run into the problem of running into an error when trying to create the availability group. I believe it is probably related to the Cluster IP address routing issue, or something similar. Here are the actual error messages I get in my “Proof Of Concept” environment:
– The state of the local availability replica in availability group ‘ag_POC_PROD’ has changed from ‘NOT_AVAILABLE’ to ‘RESOLVING_NORMAL’. The state changed because the local availability replica is joining the availability group. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
– The state of the local availability replica in availability group ‘ag_POC_PROD’ has changed from ‘RESOLVING_NORMAL’ to ‘NOT_AVAILABLE’. The state changed because either the associated availability group has been deleted, or the local availability replica has been removed from another SQL Server instance. For more information, see the SQL Server error log or cluster log. If this is a Windows Server Failover Clustering (WSFC) availability group, you can also see the WSFC management console.
– Error: 19435, Severity: 16, State: 1.
– Always On: AG integrity check failed for AG ‘ag_POC_PROD’ with error 41044, severity 16, state 1.
I’m working on it… Googling the error, I ended up here.
I know I’ll have to apply a similar Load Balancer hack for the Listener. But I haven’t reached the point yet where I can create the listener, having not been able to even create the availability group.
Back to your friend John: If my gut feeling is right, he probably ran into his problem after some soft of network issue. Maybe a switch or router fault or misconfiguration has caused a similar error pattern to the one I’m running into, and has caused the node to “disconnect”.
Follow up: If anyone else is running into the same issue as I was, here’s how I finally managed to solve the problem for my specific case:
After reviewing the Cluster logs, I found the following errors:
ODBC Error:  [Microsoft][SQL Server Native Client 11.0][SQL Server]The user does not have permission to perform this action. (297)
Failed to run diagnostics command. See previous log for error message
Disconnect from SQL Server
It turned out that after solving the network issues, the problem now was improper permissions. After a little research, I ended up running the following SQL commands:
GRANT ALTER ANY AVAILABILITY GROUP TO [NT AUTHORITYSYSTEM];
GRANT CONNECT SQL TO [NT AUTHORITYSYSTEM];
GRANT VIEW SERVER STATE TO [NT AUTHORITYSYSTEM];
Afterward, I was able to create the Availability Group.
What troubles me is that I never had to grant those permissions under my local DEV environment, but here under Azure, I had to.
This is amazing – I wish you can write a blog post on this subject and I will publish it with due credit to you.
I have configured SQL server always on, it working properly, had configured listener too listener is able to communicate from any of the cluster node, its IP is online…but I am unable to do the failover of SQL server availablity group, its failing with timeout error 41131.
1)I have granted the access to endpoints to NT authority system
2)port 5022 is listening from both the servers
3)have also granted access to endpoints
4)error code 1257 ,1258 is getting recorded in cluster logs for CNO permissions, but I am getting same in other servers while doing failover but its getting failed over (so not sure if its exactly a cause of issue)
Is anyone out here who can comment a help here to me
am also facing the same issue any solution?..no option to join the db’s in secondary as ag group missing in secondary.do we need to reconfugure the ag again?