During my On Demand (50 Minutes) consulting sessions, I solve issues that my clients expect to be quick to fix. SQL Server not starting, Always On not failing over, and a cluster not working are a few of the quick problems for which clients engage me. In this blog, I will share a situation where an Always On Availability Group was not coming online due to the error: Did not find the instance to connect in SqlInstToNodeMap key.
There was some instability in the cluster, which caused a few unexpected failovers of the Always On availability group between node1 and node2, sometimes back and forth. When the client contacted me, we found that the clustered resource for the availability group was not coming online.
My first step, always, is to get the error that is being reported by SQL Server, the cluster, or Windows. The event log reported the error below:
Cluster resource ‘PRODAG’ of type ‘SQL Server Availability Group’ in clustered role ‘PRODAG’ failed.
Based on the failed policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
The above error is very generic and does not tell us more than what we already know.
When I checked SQL Server Management Studio, I saw that the secondary replica was not connected to the primary replica. The connected state was “DISCONNECTED” in the DMV, and SSMS showed a “red” symbol for this replica. The next step was to generate a cluster log.
And BINGO! We were able to see some relevant messages there.
INFO [RES] SQL Server Availability Group <PRODAG>: [hadrag] The DeadLockTimeout property has a value of 300000
INFO [RES] SQL Server Availability Group <PRODAG>: [hadrag] The PendingTimeout property has a value of 180000
ERR [RES] SQL Server Availability Group <PRODAG>: [hadrag] Did not find the instance to connect in SqlInstToNodeMap key.
ERR [RHS] Online for resource PRODAG failed.
“ERR” is the tag I look for in a cluster log, and it is the one you should focus on. Just before the failure, we see this error: Did not find the instance to connect in SqlInstToNodeMap key. I searched and found that SqlInstToNodeMap is a registry key which should hold the same information as sys.dm_hadr_instance_node_map.
When I checked the primary replica, we were not able to see the AG under the “Availability Groups” node in SSMS. Also, there were no replicas listed under the “Availability Replicas” node. When we queried sys.dm_hadr_database_replica_states, we did not get any results.
All of the above symptoms mean that there was a metadata mismatch between the information in the cluster and the information in SQL Server. Even the two replicas did not agree with each other about the availability group. We ran the command below on the secondary to remove its information about the AG. We could not use the UI because it was giving an error.
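If the cluster log is large, scanning for the ERR tag by eye gets tedious. Below is a minimal sketch in Python, assuming the log has already been exported to a text file (for example with the Get-ClusterLog cmdlet) and that each entry carries its level (INFO, WARN, ERR, DBG) right before the bracketed component name, as in the lines above. The names find_errors and SAMPLE are only for illustration and are not part of any tool.

```python
import re

# Matches the level token as a standalone word followed by "[component]".
# It works with or without the process/thread/timestamp prefix that a full
# cluster log line normally carries.
LEVEL_RE = re.compile(r"(?:^|\s)(INFO|WARN|ERR|DBG)\s+\[")


def find_errors(log_text, context=1):
    """Return (line_number, line, preceding_lines) for every ERR entry.

    `context` is how many lines before each ERR entry to keep, because the
    lines just before a failure often explain it.
    """
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        match = LEVEL_RE.search(line)
        if match and match.group(1) == "ERR":
            start = max(0, i - context)
            before = [prev.strip() for prev in lines[start:i]]
            hits.append((i + 1, line.strip(), before))
    return hits


# The four cluster log lines quoted in this post.
SAMPLE = """\
INFO [RES] SQL Server Availability Group <PRODAG>: [hadrag] The DeadLockTimeout property has a value of 300000
INFO [RES] SQL Server Availability Group <PRODAG>: [hadrag] The PendingTimeout property has a value of 180000
ERR [RES] SQL Server Availability Group <PRODAG>: [hadrag] Did not find the instance to connect in SqlInstToNodeMap key.
ERR [RHS] Online for resource PRODAG failed.
"""

if __name__ == "__main__":
    for number, line, before in find_errors(SAMPLE):
        print(number, line)
```

On the sample lines from this post, the function reports lines 3 and 4, and the first of them is the SqlInstToNodeMap message that pointed us to the root cause.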
DROP AVAILABILITY GROUP PRODAG
As soon as we executed it, the databases went into the restoring state and the AG information was cleared from all DMVs and from the cluster as well. Then we recreated the availability group using the AG wizard, and we were back in business in less than 20 minutes of the call with me.
I truly hope that this blog can help someone who is facing the same issue with an AG.
Reference: Pinal Dave (https://blog.sqlauthority.com)