When I work with customers, there are situations where I get a chance to learn something from them. I was engaged on an AlwaysOn Availability Group issue and got some interesting information from a customer, which I am sharing here. In this blog, we will learn how to solve Event ID 1135 – Cluster node 'NodeName' was removed from the active failover cluster membership.
Here are two “Critical” errors which you might see in System Event logs:
Event ID: 1135
Message: Cluster node ‘N2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Event ID: 1177
Message: The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Based on my knowledge of clustering, Event ID 1135 indicates that heartbeat communication failed between some nodes; in most cases, the network connection or communication between the cluster nodes has failed. Event ID 1177 indicates that the cluster service shut down because quorum was lost, caused by the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
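Before changing any settings, it helps to capture the cluster log from around the time of the eviction, so you have evidence to share with the networking team. A minimal sketch using the built-in Get-ClusterLog cmdlet follows; the destination folder and the 15-minute window are just example values:

```powershell
# Generate the failover cluster log for the last 15 minutes on every node
# and copy the files to C:\Temp (example path) for review
Get-ClusterLog -Destination C:\Temp -TimeSpan 15
```

Search the resulting .log files for the timestamps of the 1135/1177 events to see which network interface dropped heartbeats first.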
Of course, your networking team needs to be engaged first to understand the root cause of the network issue. If it is happening on a random basis and the network team has no clue about it, then here are a few things which a DBA can also do. First, relax the cluster heartbeat settings by running the following in an elevated PowerShell window:
$cluster = Get-Cluster
$cluster.SameSubnetDelay = 2000
$cluster.SameSubnetThreshold = 10
$cluster.CrossSubnetThreshold = 10
$cluster.CrossSubnetDelay = 4000
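To understand what these values buy you: the cluster declares a node dead after Delay × Threshold of missed heartbeats. With SameSubnetDelay = 2000 ms and SameSubnetThreshold = 10, a node survives 2000 ms × 10 = 20 seconds of lost heartbeats before eviction, versus roughly 1000 ms × 5 = 5 seconds with typical same-subnet defaults. You can confirm the values took effect with:

```powershell
# Display all same-subnet and cross-subnet heartbeat settings for the cluster
Get-Cluster | Format-List *Subnet*
```

These are cluster-wide properties, so you only need to set them once, not on every node.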
Along with the cluster settings, one of my clients also told me to disable TCP offloading and a few more properties. According to him, they might cause network delays and intermittent failures. You can run the following commands in CMD (run as administrator) on all nodes:
netsh int tcp set global chimney=disabled
netsh int tcp set global rss=disabled
netsh int tcp set global netdma=disabled
netsh int tcp set global autotuninglevel=disabled
netsh interface teredo set state disabled
netsh int ipv4 set global taskoffload=disabled
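After running these on a node, you can verify the global TCP settings actually changed (a reboot may still be needed for some of them to fully apply):

```powershell
# Show the current global TCP parameters, including chimney offload,
# RSS, NetDMA, and the receive window auto-tuning level
netsh int tcp show global
```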
Also, update the NIC drivers, firmware, and teaming software (if any) on all cluster nodes.
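To check which driver versions are currently installed before and after the update, a quick sketch (assumes Windows Server 2012 or later, where the NetAdapter module is available):

```powershell
# List each physical NIC with its driver version and date on this node
Get-NetAdapter | Select-Object Name, InterfaceDescription, DriverVersion, DriverDate
```

Run this on every node and compare the output; mismatched driver versions across cluster nodes are a common source of intermittent heartbeat problems.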
The above steps solved the issue for them on several servers, and they gave me permission to blog about it. If these steps solve the issue for you, please leave a comment and let them know.
Reference: Pinal Dave (https://blog.sqlauthority.com)
Thanks a lot for this post. Before trying these steps, we should also address large packet loss at the guest operating system level on the VMXNET3 vNIC in ESXi: https://kb.vmware.com/s/article/2039495. It reduces the massive unexpected failovers in an AlwaysOn setup with Windows Cluster. Still, this is a workaround, not a solution. I am going to try this fix for my problem and will keep you posted with my feedback. Stay tuned :).