All RDBMS products are good, but I love SQL Server most because its fun to troubleshoot the issue. Meaningful error messages can make life easier and helps us in fixing the issue faster. Another reason I love SQL Server is that is my source of earning and I can run my family life due to SQL Server. In this blog, I am going to share my experience to fix an issue where my client was unable to failover SQL Server to another node.
If you search the internet, you would find many probable causes of failover issue. As I mentioned earlier, the exact cause can be determined with an error message and logical reasoning. To hunt right error message, I started looking at various logs.
SQL Server ERRORLOG:
You can refer to below blog which talks about ERRORLOG.
<DateTime> spid20s Service Broker manager has shut down.
<DateTime> spid11s SQL Server is terminating in response to a ‘stop’ request from Service Control Manager. This is an informational message only. No user action is required.
<DateTime> spid11s SQL Trace was stopped due to server shutdown. Trace ID = ‘1’. This is an informational message only; no user action is required.
From above we can see that it was not a SQL Server crash or a failure. SQL Server received a STOP request from Service Control manager due to which it went down. Which means SQL Server had started successfully, but due to some issues, it was asked to shut down. Now, remember this is a cluster. There are 2 stages involved to get the SQL Server successfully started in a cluster.
- Get SQL Server service locally started.
- Get SQL Server resource online in failover cluster manager.
If any of the 2 stages fail, SQL Server will go to failed state or it will fallback to the other node. In the current situation, it looked like it failed in the 2nd stage. Because it was the 2nd stage, it makes all the sense to look at the cluster log to check why it failed.
This file needs to be generated manually. Refer my earlier blog to know the steps. SQL SERVER – Steps to Generate Windows Cluster Log?
Let me simplify with only the main errors of interest. Generally, you need to focus on lines having “ERR” in them.
- [Microsoft][SQL Server Native Client 10.0]Registry information is corrupt or missing. Make sure the provider installed and registered correctly.
- [Microsoft][SQL Server Native Client 10.0]Client unable to establish a connection
- [Microsoft][SQL Server Native Client 10.0]A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online.
The errors are talking about registry information missing or corrupted. Luckily, the client had another SQL Server instance present in the same cluster. They were OK to test a quick failover. As I was suspecting in my mind, that SQL instance too failed to come online, and the errors were the same. So, this was an issue related ONLY to the NODE 2. Then when I was comparing the settings between both the nodes, I noticed that it with the 64bit Client Protocols were missing on Node2.
Based on my search on the internet, these protocols are represented by the registry key:
As expected the above key was not present on Node2 and was present in Node1. Now I knew that why we did not have any issues on Node1. Without making the solution too complicated, we extracted the above-mentioned registry key from Node1 and imported it on Node2. After that, we were able to successfully failover the SQL Server instance. I must add that registry modification needs to be done with caution because there is no undo possible without backup. That’s why I always back up the registry before taking any such actions.
Reference: Pinal Dave (https://blog.sqlauthority.com)