One of my clients contacted me and informed that SQL in cluster failed over from Node 01 to Node 02. It failed to come online on Node 02, failed back to Node 01, and it failed to come online there either. They contacted me to know why they are not able to start SQL Server in the cluster. Let us learn how to start the SQL Resource in Cluster when there is a huge master database.
Their expectation from me was to know why this is happening and what should they do to resolve it so that you’re able to bring your SQL Server online.
To fix such issue, I always start from ERRORLOG file. If you are new to SQL Server, then refer below blog.
As per ERRORLOG, I could see that recovery of the master database was happening for a long time
2018-06-22 20:00:01.390 spid6s Recovery of database ‘master’ (1) is 3% complete (approximately 2001 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
2018-06-22 20:00:06.260 spid6s Recovery of database ‘master’ (1) is 3% complete (approximately 1995 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.
and it was clear that cluster service was shutting down SQL while the master was recovering. When we looked at physical files on the operating system, we could see master MDF and LDFs size was HUGE.
Now the only option I thought of was to stop SQL and Start via command prompt and bypass cluster. As expected, it took around 32 minutes to recover the master database (we were watching ERRORLOG continuously).
- Start SQL Service via command
NET START MSSQLSERVER
- Then we need to monitor the ERRORLOG file.
- Once recovery is complete, connect to SQL via SSMS.
- Run below queries.
DBCC loginfo GO SELECT log_reuse_wait_desc ,recovery_model_desc ,name FROM sys.databases WHERE database_id = 1
- Make sure recovery model is simple (in second output)
- Make sure we have less number of VLFs and all are having status as zero.
- Shrink the master database. Here is another blog on the same topic.
SQL SERVER – master Database Log File Grew Too Big
- Check for any abnormal big sized table.
- Stop SQL service via command prompt.
NET STOP MSSQLSERVER
Now, we started SQL in the cluster and it worked fine. SQL Server was failing to start because the master database was stuck in recovery. When the cluster service tried to connect to it to perform the health check, it failed to do so, assumed SQL was down and failed the SQL resource in the cluster. Later, we found that application is configured to use the master database for its temporary operations. I ask them to work with application vendor to modify their code and avoid using master.
Have you seen such situation earlier? How did you fix it? Please share via comments with others.
Reference: Pinal Dave (https://blog.sqlauthority.com)