SQL SERVER – Unable to Start SQL Resource in Cluster – HUGE Master Database!

One of my clients contacted me and informed that SQL in cluster failed over from Node 01 to Node 02. It failed to come online on Node 02, failed back to Node 01, and it failed to come online there either. They contacted me to know why they are not able to start SQL Server in the cluster. Let us learn how to start the SQL Resource in Cluster when there is a huge master database.

Their expectation from me was to know why this is happening and what should they do to resolve it so that you’re able to bring your SQL Server online.

To fix such issue, I always start from ERRORLOG file. If you are new to SQL Server, then refer below blog.

SQL SERVER – Where is ERRORLOG? Various Ways to Find ERRORLOG Location

As per ERRORLOG, I could see that recovery of the master database was happening for a long time

Solarwinds

2018-06-22 20:00:01.390 spid6s       Recovery of database ‘master’ (1) is 3% complete (approximately 2001 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

2018-06-22 20:00:06.260 spid6s       Recovery of database ‘master’ (1) is 3% complete (approximately 1995 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

and it was clear that cluster service was shutting down SQL while the master was recovering. When we looked at physical files on the operating system, we could see master MDF and LDFs size was HUGE.

SQL SERVER - Unable to Start SQL Resource in Cluster - HUGE Master Database! huge-mast-01-800x177

SOLUTION/WORKAROUND

Now the only option I thought of was to stop SQL and Start via command prompt and bypass cluster. As expected, it took around 32 minutes to recover the master database (we were watching ERRORLOG continuously).

  1. Start SQL Service via command

NET START MSSQLSERVER

  1. Then we need to monitor the ERRORLOG file.
  2. Once recovery is complete, connect to SQL via SSMS.
  3. Run below queries.
DBCC loginfo
GO
SELECT log_reuse_wait_desc
,recovery_model_desc
,name
FROM sys.databases
WHERE database_id = 1
  1. Make sure recovery model is simple (in second output)
  2. Make sure we have less number of VLFs and all are having status as zero.
  3. Shrink the master database. Here is another blog on the same topic.
    SQL SERVER – master Database Log File Grew Too Big
  4. Check for any abnormal big sized table.
  5. Stop SQL service via command prompt.

NET STOP MSSQLSERVER

Now, we started SQL in the cluster and it worked fine. SQL Server was failing to start because the master database was stuck in recovery. When the cluster service tried to connect to it to perform the health check, it failed to do so, assumed SQL was down and failed the SQL resource in the cluster. Later, we found that application is configured to use the master database for its temporary operations. I ask them to work with application vendor to modify their code and avoid using master.

Have you seen such situation earlier? How did you fix it? Please share via comments with others.

Reference: Pinal Dave (https://blog.sqlauthority.com)

Solarwinds
, , , ,
Previous Post
SQL SERVER – Alternate Rows Color in Table Background in HTML
Next Post
SQL SERVER – Execution Plan Ignores Tabs, Spaces and Comments

Related Posts

3 Comments. Leave new

  • I have a tbl LeaveRequest in which all employee leave requests are captured. I need to fetch the records of a employee who had availed Bereavement leave in the last one year

    Reply
  • Wow that is crazy.

    Reply
  • you can start sql server the command” NET START MSSQLSERVER when the cluster is failed?

    Reply

Leave a Reply

Menu