This is one of the common errors I have seen while working with customers who are using SQL Server AlwaysOn availability groups. Once this error comes, the resource in the cluster goes to failed state and in SQL Server Management Studio, we should be an availability group in resolving state. Resolving state essentially means that the role of availability group is neither primary nor secondary.
Here is the snippet from ERRORLOG when the lease expires. I have tried to explain the meaning of each line.
2017-02-27 19:31:07.34 Server Error: 19407, Severity: 16, State: 1.
2017-02-27 19:31:07.34 Server The lease between availability group ‘PRODAG’ and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
Above message means the lease between windows cluster and SQL Server.
2017-02-27 19:31:07.34 Server AlwaysOn: The local replica of availability group ‘PRODAG’ is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.
Due to lease renewal failure, the AG resource in the cluster would go to failed state.
2017-02-27 19:31:07.34 Server The state of the local availability replica in availability group ‘PRODAG’ has changed from ‘PRIMARY_NORMAL’ to ‘RESOLVING_NORMAL’. The state changed because the lease between the local availability replica and Windows Server Failover Clustering (WSFC) has expired. For more information, see the SQL Server error log, Windows Server Failover Clustering (WSFC) management console, or WSFC log.
Since AG is failed, the AG state in SQL Server would change from PRIMARY to RESOLVING_NORMAL. At the same time, on secondary, we should see state from SECONDARY_NORMAL to RESOLVING_NORMAL
Now the next challenge would be to find WHY lease was expired.
Based on my research on the internet, I found that in most of the cases, the lease gets expired due to shortage of resources on the machine. You can think of this as a “momentarily hang” of windows operations. Generally, we should look at the cause of slowness. The client with whom I worked, I could see tons of IO related messages like below.
2017-02-27 19:23:26.09 spid35s SQL Server has encountered 244 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [M:\MSSQL\SAM_PRODUCTION.MDF] in database id 21. The OS file handle is 0x0000000000001B44. The offset of the latest long I/O is: 0x00004fcfb76000
I have explained above issue in below blog SQL SERVER – WARNING – SQL Server Has Encountered N Occurrence(s) of I/O Requests Taking Longer Than 15 Seconds
We also saw below, which again points to IO slowness.
2017-02-27 19:31:05.09 spid18s average writes per second: 1225.28 writes/sec
average throughput: 86.19 MB/sec, I/O saturation: 149511, context switches 190629
2017-02-27 19:31:05.09 spid18s last target outstanding: 278, avgWriteLatency 38
Till you find the actual cause, you can increase LeaseTimeout value so that the AG is remaining healthy
But remember that we have not fixed the issue, but applied band-aid.
Here are few more things to do.
- Limit Max Server Memory of SQL Server, if not capped.
- Consult your storage team for storage performance issues, since we see many IO stalled messages.
- Enable lock pages in memory. This will prevent work set trimming and prevent it from being paged out. Please refer to the below link https://msdn.microsoft.com/en-IN/library/ms190730.aspx
Have you found something more than above?
Reference: Pinal Dave (http://blog.SQLAuthority.com)