My client has set up three nodes availability group with one primary, one synchronous secondary and one asynchronous secondary. They noticed that redo queue size was increasing continuously on secondary replica (both of them, sync and async)
I started digging using DMV and found that redo process was working but it was waiting for the LATCH on APPEND_ONLY_STORAGE_FIRST_ALLOC. Here is the query used
SELECT * FROM sys.sysprocesses WHERE dbid = db_id('ProdDB)
Then I looked into SQL Server ERRORLOG and found below message logged continuously
The version store is full. New version(s) could not be added. A transaction that needs to access the version store may be rolled back. Please refer to BOL on how to configure tempdb for versioning.
Based on my understanding of secondary version store, it would come into picture when secondary is readable. All queries that run against the secondary databases are automatically mapped to snapshot isolation transaction level, even when other transaction isolation levels are explicitly set.
I found that they have a serious space issue on the drive where TempDB is located. The TempDB was not able to grow and hence new versions were not getting generated.
Due to readable secondary, SQL was using version store so the quick solution, in this case, was to disable read from secondary replicas. We used below command to disallow reads from secondary.
USE [master] GO ALTER AVAILABILITY GROUP [PROD_AG] MODIFY REPLICA ON N'Godzilla_2' WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = NO)) GO
We did this for both replicas. As soon as we disabled the reads, version store vanished, and redo picked up the speed. Now since TempDB was not in use, we were able to shrink it as well to avoid space issues on the drive which was having TempDB database.
Reference: Pinal Dave (https://blog.sqlauthority.com)