One of my earlier clients to whom I helped in configuration Always On Availability Group came back to me with strange behavior. In this blog, we would discuss error Version store is full. New version(s) could not be added. In this situation, I also found that the redo queue was increasing continuously on the secondary replica.
My client has set up three nodes availability group with one primary, one synchronous secondary and one asynchronous secondary. They noticed that redo queue size was increasing continuously on secondary replica (both of them, sync and async)
I started digging using DMV and found that redo process was working but it was waiting for the LATCH on APPEND_ONLY_STORAGE_FIRST_ALLOC. Here is the query used
SELECT * FROM sys.sysprocesses WHERE dbid = db_id('ProdDB)
Then I looked into SQL Server ERRORLOG and found below message logged continuously
The version store is full. New version(s) could not be added. A transaction that needs to access the version store may be rolled back. Please refer to BOL on how to configure tempdb for versioning.
Based on my understanding of secondary version store, it would come into picture when secondary is readable. All queries that run against the secondary databases are automatically mapped to snapshot isolation transaction level, even when other transaction isolation levels are explicitly set.
Active Secondaries: Readable Secondary Replicas (Always On Availability Groups)
I found that they have a serious space issue on the drive where TempDB is located. The TempDB was not able to grow and hence new versions were not getting generated.
Due to readable secondary, SQL was using version store so the quick solution, in this case, was to disable read from secondary replicas. We used below command to disallow reads from secondary.
USE [master] GO ALTER AVAILABILITY GROUP [PROD_AG] MODIFY REPLICA ON N'Godzilla_2' WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = NO)) GO
We did this for both replicas. As soon as we disabled the reads, version store vanished, and redo picked up the speed. Now since TempDB was not in use, we were able to shrink it as well to avoid space issues on the drive which was having TempDB database.
Reference: Pinal Dave (https://blog.sqlauthority.com)
I am having this same issue your user was having but my tempdb has 162gb of space and is only 36gb on 6 6gb files.
I have very fast sendrates but then the redo rate is literally only 25k on this one server but I have 4 other nodes where servers have 120k-250k redo rates. Which causes my 1 server to fall behind on redo rate and I don’t know why :'(