We recently ran into an interesting issue on a three-node SQL Server Availability Group (AG) cluster. A cluster failure impacted the availability group for a few minutes, during which the replicas went into the RESOLVING state. Once the cluster was back online, the replicas returned to their normal primary and secondary roles, but multiple databases were still not synchronizing. In addition, the databases on the primary were not accessible.
We observed the following symptoms:
- AG databases are not synchronizing.
- Databases on the primary replica are not available.
- Any attempt to change the availability group on the primary replica (for example, suspending the AG, removing an AG database, or changing to manual failover mode) ends up suspended with wait type HADR_AR_CRITICAL_SECTION_ENTRY.
- There may be messages in the SQL Server error log of the primary replica showing processes being killed, like the one below. These processes are stuck in the killed/rollback state and block other background processes.
Process ID 77 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 54, object_id = 0.
- In our case, there was no issue on the secondary replicas. We could suspend the AG on the secondary databases, and there were no blocked or killed processes. The only problem was that the secondary AG databases were not synchronizing because the primary databases were not available.
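The symptoms above can be checked on the primary replica with a query along the following lines. This is a minimal sketch, not the exact query we ran: it assumes VIEW SERVER STATE permission and relies only on the standard sys.dm_exec_requests DMV, where sessions that are rolling back after a kill normally show the command KILLED/ROLLBACK.

```sql
-- Sketch: list requests waiting on the AG critical section
-- or stuck rolling back after being killed.
SELECT r.session_id,
       r.status,
       r.command,
       r.wait_type,
       r.wait_time            AS wait_time_ms,
       r.blocking_session_id,
       DB_NAME(r.database_id) AS database_name,
       r.percent_complete     -- rollback progress, when reported
FROM sys.dm_exec_requests AS r
WHERE r.wait_type = N'HADR_AR_CRITICAL_SECTION_ENTRY'
   OR r.command   = N'KILLED/ROLLBACK'
ORDER BY r.wait_time DESC;
```

If the output shows requests whose rollback does not progress between runs and other requests queued behind them, that is consistent with the behavior described here.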
The only solution we found was to restart the SQL Server instance hosting the primary replica. However, even a normal restart of the SQL Server service from SQL Server Configuration Manager hung at “stopping service”. We had to force SQL Server to stop with the T-SQL command “SHUTDOWN WITH NOWAIT”. After we brought SQL Server back online, the databases were available again and the AG was synchronized and healthy.
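For reference, the forced stop and a follow-up health check might look like the sketch below. SHUTDOWN requires membership in the sysadmin or serveradmin fixed server role, and WITH NOWAIT stops the instance without checkpointing the databases, so recovery runs on the next startup. The health query is an assumption about how to verify the result (it is not taken from the original incident) and uses only the standard AG catalog views and DMVs. The two statements are run separately: the first ends the connection.

```sql
-- Step 1 (primary replica, last resort): stop the instance immediately,
-- without performing checkpoints.
SHUTDOWN WITH NOWAIT;

-- Step 2 (new connection, after the service has been started again):
-- confirm that every AG database is synchronizing and healthy.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id)        AS database_name,
       drs.is_local,
       drs.synchronization_state_desc,  -- e.g. SYNCHRONIZED / SYNCHRONIZING
       drs.synchronization_health_desc, -- e.g. HEALTHY
       drs.database_state_desc          -- populated for local databases only
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id
ORDER BY ar.replica_server_name, database_name;
```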