Availability Group – Delay with Listener Connectivity After Failover

Posted in: Microsoft SQL Server, Technical Track

The beauty of working for multiple clients from different industries, is that you get exposed to a myriad of environment setup and configuration. Every company has its own standards for Network and Server configuration, as well different hardware vendors.  This introduces their own kinks and excitement to your everyday work – half of which you’ll likely not encounter if you are working in-house and using the same hardware.

The past week we encountered a rare and interesting issue with High Availability Group. The issue was two-fold, first it was initially not failing over automatically one one node though that was not as exciting as the second part – when it was able to failover correctly, our client was experiencing delays with the availability of the Listener Name outside its own sub-net  after failover, automatic or otherwise.  It is reachable within its own subnet but takes more than thirty minutes to be reachable outside of it even though the Failover happened smoothly and without error.

The first part was fairly straightforward. Checking on the cluster logs and event logs the automatic failover was throwing the error below when trying to failover on one of the nodes.

Cluster network name resource 'Listener_DNS_NAME' failed registration of one or more associated DNS name(s) for the following reason:
DNS operation refused.
.

Ensure that the network adapters associated with dependent IP address resources are configured with at least one accessible DNS server.

The error is as it says, the Computer object does not have the appropriate permissions on the Domain to register the DNS Name Resource for the Listener.  For the cluster  to perform this operation smoothly “Authenticated Users” should have read/write all permissions on the Computer Object for the cluster, its nodes and the Listener DNS Name. To do this, Log in to the Active Directory Server

  1. Open Active Directory Users and Computers.
  2. On the View menu, select Advanced Features.
  3. Right-click the object  and then click Properties.
  4. On the Security tab, click Advanced to view all of the permission entries that exist for the object.
  5. Verify that the Authenticated Users is in the list and has the permission to Read and Write All. Add the required permissions then Save the changes.

Now after doing that and testing the fail over, it is now encountering a different error, Kerberos-related one showed below.

The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server ComputerName$. The target name used was HTTP/ComputerName.Domain.com. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server. This error can also happen if the target service account password is different than what is configured on the Kerberos Key Distribution Center for that target service. Ensure that the service on the server and the KDC are both configured to use the same password. If the server name is not fully qualified, and the target domain (Domain.COM) is different from the client domain (Doamin.COM), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

Ah, the often over-looked SPN. This should be part of your installation process – setting the SPN. To keep the story short and so we can get to the exciting part, you can refer here for the detailed instructions on how to configure the SPN for SQL Server.  Aside from registering the SPN for each of the Nodes as specified on the MSDN Link, You’ll also need to register the SPN for the Listener, as always 1433 is the port being used by your SQL Server:

setspn -A MSSQLSvc/Listener_DNS_NAME.Domain.com:1433 DOMAIN/SQLServiceAccount

This will enable Kerberos for the client connection to the Availability Group Listener and address the errors we received above. After configuring the SPN for the servers Automatic Fail over is now running smoothly, or so we thought.

The client came back to us that it was taking some time for the application to connect to the Listener Name. Checking on the cluster logs and SQL Server’s end, everything appears to be in order. No errors being thrown and Server is accessible. Now it get’s interesting. Ping test within the Database sub net is successful but ping test outside of it was timing out. It was able to connect though after a substantial amount of time.  After a few more test and checking the behavior is the same on both nodes. It takes more than thirty minutes for the Name to be reachable outside of the database sub net.  After involving the Network Admin we found out that  a MAC Address conflict is happening.  That’s our “Aha!” moment.  Windows 2003 servers and later issues a Gratuitous ARP (GARP) requests during failover. There are some switches/devices that does not forward Gratuitous ARP by default.   This causes the devices on the other end of the switch to not have the correct MAC address associated to the Name.  This causes the failure. It often corrects itself when the router detects the failures and do a broadcast and gets the correct value. That’s why it becomes accessible after some time. This KB details the issue. To address this, changes must be done on the configuration of the switches, you’ll need to check with your hard ware vendor for this.

Now, that could be the end of our problems, but after enabling the switch to forward GARP, we found out that the server itself is not sending a GARP request.  This is a server configuration issue and requires some Registry changes. Open the Registry for the server and locate the key below:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

From there check if there is a key for ArpRetryCount, if there is make sure that the Value is not set to 0. The value could be between 0-3. After changing this and restarting the servers Everything works perfectly.

Last two issues are a bit rare, and something I wouldn’t have experienced if the client wasn’t using that particular hard ware and that particular standard configuration.

email

Author

Want to talk with an expert? Schedule a call with our team to get the conversation started.

About the Author

Lead Microsoft Data Platform Consultant
Pio Balistoy is a Microsoft MVP for Data Platform from Singapore. He has been a Database professional for more than 17 years. He brings his passion for SQL to the community by being one of the Community leads for both Philippine Data Platform Forums (formerly Philippine SQL Server User Group) and Singapore SQL PASS.

2 Comments. Leave new

Hi,
I have setup Alwayon in SQl 2016 in two server
SQl2016 edition i am using .
I have created listener
I am also below issue
I have below problem ,

If suppose my primary server SQL services down or windows server down then not detect automatic secondary server from website .

I am also changed webconfig file and add alwayon related .

SQL server change database mode but listner not changed server name .

Please do the needful.

Thanks,

Reply

I just want to say thanks for publishing this article. I had the same issue on a SQL 2016 cluster and was scratching my head for a long time trying to work out why my application was failing to connect to the AV Group Listener name but a server on my SQL subnet could. We passed this on to our networks team who made the relevant changes to our switch which resolved the issue. Happy Days!

Dan

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *