One of the main goals in architecting a Disaster Recovery (DR) solution is to make a DR failover transparent to the end users. Too often, users must reboot their desktops, clear their browser cache and the JInitiator JAR cache, and so on, even when we have made sure that the post-failover URL of the 11i instance is the same. Only if users can keep working after a failover of an 11i instance from the primary site to the DR site, without changing anything on their desktops, can we say that the goal has been achieved.
In most cases the culprits are twofold: forgetting the DNS setup for the hostnames of the middle tiers (or of the load balancer, if one is used), and the caching of DNS entries at different levels of the network. A quick look at the caching section of Wikipedia’s page on DNS gives some idea of what I’m talking about. With the default settings, the old IP address stays cached on the user’s desktop and in caching DNS servers in the network. As a result, the user’s desktop keeps trying to reach the old server, which is now offline.
The best fix for this kind of DNS side effect is to lower the TTL (Time To Live) of the DNS entry for the hostname from its default value. I prefer setting it to a value a little smaller than the time you take to fail over. That is, if it takes you 60 minutes to fail over from the primary to the secondary datacenter, set the TTL to 50 minutes.
Let’s take an example. Say our 11i instance has the URL https://apps.example.com:8000, with the primary site being windsor and the secondary ottawa. We have two load balancers, one at the primary site and one at the secondary, with hostnames lb.windsor.example.com and lb.ottawa.example.com respectively. If the DNS is set up with default values, it will look like this:
hostname                 TTL    Type   value
-------------------------------------------------------------
apps.example.com         86400  CNAME  lb.windsor.example.com
lb.windsor.example.com   86400  A      192.168.1.100
lb.ottawa.example.com    86400  A      192.168.2.100
Here apps.example.com is an alias for lb.windsor.example.com, and the TTL value is set to 86400 seconds, i.e., 24 hours. That means this record gets cached for 24 hours on the user’s desktop and on any caching DNS servers the client uses. So at the time of failover, even though we change the DNS record of apps.example.com to point to the ottawa load balancer instead of windsor, the user’s browser will still try to reach the primary site load balancer, because with the TTL set to 24 hours the old record can stay cached on their desktop for up to the next 24 hours.
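To make the caching behaviour concrete, here is a minimal sketch of a TTL-based cache. The CachingResolver class is a hypothetical toy model of a desktop or caching DNS server, not a real resolver API, and the timestamps are plain seconds chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    value: str
    ttl: int        # seconds the answer may be reused
    cached_at: int  # time (in seconds) at which the answer was fetched


class CachingResolver:
    """Toy stand-in for a desktop or caching DNS server (hypothetical)."""

    def __init__(self, zone):
        self.zone = zone  # authoritative data: name -> (ttl, value)
        self.cache = {}

    def resolve(self, name, now):
        rec = self.cache.get(name)
        # Ask the authoritative zone only when nothing is cached
        # or the cached answer has outlived its TTL.
        if rec is None or now - rec.cached_at >= rec.ttl:
            ttl, value = self.zone[name]
            rec = CachedRecord(value, ttl, now)
            self.cache[name] = rec
        return rec.value


zone = {"apps.example.com": (86400, "lb.windsor.example.com")}
desktop = CachingResolver(zone)

before = desktop.resolve("apps.example.com", now=0)  # cached for 24 hours
# Failover: the authoritative record now points at the ottawa load balancer.
zone["apps.example.com"] = (86400, "lb.ottawa.example.com")
one_hour_later = desktop.resolve("apps.example.com", now=3600)
next_day = desktop.resolve("apps.example.com", now=86400)

print(before, one_hour_later, next_day)
```

In this model the desktop keeps returning the windsor load balancer an hour after the change and only picks up ottawa once the 24-hour TTL has run out.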
As I suggested earlier, if we set the TTL of apps.example.com to 50 minutes (3000 seconds) and make the DNS change the first step of the failover procedure, then by the time we finish (which is supposed to take 60 minutes), the old DNS records in the user’s desktop cache and in the caching DNS servers will have expired, and they will start seeing the new alias for apps.example.com:
hostname                 TTL    Type   value
-------------------------------------------------------------
apps.example.com         3000   CNAME  lb.ottawa.example.com
lb.windsor.example.com   86400  A      192.168.1.100
lb.ottawa.example.com    86400  A      192.168.2.100
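The timing argument can be checked with a little arithmetic. The sketch below uses a hypothetical helper, caches_expire_before_cutover, and assumes two things: the DNS change is the very first step of the failover, and the lowered TTL was already in place long enough before the failover that no cache still holds a copy with the old 24-hour TTL.

```python
def caches_expire_before_cutover(ttl_seconds, failover_seconds):
    """Return True if every cached copy of the old record has expired
    by the time the failover procedure finishes.

    Assumes the DNS change happens at time 0 (first failover step). A client
    may have cached the old record just before the change, so the old
    answer can survive for one full TTL after it.
    """
    latest_stale_expiry = ttl_seconds
    return latest_stale_expiry <= failover_seconds


failover_seconds = 60 * 60  # the 60-minute failover from the example
for ttl in (86400, 3000):
    print(ttl, caches_expire_before_cutover(ttl, failover_seconds))
```

With the default TTL of 86400 seconds the check fails, while the 3000-second TTL leaves a 10-minute margin before users are expected back on the application.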
Some of you might already be thinking, why not set it to an even lower value, like 5 minutes? The main problem with such a low value is that it increases the load on the DNS server. And if you have a single DNS server and very low TTL values, any kind of outage on that DNS server will affect your users almost immediately, as their desktops will be making DNS lookups much more frequently than before. So if you use low TTL settings, make sure you have at least two DNS servers at two different locations.
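As a rough illustration of the load trade-off, the sketch below estimates an upper bound on lookups per day for a single name. The helper max_lookups_per_day and the figure of 500 desktops are assumptions for illustration only; the bound also assumes every client uses the name continuously and queries again as soon as its cached copy expires, with no shared caching server in between.

```python
def max_lookups_per_day(ttl_seconds, clients=1):
    """Upper bound on daily lookups for one name.

    Each client can re-query at most once per TTL period, so the bound is
    (seconds in a day / TTL) per client.
    """
    seconds_per_day = 24 * 60 * 60
    return clients * seconds_per_day / ttl_seconds


# Compare the default TTL, the 50-minute TTL, and a 5-minute TTL
# for a hypothetical population of 500 desktops.
for ttl in (86400, 3000, 300):
    print(ttl, max_lookups_per_day(ttl, clients=500))
```

Under these assumptions, going from 50 minutes to 5 minutes multiplies the worst-case query volume by ten, and it shortens the time users can ride out a DNS outage on cached answers to at most 5 minutes.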
Please feel free to post your experiences related to DNS in the comments section. Any comments or suggestions are welcome!