Failover Strategies - Experiment

There was just too much confusion about what various browsers did and did not do, so this experimental setup was used to test the effect of multiple A RRs on browser (web) failover strategies.

If anyone wants to repeat this exercise using different configurations I would welcome email of the test results and will publish them with appropriate kudos and credits. This particular configuration has no distinguishing characteristics but did have the overriding merit of being quick to put together.

Experimental Software:

PC

Windows Server 2000, with gazillions of patches (basically the latest updates as of Dec 2006). Standard MS TCP/IP stack. The registry variable HKLM\System\CurrentControlSet\services\DnsCache\Parameters\MaxCacheEntryTtlLimit was confirmed to be the default value of 24 hours (86400 seconds) (see MS KB article 245437; for XP and 2003 see KB article 318803).

MSIE Browser

Standard MSIE 6.0.2800.1106 with gazillions of patches as of Dec 2006. No toolbars or BHOs were installed. The registry variables HKLM\Software\Microsoft\Windows\CurrentVersion\Internet Settings\DnsCacheTimeout and ServerInfoTimeout were not set (which means they default to 30 minutes; see MS KB article 263558).

Firefox Browser

Firefox 2.0.0.1. The prefs value network.dnsCacheExpiration was not set (and therefore defaults to 60 seconds - see this article).
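The cache defaults quoted above can be put side by side. This is a minimal sketch and not part of the original experiment: the constant and function names are hypothetical, and the capping rule in resolver_hold_time is an assumed model of MaxCacheEntryTtlLimit rather than documented resolver internals.

```python
# Cache lifetimes in seconds, taken from the defaults quoted in the text.
RESOLVER_TTL_CAP = 86400   # MaxCacheEntryTtlLimit default (24 hours)
MSIE_CACHE = 30 * 60       # DnsCacheTimeout default (30 minutes)
FIREFOX_CACHE = 60         # network.dnsCacheExpiration default (60 seconds)


def resolver_hold_time(rr_ttl, cap=RESOLVER_TTL_CAP):
    """Assumed model: the local resolver keeps an RR for its TTL, capped at the limit."""
    return min(rr_ttl, cap)


# Compare a 60 second TTL with a 12 hour TTL.
for ttl in (60, 12 * 3600):
    print("RR TTL", ttl,
          "resolver", resolver_hold_time(ttl),
          "MSIE", MSIE_CACHE,
          "Firefox", FIREFOX_CACHE)
```

Under these assumptions a 60 second TTL leaves the local resolver and Firefox expiring names at about the same rate, while MSIE would hold a name far longer.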

DNS Configuration

The DNS configuration was determined by the DSL router (via DHCP) and pointed to the DSL router, i.e. it used a proxy service to a caching nameserver in the service provider's network.

Network Sniffer

ethereal 0.99.0 running on the Windows 2000 machine was used to monitor all activity. This version was perfectly acceptable for the test, though (before anyone makes a comment) had it been upgraded we would have used wireshark.

DNS Servers

DNS Servers were BIND 9.3.2 running on FreeBSD 5.4.

Web Servers

Web Servers were Apache 1.3.34_4 and 1.3.33_1, both running on FreeBSD 5.4.

Experiment Configuration

Communications Configuration: One of the web servers was located locally; the other was accessed via a DSL modem (SMC 3100) at a remote location.

DNS Servers: Both the Master and Slave named.conf files were modified by adding the following statement in the general options clause:

 rrset-order {order fixed;};

This ensured that the order of the returned RRs was consistent, allowing control of the experiment, though it was noted that the log indicated that fixed was not fully implemented. Further, if the RRset is being read via a DNS caching server, that nameserver may change the order of the returned RRs.

Zone File: The zone file (the ubiquitous example.com) was modified to include the following entries:

multiple  60 IN A 192.168.2.1 ; first server
          60 IN A 192.168.2.2 ; second server

This would allow the various web services to be addressed via the name multiple.example.com to differentiate usage. The TTL was deliberately set to a low value to ensure the various caches were flushed frequently and we could observe the new DNS queries using ethereal. As it turned out, due to the bizarre effects of the DSL modem configuration noted below, this was a poor strategy.

Note: after restarting BIND issue a dig to multiple.example.com to confirm the order in which it supplies A RRs.
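The order check in the note above can be scripted. The following is a minimal sketch and not part of the original experiment: it reads saved dig output and returns the A RR addresses in the order they were received. The function name a_rr_order is hypothetical and the sample text is illustrative, not captured output.

```python
def a_rr_order(dig_output, name="multiple.example.com."):
    """Return the A RR addresses for name, in the order dig printed them."""
    addresses = []
    for line in dig_output.splitlines():
        line = line.strip()
        # Skip blank lines and dig header/comment lines (they start with ';').
        if not line or line.startswith(";"):
            continue
        fields = line.split()
        # An answer line has the shape: owner TTL class type rdata
        if (len(fields) == 5
                and fields[0].lower() == name.lower()
                and fields[2] == "IN"
                and fields[3] == "A"):
            addresses.append(fields[4])
    return addresses


# Illustrative text in the shape of a dig ANSWER SECTION (an assumption).
sample = """
;; ANSWER SECTION:
multiple.example.com.   60  IN  A   192.168.2.1
multiple.example.com.   60  IN  A   192.168.2.2
"""
print(a_rr_order(sample))
```

Comparing the returned list across several queries shows whether the nameserver, or any cache in the path, is reordering the RRset.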

Web Servers: The following line was added to two <VirtualHost> definitions on two web servers whose location is defined by the IPs in the zone file. Different web sites were modified so that we could have early visual confirmation of the rollover effect, i.e. we modified the example.com web site on one server and example.net on the other so the visual content was different. This is not necessary but gave a quick visual indication (and a cheap thrill) on failover.

 ServerAlias multiple.example.com

The various web servers and name servers were re-started and the experiment was ready to begin.

Note: The first problem noted when testing the experimental setup was that a rogue DNS cache somewhere would frequently return an incorrect DNS response. It was not possible to locate the source of this cache since it lay in or behind the DSL configuration, but some proxy software seemed to keep a short (~15 minutes) DNS cache. If a response was returned from this cache (which seemed to be present to minimise bandwidth usage or speed response times) the response for multiple.example.com was marked authoritative (!), contained only one A RR (fairly random) and had a TTL of 1 day! The response should have been non-authoritative (generally), returned two A RRs and had a TTL of 60 seconds. To minimise problems the multiple.example.com TTLs were changed to 12 hours, but clearing the local resolver cache still resulted in the rogue DNS response. The only solution was to wait for at least 15 minutes between successive tests.
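The three symptoms of the rogue response described above can be turned into a simple check. This is a minimal sketch under stated assumptions: the function name looks_rogue and its parameters are hypothetical, and the values would be read by hand from the sniffer decode of each DNS response.

```python
def looks_rogue(authoritative, a_records, ttl, expected_count=2, zone_ttl=60):
    """Return the reasons a response differs from what the zone should give.

    An empty list means none of the three symptoms was seen.
    """
    reasons = []
    # A response relayed through a caching server is generally non-authoritative.
    if authoritative:
        reasons.append("marked authoritative although it came via a cache")
    # The zone defines two A RRs for the name.
    if len(a_records) != expected_count:
        reasons.append("expected %d A RRs, got %d" % (expected_count, len(a_records)))
    # A cached RR should not carry a TTL larger than the one in the zone file.
    if ttl > zone_ttl:
        reasons.append("TTL %d exceeds zone TTL %d" % (ttl, zone_ttl))
    return reasons


# The rogue response as described: authoritative, one A RR, TTL of 1 day.
for reason in looks_rogue(True, ["192.168.2.2"], 86400):
    print(reason)
```

After the TTLs were raised to 12 hours the same check would be called with zone_ttl=43200.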

Experiment Methodology

All Windows 2000 based:

  1. Zap the local resolver's cache. From a command line issue ipconfig/flushdns. Confirm it is empty by issuing ipconfig/displaydns [> c:\temp\dns-cache-1.txt] (the only remaining items should be the contents of the hosts file, which cannot be cleared).

  2. Load ethereal/wireshark and keep running throughout the session.

  3. Load browser (MSIE) and zap all its caches using standard commands (tools->Internet options->General tab-> delete temporary files, cookies and history).

  4. With both web sites operable, issue browser command http://multiple.example.com.

  5. Run through a few pages for > 60 seconds (to observe the DNS behaviour). The local cache should time out after 60 seconds, so if pages are being accessed after this time and there is no DNS query in the ethereal log, the browser's own cache must be in use.

  6. Make the current web server inoperable (easier if you use different sites that resolve to the name multiple.example.com). In the experiment the local site was always the initial web server and the LAN cable was removed to make it inoperable.

  7. Issue a page request and observe the behavior.

  8. When complete terminate the ethereal/wireshark session and save!

  9. Save the local cache for examination later if required using ipconfig/displaydns > c:\some-file.name

  10. Wait some period of time (see note above) and repeat for the second browser.
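The inference in step 5 can be written down as a small check over the capture. This is a minimal sketch, not a tool used in the experiment: the function name and the timestamps (seconds from the start of the capture) are hypothetical, and it assumes the local resolver drops the RR once its TTL has passed.

```python
def requests_served_from_browser_cache(dns_query_times, page_request_times, ttl=60):
    """Return the page request times that had no DNS query within the last ttl seconds.

    For those requests the resolver's copy had expired and no new query was
    seen, so the address is assumed to have come from the browser's own cache.
    """
    stale = []
    for t in page_request_times:
        earlier = [q for q in dns_query_times if q <= t]
        if earlier and t - max(earlier) > ttl:
            stale.append(t)
    return stale


# Hypothetical capture: one DNS query at t=0, pages fetched over two minutes.
print(requests_served_from_browser_cache([0], [1, 30, 75, 120]))
```

An empty result means every page request was preceded by a DNS query recent enough to explain it, so nothing can be said about the browser's cache.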

Experimental Results

MSIE: Swapped to the second site in 1 minute 37 seconds.

Firefox: Swapped to the second site in 1 minute 32 seconds.

Conclusions: Multiple A RRs work with both major browser families and rollover is relatively quick and painless (the time difference between rollovers, since the experiment was not repeated multiple times, is regarded as insignificant). MSIE did not cause more DNS requests to be issued during the test (which lasted 7 - 10 minutes for both browsers) whereas multiple DNS requests were issued during the Firefox run, making it vulnerable to the rogue cache response. Indeed, to get reliable results (due to the noted cache problem) the TTLs for multiple.example.com were changed to 12 hours.

After the failover both browsers stayed with the new location for all subsequent accesses. It was only possible to return to the original if it was made operable and the failover site made inoperable. That is, both browsers will stay with any operable site until there is a good reason to change.
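The "stay with any operable site" behaviour can be modelled in a few lines. This is a minimal sketch of the observed behaviour only; it is not how MSIE or Firefox are implemented, and the class StickyClient and helper tcp_connect_ok are hypothetical names.

```python
import socket


def tcp_connect_ok(address, port=80, timeout=5.0):
    """Return True if a TCP connection to address:port succeeds within timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


class StickyClient:
    """Stays with the current address until it fails, then tries the others in order."""

    def __init__(self, addresses, is_up=tcp_connect_ok):
        self.addresses = list(addresses)  # A RRs in the order received
        self.is_up = is_up                # reachability test, replaceable for simulation
        self.current = 0                  # index of the address in use

    def pick(self):
        """Return the address to use now, or None if no address is reachable."""
        n = len(self.addresses)
        for step in range(n):
            index = (self.current + step) % n
            if self.is_up(self.addresses[index]):
                self.current = index      # remember the working address
                return self.addresses[index]
        return None
```

Passing a simulated is_up function (for example one backed by a set of "up" addresses) allows the model to be exercised without touching the network: the client moves to the second address only when the first fails, and does not move back when the first recovers.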

If only a single RR was defined, Firefox took around 3 minutes to finally fail and MSIE around 2 minutes. In both cases this was longer than with multiple RRs.


Pro DNS and BIND by Ron Aitchison
