A pair of clustered servers behind a network dispatcher in your building is easy to manage. What about failover machines that aren’t located in the same city?
I may be a day late and a dollar short, as they say, but I’ve spent a good bit of the last couple of days implementing a new server at a data center in Dallas. The new server will act as a hot standby. While Domino has built-in cluster replication to keep itself up to date, there are many other aspects of a complex system to think about. Here are some of the things I’ve done to manage a true failover environment.
IP Address Failover – There are a number of ways to do this, of course, but none are perfect.
You can spend a great deal of money and use something like BGP or another public network failover solution so that your actual IP addresses get re-routed to the other data center. That’s expensive but very effective.
A common approach is to use “Round Robin DNS”, but that has its own issues. It isn’t very predictable, because nothing requires a DNS client to try the top address in the list first, or to try the second one at all. Some client software even sorts the returned list of entries by the number of network hops away and picks the closest.
You can use an address dispatching tool like Domino’s ICM. ICM is a good tool – and there are other products like it that are more advanced – but of course you have to worry about the ICM itself staying available and that gets into further redundancy issues.
Finally, there’s the simple expedient of changing the IP address associated with the DNS entry itself and using a very low TTL value so the client side has to re-request the address frequently. That’s effective, but it requires intervention to make the changeover happen. It also means that most of the time you’re getting excessive DNS request hits and a slightly longer initial connection time, because you’ve effectively disabled any caching that DNS can do for you.
The best solution, of course, is to build failover directly into the application side. This won’t work so well for web servers, but for proprietary client software like Lotus Notes, it works very well.
For now, I’m using the DNS change method with a very low TTL. I’m working first on a scripted DNS updater that will make the change automatically when the primary server fails. The longer-term solution is to build secondary connection addresses directly into the client-side software for Second Signal.
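As a rough illustration of application-side failover, here is a minimal Python sketch. It assumes the client carries an ordered list of server endpoints and simply uses the first one that accepts a TCP connection. The function name and the idea of passing a list of (host, port) pairs are my own assumptions for the example, not how Lotus Notes or any particular client does it.

```python
import socket


def connect_with_failover(endpoints, timeout=3.0):
    """Try each (host, port) pair in order; return the first connected socket.

    `endpoints` is a hypothetical, client-configured list with the primary
    server first and the failover server(s) after it.
    """
    last_error = None
    for host, port in endpoints:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:
            # Remember why this endpoint failed, then move on to the next one.
            last_error = exc
    raise ConnectionError(f"no endpoint reachable: {last_error}")
```

The trade-off is that every client has to know about the standby ahead of time, which is why this suits software you control and not an ordinary web browser.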
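One way a scripted DNS updater could look is sketched below in Python. Note that it swaps in a different technique from editing the zone file by hand: it assumes the zone accepts dynamic updates and drives BIND's `nsupdate` tool with a TSIG key. The server name, zone, record name, addresses, and key file path are all placeholders for illustration.

```python
import ipaddress
import subprocess


def build_nsupdate_script(server, zone, name, new_ip, ttl=60):
    """Return nsupdate commands that repoint one A record at new_ip."""
    ipaddress.IPv4Address(new_ip)  # raises ValueError on a malformed address
    lines = [
        f"server {server}",
        f"zone {zone}",
        f"update delete {name} A",                # drop the old address
        f"update add {name} {ttl} A {new_ip}",    # add the failover address
        "send",
    ]
    return "\n".join(lines) + "\n"


def apply_update(script, key_file):
    """Feed the script to nsupdate, authenticating with a TSIG key file."""
    subprocess.run(["nsupdate", "-k", key_file],
                   input=script, text=True, check=True)


if __name__ == "__main__":
    # Placeholder values only; print the script without sending it.
    print(build_nsupdate_script("ns1.example.com", "example.com",
                                "www.example.com.", "198.51.100.20"))
```

Keeping the TTL in the update as low as the one already published matters here, since clients will hold on to the old address for as long as the previous TTL allows.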
Data Synchronization of Non-Domino (On Disk) Data
Keeping data in Domino’s NSF files up to date is easy with cluster replication. Keeping on-disk files such as those in your …/domino/html/ directory up to date is much trickier. I’m using a free tool called Unison to do this. Unison works a lot like Domino replication. It builds metadata about files in its own internal database, so that file comparisons are done with hash algorithms and updates are well managed in both directions. In Linux, I’ve built a script to handle this task. The script runs periodically as a cron job, and keeps the Domino html directory, a bunch of audio directories, and many configuration settings files in sync on the failover.
In addition to the data itself – html data, sound files, and so on – I use Unison to keep many configuration files up to date on both machines. Even though the failover machine is set up as a slave DNS, I keep the full set of zone definitions fully up to date on it. If I do have a long-term outage, I can just change the settings in named.conf to make the failover a master, and I haven’t lost any recent updates. This method also keeps my Asterisk configurations and scripts in sync automatically.
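For a sense of what such a sync job might look like, here is a small Python sketch (not the actual script). It assumes Unison is installed on both machines and reachable over SSH, and it uses Unison's `-batch` flag so the job never stops to ask questions when run from cron. The roots and sub-paths shown are hypothetical placeholders.

```python
import subprocess


def build_unison_command(local_root, remote_root, paths=()):
    """Build a non-interactive Unison command line.

    -batch makes Unison ask no questions, so it is safe under cron.
    Each -path limits the sync to one sub-path relative to the roots.
    """
    cmd = ["unison", local_root, remote_root, "-batch"]
    for path in paths:
        cmd += ["-path", path]
    return cmd


def run_sync(local_root, remote_root, paths=()):
    """Run the sync and return Unison's exit status (0 means all clean)."""
    return subprocess.run(build_unison_command(local_root, remote_root, paths)).returncode


if __name__ == "__main__":
    # Placeholder roots: a local directory and the same directory on the standby.
    print(build_unison_command("/srv/example",
                               "ssh://standby.example.com//srv/example",
                               ["html", "sounds"]))
```

In batch mode Unison skips files that changed on both sides instead of guessing, so a nonzero exit status is worth logging and checking by hand.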
Machine Specific Configurations & Scripts
Some of the scripting I do – especially within Asterisk AGI scripts – has to make http or web services calls to the local Domino server for data. Since I want to keep the scripts in sync, I don’t want to hardcode the local server’s IP address. There are a number of ways to handle this. Traditional script development can grab the server’s HOSTNAME environment variable, but since I tie specific IP addresses on a multi-homed machine to different Domino partitions and sites, that’s not good enough. I can set a local environment variable on the shell for each machine, or I can use the /etc/hosts file. My scripted URLs can then use a generic name for the server portion of the URL.
Some configurations are machine specific, or differ depending on whether the machine is acting as the primary or the failover. For example, in Asterisk I use two different carriers for local telephone numbers. The better of the two will connect to my machine by its IP address, and will automatically fail over to a second IP address if the first doesn’t connect. That’s easy. The other provider requires that the server “register” and update its registration every 60 seconds, saying “hey, I’m over here”. If two machines are trying to register, an incoming call will go to whichever machine registered last. To manage this, I use an “Include” in the configuration file for those registrations in Asterisk (in this case, sip.conf and iax.conf). This way, the full configuration file can be synchronized, but the part with the registration configuration can be stored in a non-synchronized directory and thus be different on each machine.
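As a small sketch of the generic-name idea in Python: the script below builds its URL from a name that each machine resolves differently. The environment variable `LOCAL_DOMINO_HOST`, the fallback name `domino-local` (which would be mapped in each machine's /etc/hosts), and the database path are all made up for the example.

```python
import os
from urllib.parse import urlencode, urlunsplit


def local_domino_url(path, query=None, env=os.environ):
    """Build a URL for the local Domino server without hardcoding its IP.

    The host comes from a per-machine environment variable when set,
    otherwise from a generic name that /etc/hosts maps on each machine.
    Both names are hypothetical.
    """
    host = env.get("LOCAL_DOMINO_HOST", "domino-local")
    return urlunsplit(("http", host, path, urlencode(query or {}), ""))


if __name__ == "__main__":
    print(local_domino_url("/example.nsf/lookup", {"id": "7"}))
```

Because the script text is identical on both machines, it can be synchronized freely; only the environment or hosts file differs.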
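To illustrate how the included file could be switched between roles, here is a Python sketch under some assumptions of my own: the non-synchronized directory holds one registration file per role (`register.primary.conf` and `register.standby.conf`), and the synchronized configuration includes a third file, `register.conf`, which this function replaces. Those file names are hypothetical.

```python
import os
import shutil
from pathlib import Path

ROLES = ("primary", "standby")


def activate_role(local_dir, role):
    """Make register.conf a copy of the registration file for `role`.

    The copy goes to a temporary name first and is then renamed into
    place, so a reader never sees a half-written include file.
    """
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    local_dir = Path(local_dir)
    source = local_dir / f"register.{role}.conf"
    target = local_dir / "register.conf"
    tmp = local_dir / "register.conf.tmp"
    shutil.copyfile(source, tmp)
    os.replace(tmp, target)  # atomic rename within the same directory
    return target
```

Asterisk would still need to reload its configuration afterwards before the new registration settings take effect.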
What’s left to do?
I’ve got the new machine in place and running. The data synchronization is up and active as well now. What remains is for me to test, test, and test – and then build the automatic scripts for detecting a failure on the primary and automatically cutting over to the secondary by performing the following actions:
1. Update the local configuration “include” files to change their state to the “primary” configuration.
2. Update the DNS zone file and notify the secondaries.
3. Update the VoIP provider to fail over to the secondary machine automatically.
4. Make sure that all the remote software clients are using DNS entries and not IP addresses or host files.
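Before any of the steps above run, something has to decide that the primary is really down and not just slow for a moment. A minimal Python sketch of that decision follows, assuming a plain TCP connect is a good enough health check and that a few consecutive failures should be required before cutting over. The threshold of three and the class and function names are illustrative choices, not tested values.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class FailureDetector:
    """Signal a cutover only after `threshold` consecutive failed checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy):
        """Record one check result; return True when cutover should start."""
        if healthy:
            self.consecutive_failures = 0  # any success resets the count
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

A cron job or small loop on the standby could call `detector.record(check_tcp(host, port))` at a fixed interval and kick off the cutover steps the first time it returns True.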
Longer term, I hope to consolidate much of this by building auto-failover to a secondary DNS name into the client software, and by creating a single “Localized” configuration directory on the servers that includes everything not automatically kept in sync, along with a “README” file with a checklist.
What about you? Is your failover plan up to date?
Your a geek! :)
Mike