I've had trouble off and on with a couple of Domino servers on Linux. The server goes pear-shaped, and when I ssh in and look at the console, Domino is reporting drive errors. If you attempt to do anything on the OS at all, you quickly see that the whole file system has shifted into a "read-only" state. This is a bit like a car with a transmission problem shifting into "limp-home" mode. Needless to say, Domino doesn't like being unable to write to the disk.
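If you want to confirm the read-only state without poking around by hand, the kernel's mount table shows it. The following is a minimal sketch, not something from my servers' actual tooling: the function name ro_mounts is made up for illustration, and it assumes the standard /proc/mounts layout (device, mount point, file system type, options).

```shell
# List mount points of a given file system type that are currently
# mounted read-only.
# Usage: ro_mounts [fstype] [mounts-file]
# Defaults (assumptions): fstype=ext3, mounts-file=/proc/mounts
ro_mounts() {
    fstype="${1:-ext3}"
    mounts="${2:-/proc/mounts}"
    awk -v t="$fstype" '
        $3 == t {
            # Field 4 is the comma-separated option list; look for an
            # exact "ro" entry so "errors=remount-ro" does not match.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") { print $2; break }
        }
    ' "$mounts"
}
```

Running ro_mounts with no arguments on an affected server should print the mount point of any ext3 file system that has flipped to read-only; no output means everything is still mounted read-write.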
It has happened to me specifically with the most recent updates of CentOS 5, but since that's the only distribution I use, I can't tell you whether it's specifically related to the distro. I don't think so, because I've seen reports of others with similar issues. I also know that it isn't Domino's fault, but rather that Domino is so disk intensive that it tends to be one of the places where the problem comes up.
The problem manifests when the disk is so busy that at some point the driver just can't keep up. When this happens in the Windows server world, either Domino will crash or the entire OS will just halt. Usually people think this is a RAID controller problem and start replacing hardware. In fact, it's just the driver reporting an error state to the OS that it can't keep up and the OS reacting badly. On Linux, the ext3 file system (roughly equivalent to the NTFS file system in Windows) will react to any write fault based on an option stored in the superblock. The options are "continue", which will ignore the problem and just keep chugging along; "remount-ro", which will cause the file system to remount in a read-only state; and "panic", which will essentially crash the OS and reboot.
Generally speaking, the default mode ("remount-ro" on my servers) is the best for most important servers. It is the most likely to have no ill effects on existing data. It will stop the server from doing anything new, however. The option to "panic" is never good. Rebooting the OS with a drive that's reporting problems is at best going to send it into a lengthy file system check, and if the problem is serious could mean the drive will never come back up at all. Since I have plenty of redundancy throughout the environment, I decided to give the "continue" option a try. You can alter the setting using "tune2fs" (e.g. $ sudo tune2fs -e continue /dev/sda1 ).
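Before changing the setting, it's worth checking what the superblock currently says. Here is a rough sketch of how that might look; the helper name errors_behavior is invented for this example, /dev/sda1 is just the device from the command above, and the parsing assumes the "Errors behavior:" line that tune2fs -l prints.

```shell
# Pull the "Errors behavior" value out of `tune2fs -l` output read on stdin.
# (Helper name is hypothetical; it only filters text and changes nothing.)
errors_behavior() {
    sed -n 's/^Errors behavior:[[:space:]]*//p'
}

# Typical use on a live system (needs root; substitute your own device):
#   sudo tune2fs -l /dev/sda1 | errors_behavior
#
# Switch the behavior, then run the check again to confirm:
#   sudo tune2fs -e continue /dev/sda1
#
# To go back to the safer behavior later:
#   sudo tune2fs -e remount-ro /dev/sda1
```

The value is stored in the superblock, so it survives reboots. Keep in mind that an errors= option on the mount line in /etc/fstab, if one is present, takes precedence over the superblock value.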
What's interesting, and purely anecdotal at this point, is that disk I/O on this machine is now performing far better, even without any errors. I'll be keeping an eye on this over the next few days and will let you know if that changes. It is strange, though.
definitely want to give that a shot. Background info:
http://www.howtoforge.com/reducing-disk-io-by-mounting-partitions-with-noatime