Andrew Pollack's Blog

Technology, Family, Entertainment, Politics, and Random Noise

Wow, I've been busy. Here's a meaty tech post, about the intermittent time-out issue some of us have seen in Domino/Notes communication

By Andrew Pollack on 01/06/2005 at 08:29 AM EST

Sorry I haven't been posting. I'm working on some new things here, and have been REALLY busy on client work.

BTW: I am aware of at least one IBM person working on this problem. As a courtesy, I will not mention the person's name, as they are not a frequent poster on this forum and may not want a bunch of emails. I'll pass along anything you have to add which confirms or contradicts what's in this note.

PLEASE read this, and if you can confirm the observations I've made, contradict any, or add thoughts to the theory or resolution ideas -- it would be of great value.

Definition of the issue:

A Lotus Notes client, Designer, Admin client, or another server attempts to use a connection session which was already established but which has been idle for some period of time. The call to the other end across the connection goes unanswered, and a modal wait period is imposed. On the client, this is the lightning bolt display; at the server, it's simply that the task is unavailable during the period. At the end of the connection timeout period on the client side, an error is reported indicating a failure to connect. In some circumstances, the client side will then attempt to re-establish the connection, often successfully. Sometimes, hitting Ctrl-Break to interrupt the connection attempt early will have the same effect. This is because it forces the client side to drop any existing connections. The timing is important to the issue, but seems to vary based on some unknown variable. If the idle period is less than 5 minutes, I believe it's another issue. Remember, other kinds of connection issues exist, so not all connection issues will be this connection issue.
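The sequence described above (the idle session is silently gone, the first call goes unanswered until the timeout, then a fresh connection works) can be modeled in a few lines. This is only an illustrative sketch: `FakeLink`, `call_with_reconnect`, and the idle limit are hypothetical names and values made up for the example, not anything taken from Notes or Domino.

```python
class FakeLink:
    """Hypothetical stand-in for an established client/server session."""

    def __init__(self, idle_limit_s):
        # Assumed: after this much idle time the far end has silently gone away.
        self.idle_limit_s = idle_limit_s
        self.idle_s = 0.0

    def call(self):
        # The client is never told about the drop, so a stale link can only
        # show up as an unanswered call, modeled here as TimeoutError.
        if self.idle_s >= self.idle_limit_s:
            raise TimeoutError("no answer on idle session")
        self.idle_s = 0.0
        return "ok"


def call_with_reconnect(link):
    """Try the existing link; on timeout drop it, reconnect, and retry once.

    Returns (link_in_use, attempts_made).
    """
    try:
        link.call()
        return link, 1
    except TimeoutError:
        fresh = FakeLink(link.idle_limit_s)  # re-established session
        fresh.call()
        return fresh, 2
```

Under this model a session that has sat idle past the limit always costs one failed attempt before the retry succeeds, which matches the "time out, fail, then reconnect" pattern reported here.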

Some further observations:

This has been more common on Windows-based servers. I've personally seen it when connecting to only one non-Windows server: over the last few days, on a SUSE Linux based 6.5.3 server. The problem in this case was extreme, and seems to have gone away after removing the SMB networking package (Samba) from that machine, though I cannot be sure that this was the cause or that this was the same issue. The behavior was very much the same.

I have seen this with much greater frequency -- almost reproducibly, though I have not confirmed that -- in cases where the Windows 2000 based Domino server had more than one network IP address AND the connection was being established to an address which was not the primary (first configured) network IP address.

I have seen similar-looking but unrelated issues due to bad firewall configurations (particularly when dealing with source-routed packets, IP tunnels, VPNs, and improper masquerade or NAT configurations). This usually manifests as sessions that start but don't complete, and often leaves multiple visible connection sessions on the server, seen with the SH U command.

I have seen EXACTLY the same behavior in a connection from a Domino server to a Microsoft Active Directory LDAP port on another machine, when the Domino server was configured with Directory Assistance to use that LDAP server for HTTP login credentials. In this case, if the session has been idle for some time, the first attempt to use it must time out and fail, then subsequent re-attempts work. This was discovered because the timeout in the LDAP configuration was set to 60 seconds, and it was requiring 62 seconds for the first connection in the morning to work. We are currently working around the issue by setting the LDAP timeout to 15 seconds, and now the first connection in the morning takes 17 seconds -- which the users find acceptable.
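The arithmetic behind that workaround is simple enough to write down. The sketch below assumes, based only on the two measurements above, that the first request costs the full timeout plus a roughly constant reconnect overhead; `first_use_delay` is a name made up for the example.

```python
def first_use_delay(ldap_timeout_s, reconnect_s):
    """Seconds before the first request succeeds on a stale idle session."""
    return ldap_timeout_s + reconnect_s


# Overhead implied by the figures in the post: 62 s observed with a 60 s timeout.
reconnect_s = 62 - 60

for timeout_s in (60, 15):
    print(timeout_s, first_use_delay(timeout_s, reconnect_s))
```

With the 2 second overhead, the 60 second setting gives 62 seconds and the 15 second setting gives 17 seconds, in line with what was observed. Lowering the timeout shortens the wait but does not remove the failed first attempt.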

I don't believe I've seen this issue when dealing with partitioned Domino servers, in cases where the partitions are defined to use distinct IP addresses as set in the notes.ini. I have not yet confirmed this in a lab.

A hypothesis and some suggestions for resolution of symptoms

I believe that at some point around 3 years ago, in an attempt to harden the Win32 TCP stack against what was then a common denial of service attack (opening thousands of connections and never dropping them), a change was made to the length of time that an idle connection was left open (or perhaps to the total number of open connections on a port), and that a further change was made which in some way hid the dropped listener from the client side. This may simply be done by not sending a packet to the client indicating a drop, so that the client wouldn't immediately respond with a re-establish message. Another theory is that the OS changed the way it reports or handles multiple IPs on the same NIC which are on the same subnet when "talking" with the software (in this case, Domino). This may be causing the server to respond, but with the response going out slightly differently down the stack and out the NIC, in a way that the client side does not see the response as being part of the same session. It's also possible that this is a change at the client side: the server behavior may always have been there, but the client side is now less accepting of packets outside what it expects to see. This would also be a defensive move.

To resolve, try the following:

a) Make sure your Domino server's "official" IP address -- the one your clients connect to (the internal one if it's behind a firewall doing NAT) -- is the primary address on the only NIC in the server (if that's possible in your environment).

b) Configure even single partition Domino servers as if they are one of several partitions on the server, adding the proper parameters to the INI settings to specifically use an IP address rather than all the IP addresses reported by the operating system.
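The idea in (b), listening on one named address instead of every address the operating system reports, can be illustrated with plain sockets. This is a generic sketch, not Domino code: `listen_on` is a hypothetical helper, and 127.0.0.1 stands in for the server's "official" address. In Domino itself, as far as I recall, the binding is done with a port-specific notes.ini line of the form `TCPIP_TcpIpAddress=0,a.b.c.d:0`; please verify that syntax against the Domino administration documentation for your release before relying on it.

```python
import socket


def listen_on(address, port=0):
    """Open a TCP listener bound to one specific local address.

    Passing "0.0.0.0" instead would accept connections on every local
    address, which is the default behavior suggestion (b) moves away from.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((address, port))  # port 0 lets the OS pick a free port
    srv.listen(1)
    return srv
```

Calling `listen_on("127.0.0.1").getsockname()` reports the single address the listener is tied to, so replies on that socket are sourced from that address rather than from whichever address the OS would otherwise choose.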

SYN-ACK Attack
By Declan Lynch on 01/06/2005 at 09:32 AM EST

Sounds like an accidental attack of the SYN-ACK exploit. On certain Windows
servers, if there are lots of unanswered sessions then tcp/ip memory gets eaten
up and causes major slowdown of the stack.

This 'fix' in Windows 2003 SP1 might help explain it better than me:
http://www.microsoft.com/technet/community/columns/cableguy/cg1204.mspx#EBAA

but then again, I may be totally wrong here...

I don't think it's an accidental attack, I think it's....
By Andrew Pollack on 01/06/2005 at 11:28 AM EST

a result of the service pack (or a service pack) using a common technique to
silently drop a connection rather than announcing its end.

My own thoughts on this are...
By Amy B on 01/06/2005 at 10:21 AM EST

I use hostnames wherever possible rather than IP addresses. I have A records
internally in my DNS so that the hostname maps properly to the Domino server's
internal non-routable IP address when I'm on the LAN. When I'm traveling, I
use the same connection docs, etc. but the A records resolve to public DNS A
records which provide the server's external, public IP address. Clean and
simple and easy to modify.

- A

Right, that's common and all -- but this issue isn't about resolution.
By Andrew Pollack on 01/06/2005 at 11:30 AM EST

This is a valid resolved link that intermittently drops, but does so in a way
that the client side doesn't know it. So, the client side tries to use the
link, but has to time out and fail, then reestablish one.

Other References
By Julian Robichaux on 01/06/2005 at 11:45 AM EST

Other references to this issue (I think it's the same one):

http://www.edbrill.com/ebrill/edbrill.nsf/dx/09162004093630PMEBR53V.htm?opendocument&comments

http://www-10.lotus.com/ldd/nd6forum.nsf/55c38d716d632d9b8525689b005ba1c0/51002e4d8eb72bdf85256eef00269f7d?OpenDocument

One Other Follow-Up On Ed's Site
By Julian Robichaux on 01/06/2005 at 11:47 AM EST

http://www.edbrill.com/ebrill/edbrill.nsf/dx/09212004093613PMEBR53P.htm?opendocument&comments

My own thoughts on this are...
By duanebear on 01/07/2005 at 08:03 AM EST

I have been working on this with IBM for the past two months. This issue or
one close to it has been addressed in Domino 6.5.4. The SPR is RGET5Q5TJL. It
appears to be a problem with the IOCP interface in Domino. It can "sleep" for
4 minutes. Check out the SPR and see if this matches what you are
experiencing. We are seeing it while running Domino 6.5.1 and 6.5.3 on AIX
specifically. Duane

Info on some known issues
By Ted Stanton on 01/07/2005 at 10:04 AM EST

SPR# IDEA5VSS27 - Resolved/Fixed in 6.5.3 - Notes 6.5 client pages out after 60
seconds of inactivity

SPR# JEIN5XSLJP - Resolved/Fixed in 6.5.3 - Poor performance of in-memory
design cache

SPR# SVRO63SNKW - Resolved/Fixed in 6.5.4 - 1MB in-memory design cache isn't
big enough

SPR# BSPR65FJ2R - Resolved/Fixed in 6.5.3 FP1 - Chronos: Error full text
indexing mail\xxxx.nsf: Message Queue is full

Above are some SPRs related to client-side performance.
