Sorry I haven't been posting. I'm working on some new things here, and have been REALLY busy on client work.
BTW: I am aware of at least one IBM person working on this problem. As a courtesy, I will not mention the person's name, as they are not a frequent poster on this forum and may not want a bunch of emails. I'll pass along anything you have to add which confirms or contradicts what's in this note.
PLEASE read this, and if you can confirm the observations I've made, contradict any, or add thoughts to the theory or resolution ideas -- it would be of great value.
Definition of the issue:
A Lotus Notes client, Designer, Admin client, or another server attempts to use a connection session which was already established but which has been idle for some period of time. The call to the other end across the connection goes unanswered, and a modal wait period is imposed. On the client, this shows as the lightning bolt display; at the server, it's simply that the task is unavailable during the period. At the end of the connection timeout period on the client side, an error is reported indicating a failure to connect. In some circumstances, the client side will then attempt to re-establish the connection, often successfully. Sometimes, hitting Ctrl-Break to interrupt the connection attempt early will have the same effect, because it forces the client side to drop any existing connections. The timing is important to the issue, but seems to vary based on some unknown variable. If it's less than 5 minutes, I believe it's another issue. Remember, other kinds of connection issues exist, so not every connection issue will be this one.
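To make the pattern above concrete, here is a minimal Python sketch of "try the idle session, time out, drop it, reconnect". It is only an illustration of the behavior described, not how Notes or Domino implement it. The function name send_with_reconnect and the connect callback are hypothetical, and plain TCP-style sockets stand in for a Notes session.

```python
import socket


def send_with_reconnect(sock, payload, connect, timeout=5.0, reply_size=4096):
    """Send payload on an existing (possibly stale) socket.

    If the far end never answers within `timeout`, drop the old socket,
    open a fresh one via `connect()` (hypothetical callback returning a
    connected socket) and retry once.
    Returns (reply, socket_now_in_use, reconnected).
    """
    try:
        sock.settimeout(timeout)
        sock.sendall(payload)
        reply = sock.recv(reply_size)
        if reply:
            return reply, sock, False
    except OSError:
        # Covers timeouts as well as resets on a dead session.
        pass

    # The idle session is unusable: drop it, as Ctrl-Break would force.
    sock.close()
    fresh = connect()
    fresh.settimeout(timeout)
    fresh.sendall(payload)
    return fresh.recv(reply_size), fresh, True
```

Note that the caller pays the full timeout before the retry even starts, which matches the long wait followed by a quick success on the second attempt.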
Some further observations:
This has been more common on Windows-based servers. I've personally seen it connecting to only one non-Windows server: over the last few days, on a SUSE Linux-based 6.5.3 server. That was an extreme case, and the problem seems to have gone away after removing the SMB networking package (Samba) from that machine, though I cannot be sure that this was the cause or that this was the same issue. The behavior was very much the same.
I have seen this with much greater frequency -- almost reproducibly, though I have not confirmed that -- in cases where the Windows 2000 based Domino server had more than one network IP address AND the connection was being established to an address which was not the primary (first configured) network IP address.
I have seen similar-looking but unrelated issues due to bad firewall configurations (particularly when dealing with source-routed packets, IP tunnels, VPNs, and improper masquerade or NAT configurations). This usually manifests as sessions that start but don't complete, and often leaves multiple visible connection sessions on the server, as seen with the SH U command.
I have seen EXACTLY the same behavior in a connection from a DOMINO server to a Microsoft Active Directory LDAP port on another machine, when the Domino server was configured with Directory Assistance to use that LDAP server for HTTP login credentials. In this case, if the session has been idle for some time, the first attempt to use it must time out and fail, then subsequent re-attempts work. This was discovered because the timeout in the LDAP configuration was set to 60 seconds, and it was requiring 62 seconds for the first connection in the morning to work. We are currently working around the issue by setting the LDAP timeout to 15 seconds, and now the first connection in the morning takes 17 seconds -- which the users find acceptable.
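The arithmetic behind those numbers can be written out as a small Python sketch. The roughly 2 second cost of the successful retry is inferred from the figures above (62 - 60 and 17 - 15), not measured separately, and the function name is just for illustration.

```python
def first_login_delay(stale_timeout_s, fresh_connect_s):
    """Delay seen by the first user after an idle period.

    The stale session has to time out completely before the retry on a
    fresh connection can start, so the two times simply add up.
    """
    return stale_timeout_s + fresh_connect_s


# Inferred from the observations: 62 s total with a 60 s timeout.
retry_cost_s = 62 - 60

print(first_login_delay(60, retry_cost_s))  # original LDAP timeout
print(first_login_delay(15, retry_cost_s))  # workaround LDAP timeout
```

In other words, lowering the timeout only shortens the wait; the first attempt on the idle session still fails.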
I don't believe I've seen this issue when dealing with partitioned Domino servers, in cases where the partitions are defined to use distinct IP addresses as set in the notes.ini. I have not yet confirmed this in a lab.
A hypothesis and some suggestions for resolution of symptoms
I believe that at some point around 3 years ago, in an attempt to harden Win/32's TCP stack against what was a common denial of service attack (opening thousands of connections and never dropping them), a change was made to the length of time that an idle connection is left open (or perhaps to the total number of open connections on a port), and that a further change was made which in some way hides the dropped listener from the client side. This may be done simply by not sending a packet to the client indicating a drop, so that the client won't immediately respond with a re-establish message. Another theory is that the OS changed the way it reports or handles multiple IPs on the same NIC which are on the same subnet when "talking" with the software (in this case, Domino). This may be causing the server to respond, but the response to go out slightly differently down the stack and out the NIC, in a way that the client side does not see the response as being part of the same session. It's also possible that this is a change at the client side: that the server-side behavior was always there, and the client side is now less accepting of packets outside what it expects to see. This would also be a defensive move.
To resolve, try the following:
a) Make sure your Domino server's "official" IP address -- the one your clients connect to (the internal one if it's behind a firewall doing NAT) -- is the primary address on the only NIC in the server (if that's possible in your environment).
b) Configure even single partition Domino servers as if they are one of several partitions on the server, adding the proper parameters to the INI settings to specifically use an IP address rather than all the IP addresses reported by the operating system.
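As a rough illustration of (b), the Python sketch below builds the kind of notes.ini line used to bind a Notes port to one specific address. The setting format (PortName_TCPIPAddress=0,a.b.c.d:0) is my recollection of the partitioned-server documentation and should be checked against the docs for your Domino release. The port name TCPIP and the address 192.168.1.10 are placeholders, and the helper name is hypothetical.

```python
import ipaddress


def bind_address_line(ip, port_name="TCPIP", tcp_port=0):
    """Return a notes.ini line binding `port_name` to a single IP address.

    ASSUMPTION: the format <PortName>_TCPIPAddress=0,<ip>:<port> is recalled
    from the Domino partitioned-server documentation; a port of 0 is assumed
    to mean "use the default Notes port". Verify before using it.
    """
    addr = ipaddress.ip_address(ip)  # raises ValueError on a malformed address
    return f"{port_name}_TCPIPAddress=0,{addr}:{tcp_port}"


print(bind_address_line("192.168.1.10"))
```

The port name has to match the port actually defined in your notes.ini, and the server needs a restart before the binding takes effect.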
servers if there are lots of unanswered sessions then tcp/ip memory gets eaten
up and causes major slowdown of the stack.
This 'fix' in Windows 2003 SP1 might help explain it better than me:
http://www.microsoft.com/technet/community/columns/cableguy/cg1204.mspx#EBAA
but then again, I may be totally wrong here...