Background
We have two Cloverleaf environments (5.5 rev 1), production and development, that run on the same version of AIX (5.3). We have two Cerner environments, production and development, that run on the same version of AIX (5.3).
We have a number of FTP processes that get files from our Cerner server and put them on foreign systems. These processes were configured before I assumed this position, and have not presented any issues.
For this reason, when I was tasked with creating a new FTP job from Cerner, I copied one of the existing FTP threads and modified the information for the new directory and files. In our development system, this worked flawlessly.
However, when I performed the same process in production, it brought the server down after several days of seemingly normal operation. The server administrator identified that the Cloverleaf box had over 30,000 sockets open, all sitting in CLOSE_WAIT status on his side (the Cerner server). Killing the thread closed all of the sockets, but while the thread is running, it opens another port every 60 seconds.
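For anyone who wants to watch the socket count grow on either box, here is a rough Python sketch that tallies TCP states from `netstat -an` style output. The sample rows and IP addresses are made up for illustration, and the column layout (state in the sixth column, foreign address in the fifth) is an assumption that should be checked against the actual output on your system.

```python
from collections import Counter


def count_tcp_states(netstat_output, remote_filter=None):
    """Count TCP connections per state in `netstat -an` style text.

    remote_filter is an optional plain substring matched against the
    foreign-address column (so "10.0.0.9" would also match "10.0.0.90").
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed row layout: proto recv-q send-q local foreign state
        if len(fields) < 6 or not fields[0].lower().startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        if remote_filter and remote_filter not in foreign:
            continue
        counts[state] += 1
    return counts


# Hypothetical sample rows; the addresses are invented for this example.
SAMPLE = """\
tcp4       0      0  10.0.0.5.21        10.0.0.9.40112     CLOSE_WAIT
tcp4       0      0  10.0.0.5.21        10.0.0.9.40113     CLOSE_WAIT
tcp4       0      0  10.0.0.5.22        10.0.0.7.51514     ESTABLISHED
tcp4       0      0  *.21               *.*                LISTEN
"""

if __name__ == "__main__":
    # Only count connections whose foreign address contains 10.0.0.9
    print(count_tcp_states(SAMPLE, remote_filter="10.0.0.9"))
```

Feeding it the saved output of `netstat -an` once a minute should show whether CLOSE_WAIT climbs by one per polling cycle.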
On the Cloverleaf box, the connections sit in FIN_WAIT_2 status, which times out after around 10 minutes. The CLOSE_WAIT status on the other box does not time out; those sockets sit there until the FTP connection is closed. Throughout all of this, the files are transferred successfully.
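The FIN_WAIT_2 / CLOSE_WAIT pairing is what TCP produces when one side closes a connection and the application on the other side never closes its own socket. The sketch below reproduces that half-close over the loopback interface in Python. It is only an illustration of the TCP behaviour, not of how Cloverleaf or the Cerner FTP daemon is implemented, and the function name is hypothetical.

```python
import socket


def half_close_demo():
    """Show a half-closed TCP connection on the loopback interface."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    client.sendall(b"file contents")
    data = server_side.recv(1024)

    # The client closes first: its end moves toward FIN_WAIT_2.
    client.close()

    # The server sees end-of-stream (an empty read), but its socket stays
    # in CLOSE_WAIT until the server application itself calls close().
    eof = server_side.recv(1024)
    still_open = server_side.fileno() != -1

    server_side.close()  # only now does the CLOSE_WAIT entry go away
    listener.close()
    return data, eof, still_open


if __name__ == "__main__":
    print(half_close_demo())
```

If the remote FTP process never reaches its equivalent of `server_side.close()`, one CLOSE_WAIT socket is left behind for every connection the client opens and drops.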
Troubleshooting
Workarounds that did not work:
Copied the working thread in test and modified it to connect to prod.
Created a new thread from scratch.
Removed the Directory Parse and route tcl procs.
Modified the settings on the thread: changed the style (single, eof, etc.), the read interval, etc.
Observations
There is one thing that stands out to me when this thread is running. Whenever it grabs a file, it goes to an Up status, opening a new port. After around 30 seconds (the scan interval), it goes back to Opening. Around 30 seconds later (when there is another file), it goes to Up again.
By contrast, the thread that I copied, which is connected to the development Cerner box, stays Up. I think that this is the crux of the issue that we're experiencing. Can anyone think of a reason that an inbound FTP thread (one that gets the file) would be closing and reopening the connection? Given that we have other, similar FTP jobs that connect to the server without any issue and stay Up, there appears to be something amiss with this thread.
I appreciate any insight that anyone can provide into this issue.
TL;DR version