We have developed a strange problem: during a 30-minute period, from 3:30am to 4:00am, threads drop their connections or a process panics (or both). We’re running two sites, and only one process on each site seems to be affected. Both are inbound processes. No other applications are running on the server at that time.
Using a sniffer, we found that during this 30-minute period, while the hardware ACK was sent immediately, Cloverleaf would delay ACK’ing the message long enough for the sending system to time out. Cloverleaf would then refuse any attempts by the sending system to reconnect until the thread (or, many times, the process) was bounced. I set up alerts to bounce the threads and/or processes, and we got the sending systems to extend their time-outs. Monitoring the server during this time showed that the whole virtual box was being slowed down, even though the task monitor showed low CPU usage. We couldn’t find any cause for this slowdown.
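One way to put numbers on a whole-box slowdown that doesn't show up as CPU usage is to log how late a short sleep wakes up. Below is a minimal sketch, assuming Python is available on the server. The `monitor` function and its parameters are hypothetical names for illustration and are not part of Cloverleaf. If the virtual machine itself is being stalled, the wake-ups should overshoot by far more than the threshold during the problem window.

```python
import time
from datetime import datetime


def monitor(duration_s, interval_s=0.1, threshold_s=0.5,
            clock=time.monotonic, sleep=time.sleep, report=print):
    """Sleep in short intervals and record every wake-up that is late.

    Returns a list of overshoots (seconds) that exceeded threshold_s.
    clock, sleep and report are injectable so the loop can be tested
    without actually waiting.
    """
    stalls = []
    end = clock() + duration_s
    prev = clock()
    while prev < end:
        sleep(interval_s)
        now = clock()
        # How much longer than requested did the sleep actually take?
        overshoot = (now - prev) - interval_s
        if overshoot > threshold_s:
            stalls.append(overshoot)
            report(f"{datetime.now().isoformat()} wake-up late by {overshoot:.3f}s")
        prev = now
    return stalls
```

Calling something like `monitor(3600)` from a scheduled task that starts shortly before the window would cover the whole slowdown. Timestamps of the late wake-ups could then be lined up against the sniffer trace and the hypervisor's own logs.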
So we built a new interface engine on a beefed-up virtual server: more memory, more CPUs, etc. Everything ran great for one month. Then the mysterious problem returned, only the “window” has shifted to about 20 minutes later. The same two processes are going down. The bouncing alerts and extended time-outs get us through this 30-minute slowdown, as the threads aren’t down long enough to really get noticed, but it’s still not good. Everything runs great for the other 23.5 hours of the day.
Has anyone else encountered a problem like this? Any suggestions on what might be causing it? Our network guys insist there are no backups or anything else going on at this time.