Mysterious problem bogs down Engine

This topic has 6 replies, 4 voices, and was last updated 14 years, 9 months ago by James Cobane.

Creator

Topic
June 15, 2011 at 2:49 pm #52540
Mark Brown
Participant
Earlier this year we moved our interface engine from a dedicated Windows 2003 server to a virtual machine. We also upgraded to 5.7R3 of Cloverleaf.

A strange problem developed: during a 30-minute period, from 3:30am to 4:00am, threads would drop their connections or a process would panic (or both). We’re running two sites and only one process on each site seems to be affected. Both are inbound processes. No other applications are running on the server at that time.

Using a sniffer we found that during this 30-minute period, while the hardware ACK was sent immediately, Cloverleaf would delay ACK’ing the message long enough for the sending system to time out. Then Cloverleaf would refuse any attempts by the sending system to reconnect until the thread (or many times the process) was bounced. I set up alerts to bounce the threads and/or processes and we got the sending systems to extend their time-outs. Monitoring the server during this time showed that the whole virtual box was being slowed down, even though the task monitor showed low CPU usage. We couldn’t find any cause for this slowdown.

So we built a new interface engine on a beefed-up virtual server: more memory, more CPUs, etc. Everything ran great for one month. Then the mysterious problem returned, only the “window” has shifted to about 20 minutes later. The same two processes go down. The bouncing alerts and extended time-outs get us through this 30-minute slowdown, as the threads aren’t down long enough to really get noticed, but it’s still not good. Everything runs great for 23.5 hours of the day.

Has anyone else encountered a problem like this? Any suggestions on what might be causing it? Our network guys insist there are no backups or anything else going on at this time.
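
The reconnect refusals described above can be watched from outside the engine with a plain TCP probe. The following is a minimal sketch, not the alert configuration used in this thread: `probe_port` is a hypothetical helper name, and the host and port you pass it are whatever your inbound thread listens on. It only tells you whether a connection attempt is accepted, refused, or times out.

```python
import socket


def probe_port(host: str, port: int, timeout: float = 5.0) -> str:
    """Make one TCP connection attempt and classify the result.

    Returns 'open', 'refused', 'timeout', or 'error'.
    The function name and return labels are illustrative assumptions.
    """
    try:
        # create_connection performs the TCP handshake and closes on exit.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is accepting connections on that port.
        return "refused"
    except socket.timeout:
        # No answer within `timeout` seconds.
        return "timeout"
    except OSError:
        # Any other network failure (unreachable host, bad address, ...).
        return "error"
```

Logging the result once a minute with a timestamp around the problem window would show exactly when the listener stops accepting connections and when it recovers after a bounce.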

Viewing 5 reply threads

Author

Replies
- June 25, 2011 at 9:28 am #74624
  James Cobane
  Participant
  Mark,
  
  Does the sending system disconnect each time it sends or does it maintain a persistent connection? It sounds like you may want to make the connection a ‘multi-server port’ connection. If you do, you’ll need to modify your ACK proc to include the necessary DRIVERCTL info to allow Cloverleaf to ACK back on the appropriate client port…
  
  Jim Cobane
  
  Henry Ford Health
- June 27, 2011 at 1:10 am #74625
  Russ Ross
  Participant
  The symptoms you describe make me first want to ask someone if there is a virus scan or backup occurring on the physical VM box Cloverleaf is running on during the time Cloverleaf bogs down.
  
  I’ve also seen monitoring software that creates a daily report have similar adverse impact, so check that too.
  
  If it occurs at a regular time of day it might even become necessary to be there to see first hand what is happening with some of your own monitoring of system resources during the event horizon.
  
  If you do watch it first hand, see if the number of processes spikes, which might have been okay on a dedicated box but not on a VM.
  
  I once had this problem and was firing off 200 processes all at once due to me backgrounding jobs and looping ahead before any had finished.
  
  Russ Ross
  RussRoss318@gmail.com
- June 28, 2011 at 3:42 pm #74626
  Mark Brown
  Participant
  The sending system normally maintains a persistent connection. After the time-out, the sending system just starts hammering the engine with connection requests and the engine returns with refusals. Only after the alert bounces the thread will it reconnect.
  
  I’ve watched the virtual server while this is going on. Everything slows down, files take a long time to open, some threads on the engine start queuing messages. When you look at the CPU usage, it looks normal.
  
  The network guys say there aren’t any backups or anything else running on the host server.
  
  I hope I don’t jinx it, but since posting the original message, the problem seems to have stopped even though it had been going on for a couple of months.
- June 28, 2011 at 7:45 pm #74627
  Peter Heggie
  Participant
  This sure looks like some background process is impacting resources, especially when it runs at the same time every day. I used to see that a lot also. In addition to virus scans and backups, there are also ‘system-level’ processes, or backups that could be running on other servers, that impact SAN response time – the problem could be on the storage or network side and not necessarily something running on your server. Processes running on other partitions or virtual images can easily impact your virtual server in ways other than the CPU and memory dedicated to your server. If there are storage response time measurement tools available, that might be something to check into.
  
  Peter Heggie
  PeterHeggie@crouse.org
- July 19, 2011 at 5:52 pm #74628
  Mark Brown
  Participant
  I thought I’d post a follow up that I hope doesn’t jinx things. For the past couple of weeks, this mysterious problem has gone away. I kept insisting, at our site, that something must be acting externally on the virtual box the interface engine was running on. I kept being told that nothing was going on at the time so it had to be the engine, which made no sense at all.
  
  Well, when the SAN took down all the virtual servers, it became obvious what the problem was. Since the SAN has been patched and restarted, the interface engine has been running perfectly.
- July 19, 2011 at 8:27 pm #74629
  James Cobane
  Participant
  As always, Cloverleaf is “guilty until proven innocent”. I don’t know how many times we get the fingers pointed at the engine, and then we prove that it isn’t a Cloverleaf issue. And more often than not, Cloverleaf finds the problem with the other system before they become aware of it (i.e. “Hey, vendor, is your system having problems? We’ve got data queueing on Cloverleaf…” Vendor: “Oh, yeah, it looks like our server is down….” ) My $.02…..
  
  Jim Cobane
  
  Henry Ford Health
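
For anyone chasing a similar slowdown where CPU looks normal but files take a long time to open, a crude storage latency probe can help make the case that the problem sits outside the engine. The sketch below is an assumption-laden illustration, not a tool anyone in this thread used: the function names, the 4 KB payload, and the threshold are all hypothetical choices. It times a small write plus `fsync` in a directory on the suspect storage and picks out the slow samples.

```python
import os
import tempfile
import time


def time_small_write(directory: str, size: int = 4096) -> float:
    """Create, write, fsync, and delete a small temp file; return elapsed seconds."""
    payload = b"\0" * size
    start = time.perf_counter()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the data to storage so the timing includes real I/O
    finally:
        os.close(fd)
        os.remove(path)
    return time.perf_counter() - start


def collect(directory, count, interval, clock=time.time, sleep=time.sleep):
    """Take `count` samples, `interval` seconds apart, as (timestamp, seconds) pairs."""
    samples = []
    for _ in range(count):
        samples.append((clock(), time_small_write(directory)))
        sleep(interval)
    return samples


def slow_samples(samples, threshold):
    """Return only the samples whose latency exceeded `threshold` seconds."""
    return [(ts, secs) for ts, secs in samples if secs > threshold]
```

Running `collect` across the problem window and again during a quiet hour gives two sets of numbers to compare. If the write times jump during the window while CPU stays flat, that points at storage or the host rather than the engine.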

The forum ‘Cloverleaf’ is closed to new topics and replies.