Mysterious problem bogs down Engine

  • Creator
    Topic
  • #52540
    Mark Brown
    Participant

      Earlier this year we moved our interface engine from a dedicated Windows 2003 server to a virtual machine. We also upgraded to Cloverleaf 5.7R3.

      A strange problem developed: during a 30-minute period, from 3:30am to 4:00am, threads would drop their connections or a process would panic (or both).  We’re running two sites and only one process on each site seems to be affected.  Both are inbound processes.  No other applications are running on the server at that time.

      Using a sniffer we found that during this 30-minute period, while the hardware ACK was sent immediately, Cloverleaf would delay ACK’ing the message long enough for the sending system to time out.  Then Cloverleaf would refuse any attempts by the sending system to reconnect until the thread (or many times the process) was bounced.  I set up alerts to bounce the threads and/or processes and we got the sending systems to extend their time-outs. Monitoring the server during this time showed that the whole virtual box was being slowed down, even though the task monitor showed low CPU usage.  We couldn’t find any cause for this slowdown.
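
The sniffer finding above comes down to comparing two delays per message: the TCP-level ("hardware") ACK and the application-level ACK. A minimal Python sketch of that comparison, assuming the capture has already been reduced to timestamps. The `Exchange` record, the `find_timeouts` helper, and the 30-second timeout are hypothetical names and values for illustration, not anything from Cloverleaf or from the sniffer used here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Exchange:
    """One message as seen in a packet capture (hypothetical record layout)."""
    msg_sent: float            # seconds: sender transmitted the message
    tcp_ack: float             # seconds: TCP-level ("hardware") ACK observed
    app_ack: Optional[float]   # seconds: application ACK observed, None if never


def find_timeouts(exchanges: List[Exchange], sender_timeout_s: float) -> List[Exchange]:
    """Return exchanges whose application ACK was missing or later than the timeout."""
    late = []
    for ex in exchanges:
        if ex.app_ack is None or ex.app_ack - ex.msg_sent > sender_timeout_s:
            late.append(ex)
    return late


# Illustrative numbers only, not taken from the real capture.
capture = [
    Exchange(msg_sent=0.0, tcp_ack=0.001, app_ack=0.2),     # healthy
    Exchange(msg_sent=10.0, tcp_ack=10.001, app_ack=55.0),  # app ACK 45 s late
    Exchange(msg_sent=60.0, tcp_ack=60.001, app_ack=None),  # never ACKed
]
late = find_timeouts(capture, sender_timeout_s=30.0)
print(len(late))
```

A fast `tcp_ack` next to a late or missing `app_ack` is the pattern described: the network stack is answering, but the application is not getting to the message in time.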

      So we built a new interface engine on a beefed-up virtual server: more memory, more CPUs, etc.  Everything ran great for one month.  Then the mysterious problem returned, only the “window” had shifted about 20 minutes later.  The same two processes go down.  The bouncing alerts and extended time-outs get us through this 30-minute slowdown, as the threads aren’t down long enough to really get noticed, but it’s still not good. Everything runs great for 23.5 hours of the day.

      Has anyone else encountered a problem like this?  Any suggestions on what might be causing it?  Our network guys insist there are no backups or anything else going on at this time.

    Viewing 5 reply threads
    • Author
      Replies
      • #74624
        James Cobane
        Participant

          Mark,

          Does the sending system disconnect each time it sends or does it maintain a persistent connection?  It sounds like you may want to make the connection a ‘multi-server port’ connection.  If you do, you’ll need to modify your ACK proc to include the necessary DRIVERCTL info to allow Cloverleaf to ACK back on the appropriate client port…

          Jim Cobane

          Henry Ford Health

        • #74625
          Russ Ross
          Participant

            The symptoms you describe make me first want to ask someone if there is a virus scan or backup occurring on the physical VM box Cloverleaf is running on during the time Cloverleaf bogs down.

            I’ve also seen monitoring software that creates a daily report have similar adverse impact, so check that too.

            If it occurs at a regular time of day it might even become necessary to be there to see first hand what is happening with some of your own monitoring of system resources during the event horizon.

            If you do watch it first hand, see if the number of processes spikes, which might have been okay on a dedicated box but not on a VM.

            I once had this problem and was firing off 200 processes all at once due to me backgrounding jobs and looping ahead before any had finished.

            Russ Ross
            RussRoss318@gmail.com

          • #74626
            Mark Brown
            Participant

              The sending system normally maintains a persistent connection.  After the time-out, the sending system just starts hammering the engine with connection requests and the engine returns with refusals.  Only after the alert bounces the thread will it reconnect.

              I’ve watched the virtual server while this is going on.  Everything slows down, files take a long time to open, some threads on the engine start queuing messages.  When you look at the CPU usage, it looks normal.

              The network guys say there aren’t any backups or anything else running on the host server.

              I hope I don’t jinx it, but since posting the original message, the problem seems to have stopped even though it had been going on for a couple of months.

            • #74627
              Peter Heggie
              Participant

                This sure looks like some background process is impacting resources, especially when it runs at the same time every day. I used to see that a lot also. In addition to virus scans and backups, there are also ‘system-level’ processes, or backups that could be running on other servers, that impact SAN response time – the problem could be in the storage or network side and not necessarily something running on your server. Processes running on other partitions or virtual images can easily impact your virtual server in ways other than the cpu and memory dedicated to your server. If there are storage response time measurement tools available, that might be something to check into.
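
If no storage response time tool is at hand, a rough probe can be run on the engine box itself. The sketch below is a minimal Python example, not a vendor tool: it times a small write that is forced to disk and flags slow samples. The function names, the 4 KB payload, and the 0.5-second threshold are all assumptions chosen for illustration, and the directory should be one that lives on the storage you suspect.

```python
import os
import tempfile
import time
from datetime import datetime


def probe_write_latency(directory, size_bytes=4096):
    """Time one small write that is forced to disk with fsync; return seconds."""
    payload = b"\0" * size_bytes
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        os.write(fd, payload)
        os.fsync(fd)  # push the write through the OS cache to storage
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)


def monitor(directory, interval_s=5.0, samples=3, slow_threshold_s=0.5):
    """Take `samples` probes and return (timestamp, latency_s, is_slow) tuples."""
    results = []
    for i in range(samples):
        latency = probe_write_latency(directory)
        stamp = datetime.now().isoformat(timespec="seconds")
        results.append((stamp, latency, latency > slow_threshold_s))
        if i < samples - 1:
            time.sleep(interval_s)  # wait between probes, not after the last one
    return results


if __name__ == "__main__":
    for stamp, latency, slow in monitor(tempfile.gettempdir()):
        print(f"{stamp} {latency * 1000:.1f} ms{'  SLOW' if slow else ''}")
```

Left running across the problem window with a larger `samples` value, a probe like this would show write latency climbing while CPU usage stays flat, which points at storage rather than at the server's own processes.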

                Peter Heggie
                PeterHeggie@crouse.org

              • #74628
                Mark Brown
                Participant

                  I thought I’d post a follow-up that I hope doesn’t jinx things. For the past couple of weeks, this mysterious problem has gone away. I kept insisting, at our site, that something must be acting externally on the virtual box the interface engine was running on.  I kept being told that nothing was going on at the time so it had to be the engine, which made no sense at all.

                  Well, when the SAN took down all the virtual servers, it became obvious what the problem was.  Since the SAN has been patched and restarted, the interface engine has been running perfectly.

                • #74629
                  James Cobane
                  Participant

                    As always, Cloverleaf is “guilty until proven innocent”.  I don’t know how many times we get the fingers pointed at the engine, and then we prove that it isn’t a Cloverleaf issue.  And more often than not, Cloverleaf finds the problem with the other system before they become aware of it (i.e. “Hey, vendor, is your system having problems?  We’ve got data queueing on Cloverleaf…”  Vendor:  “Oh, yeah, it looks like our server is down….” )  My $.02…..

                    Jim Cobane

                    Henry Ford Health

                • The forum ‘Cloverleaf’ is closed to new topics and replies.