Is Killing Messages causing a problem? This is LONG, sorry


  • Creator
    Topic
  • #48096
    Sallie Turner
    Participant

      We are running Cloverleaf v. 5.3 on HP-UX and have four working sites on the same box. One site has been giving us trouble: it runs fine, then starts dumping certain messages to the error database, and then basically stops working and has to be killed and cleaned up before it will run again. We have suffered with this for about two weeks, and the vendor we have a support contract with has not been able to resolve it. I have a theory as to what is going on, but I need some guidance from someone who understands Cloverleaf under the covers.

      Here is the sequence of events:

      Start site – all works fine

      About a day passes, then Orders and Cancel Orders start going to the error db. All processes on the site start logging database errors.

      Nothing will shut down nicely at this point. Kill everything, clean up, restart and all is fine.

      The messages that go to the error database show a tcl callout error. The thread that does the routing shows this in the error log:

      10/09/2005 20:24:31 [msi :msi :ERR /0:    sendafile] msiSectionLock: Can’t lock semaphore for thread sendafile: Too many open files

      10/09/2005 20:24:31 [msi :msi :ERR /0:    sendafile] msiExportStats: Can’t lock data section for thread sendafile

      Our Unix admin has looked at the box and noticed that there are a lot of kill processes popping in and out all the time.

      On this site, we use a tcl proc that reads in messages and either continues them or kills them depending on contents. We also use hcitpsmsgkill and kill_ob_save on the outbound threads on that site.

      So, here is the question (finally   🙂 )

      Does executing all these kills somehow “use up” all available semaphores on a site level?  This is what appears to be happening.  About 33,000 messages come in to the site daily. There are three processes running and 8 threads.

      Any insight would be greatly appreciated.

    • Author
      Replies
      • #57618
        Anonymous
        Participant

          This may be off the mark (I’m not intimately familiar with HP-UX), but I’m keying on the wording “too many open files”. Is it possible that the tcl code opens a file but never closes it?
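
If an unclosed file does turn out to be the cause, the usual remedy is to pair every open with a close on every path, including the error path. Below is a minimal sketch in plain Tcl, assuming the site's proc reads a file by path. `readFileSafely` is a hypothetical helper name, not part of Cloverleaf, and the real proc may look quite different.

```tcl
# Read a whole file and release the channel on both success and failure.
# readFileSafely is a hypothetical helper name, not a Cloverleaf API.
proc readFileSafely {path} {
    set fd [open $path r]

    # catch traps any error raised by read so the channel can still be closed.
    set failed [catch {read $fd} data]

    if {$failed} {
        # Save the error details before close has a chance to overwrite them.
        set savedInfo $::errorInfo
        set savedCode $::errorCode
        catch {close $fd}
        error $data $savedInfo $savedCode
    }

    close $fd
    return $data
}
```

The sketch uses `catch` rather than newer constructs so that it should also run on older Tcl releases. A proc that returns early, or that errors between `open` and `close` without this kind of guard, leaves one channel behind on each call, which could add up over tens of thousands of messages a day.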

        • #57619
          Kathy Zwilling
          Participant

            This isn’t going to be a lot of specific information, but I am hoping it is enough to help some.

            A couple of times since we started with Cloverleaf we have had similar messages, and the Unix admin had to increase the number of open files allowed in the kernel parameters. I can’t tell you the specific parameters he changed, but maybe your system admin will have an idea. The kernel parameter settings specified in the Cloverleaf installation documentation might also help.

            Good luck!

          • #57620
            Sallie Turner
            Participant

              I don’t believe that it was the Unix OS, as I was running Glance for about 24 hours before the problem occurred and when the site crashed it showed:

              Parameter   Available    Used   Utilization %   High %
              nproc            2054     168               8        9
              nfile           16394    2146              13       39
              shmmni           1024      21               2        2
              msgmni             50       2               4        4
              semmni           4096     145               4        4
              nflocks          4096      23               1        1
              npty             1024       0               0        0
              nbuf               na   59262              na       na

            • #57621
              David Burks
              Participant

                Quote:

                On this site, we use a tcl proc that reads in messages and either continues them or kills them depending on contents. We also use hcitpsmsgkill and kill_ob_save on the outbound threads on that site.

                So, here is the question (finally 🙂)

                Does executing all these kills somehow “use up” all available semaphores on a site level?

                To address this part of your post: hcitpsmsgkill acts upon a message, not a file, and thus would have no impact on open file handles or semaphores.

                kill_ob_save only handles the housekeeping of getting rid of a saved state 14 message and clearing a memory variable. Again, this is not an action that should be related to the errors you mention. These items should not spawn processes that need killing, and therefore should not be related to the kill processes noted by your Unix admin.

                The error you reference mentions thread sendafile. Is it always this thread? If so, look more closely at that thread and any tcl code associated with it for possible leaks.

                You might also turn the engine output for the associated thread/process up to enable_all and see if you get more information about what is happening leading up to the panic. Be sure you have auto log cycling turned on, as the log will grow substantially.
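
One way to look for a leak in the Tcl code on that thread is to record how many channels the interpreter has open and watch whether the number climbs as messages flow. Below is a rough sketch, assuming the Tcl version bundled with the engine provides `file channels`. `openChannelCount` and `channelGrowth` are hypothetical helper names, not Cloverleaf APIs.

```tcl
# Count the channels (files, sockets, pipes, stdio) open in this interpreter.
# openChannelCount is a hypothetical helper name, not a Cloverleaf API.
proc openChannelCount {} {
    return [llength [file channels]]
}

# Report how many more channels are open now than at the baseline.
# A value that keeps growing between calls suggests a missing close.
proc channelGrowth {baseline} {
    return [expr {[openChannelCount] - $baseline}]
}

set baseline [openChannelCount]
# ... exercise the suspect code here ...
puts "channels opened and not yet closed: [channelGrowth $baseline]"
```

`file channels` only lists channels registered in the interpreter that calls it, so the count would need to be taken from code running in the suspect thread. On most Unix systems "Too many open files" is the message for a per-process descriptor limit, so a per-process count like this may be more telling than system-wide totals.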
