Is Killing Messages causing a problem? This is LONG, sorry

This topic has 4 replies, 4 voices, and was last updated 19 years, 8 months ago by David Burks.

Creator

Topic
October 20, 2005 at 5:25 pm #48096
Sallie Turner
Participant
We are running Cloverleaf v. 5.3 on HP Unix and have four working sites on the same box. We have been having problems with one site: it runs fine, then starts dumping certain messages to the error database, and then basically stops working and has to be killed and cleaned up to get it working again. We have suffered with this for about the past two weeks and are under a support contract with a vendor who cannot resolve our problem. I have a theory as to what is going on, but I need some guidance from someone who understands Cloverleaf under the covers.

Here is the sequence of events:

Start site – all works fine

About a day passes, then Orders and Cancel Orders start going to the error db. All processes on the site start logging database errors.

Nothing will shut down nicely at this point. Kill everything, clean up, restart and all is fine.

The messages that go to the error database show a tcl callout error. The thread that does the routing shows this in the error log:

10/09/2005 20:24:31 [msi :msi :ERR /0: sendafile] msiSectionLock: Can’t lock semaphore for thread sendafile: Too many open files

10/09/2005 20:24:31 [msi :msi :ERR /0: sendafile] msiExportStats: Can’t lock data section for thread sendafile

Our Unix admin has looked at the box and noticed that there are a lot of kill processes popping in and out all the time.

On this site, we use a tcl proc that reads in messages and either continues them or kills them depending on contents. We also use hcitpsmsgkill and kill_ob_save on the outbound threads on that site.

So, here is the question (finally 🙂 )

Does executing all these kills somehow “use up” all available semaphores on a site level? This is what appears to be happening. About 33,000 messages come in to the site daily. There are three processes running and 8 threads.

Any insight would be greatly appreciated.
Viewing 3 reply threads

Author

Replies
- October 20, 2005 at 7:31 pm #57618
  Anonymous
  Participant
  This may be off the mark (I’m not intimate with HPUX), but I’m just keying on the verbiage, “too many open files”. Is it possible that the tcl code does a file open, but never closes the file?
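
To make the "open but never close" pattern concrete, here is a minimal sketch in Python rather than Tcl (the thread does not show the actual proc, so this is only an analogy). The helper name `open_until_failure` and the limit of 64 are hypothetical; the sketch temporarily lowers the per-process descriptor limit, keeps opening a file without closing it, and reports the error the operating system eventually raises.

```python
import errno
import os
import resource


def open_until_failure(path, soft_limit=64):
    """Open `path` repeatedly without closing it.

    Returns (handles_opened, errno_at_failure). The soft descriptor limit
    is lowered for the duration of the call so the failure shows up quickly.
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
    leaked = []
    try:
        # At most soft_limit descriptors can exist, so this loop must fail.
        for _ in range(soft_limit + 1):
            leaked.append(open(path))  # the leak: no matching close()
        return len(leaked), None
    except OSError as exc:
        return len(leaked), exc.errno
    finally:
        # Clean up so the caller is left in its original state.
        for handle in leaked:
            handle.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))


if __name__ == "__main__":
    count, err = open_until_failure(os.devnull)
    print(count, errno.errorcode.get(err), os.strerror(err))
```

The equivalent fix in any language is to pair every open with a close on every code path, including error paths.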
- October 24, 2005 at 9:03 pm #57619
  Kathy Zwilling
  Participant
  This isn’t going to be a lot of specific information but I am hoping it is enough to help some.
  
  We have had a couple of times, since we started with Cloverleaf, where we had similar messages and basically the Unix admin had to increase the # of open files possible in the Kernel parameters. I can’t tell you the specific parameters he changed but maybe your system admin will have an idea. If you look at the kernel parameter settings specified in the Cloverleaf installation documentation maybe that would help.
  
  Good luck!
- October 25, 2005 at 6:14 pm #57620
  Sallie Turner
  Participant
  I don’t believe that it was the Unix OS, as I was running Glance for about 24 hours before the problem occurred and when the site crashed it showed:
  
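  | Parameter | Available | Used | Utilization % | High % |
  |---|---|---|---|---|
  | nproc | 2054 | 168 | 8 | 9 |
  | nfile | 16394 | 2146 | 13 | 39 |
  | shmmni | 1024 | 21 | 2 | 2 |
  | msgmni | 50 | 2 | 4 | 4 |
  | semmni | 4096 | 145 | 4 | 4 |
  | nflocks | 4096 | 23 | 1 | 1 |
  | npty | 1024 | 0 | 0 | 0 |
  | nbuf | na | 59262 | na | na |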
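
One caveat when reading these figures: they are system-wide tables. On Unix systems the text "Too many open files" is conventionally the message for errno `EMFILE`, which is the per-process descriptor limit, while a full system-wide file table is reported as `ENFILE`. Assuming the Cloverleaf log line is printing the errno text (nothing in this thread confirms that), low `nfile` utilization would not rule out a single process hitting its own limit. On HP-UX the per-process limit is, as far as I know, set by the `maxfiles` and `maxfiles_lim` tunables, which do not appear in the output above. The sketch below, with a hypothetical helper name, shows the distinction in Python.

```python
import errno
import os
import resource


def describe_file_limits():
    """Return the per-process descriptor limits and the two errno texts."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "per_process_soft": soft,   # limit that triggers EMFILE
        "per_process_hard": hard,   # ceiling the soft limit can be raised to
        "emfile_text": os.strerror(errno.EMFILE),  # per-process limit hit
        "enfile_text": os.strerror(errno.ENFILE),  # system-wide table full
    }


if __name__ == "__main__":
    for key, value in describe_file_limits().items():
        print(f"{key}: {value}")
```

Comparing the per-process limit against the number of descriptors actually held by the failing process would be a more direct check than the system-wide totals.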
- November 2, 2005 at 9:19 pm #57621
  David Burks
  Participant
  Quote:
  
  On this site, we use a tcl proc that reads in messages and either continues them or kills them depending on contents. We also use hcitpsmsgkill and kill_ob_save on the outbound threads on that site.
  
  So, here is the question (finally 🙂 )
  
  Does executing all these kills somehow “use up” all available semaphores on a site level?
  
  To address this part of your post: hcitpsmsgkill acts upon a message, not a file, and thus would have no impact on open file handles or semaphores.
  
  kill_ob_save only handles housekeeping of getting rid of a saved state 14 message and clearing a memory variable. Again, not an action that should be related to the errors you mention. These items should not spawn processes that need killing and therefore should not be related to the kill processes noted by your unix admin.
  
  Your error referenced here mentions thread sendafile. Is it always this thread? If so look closer at that thread and any tcl code associated with it for possible leaks.
  
  You might also turn the engine output for the associated thread/process up to enable_all and see if you get more information about what is happening leading up to the panic. Be sure you have auto log cycling turned on, as the log will grow substantially.
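
The advice to check a thread's code for leaks can be approximated by counting open descriptors before and after running a handler many times. The sketch below is in Python and is only an illustration of the technique; the function names `count_open_fds` and `fd_growth` are hypothetical, and it says nothing about how the Cloverleaf engine itself manages descriptors.

```python
import os


def count_open_fds(max_fd=1024):
    """Count descriptors below max_fd that are currently open in this process."""
    count = 0
    for fd in range(max_fd):
        try:
            os.fstat(fd)  # succeeds only if fd refers to an open descriptor
        except OSError:
            continue
        count += 1
    return count


def fd_growth(handler, iterations=100):
    """Run handler repeatedly and return the change in open descriptors."""
    before = count_open_fds()
    for _ in range(iterations):
        handler()
    return count_open_fds() - before


if __name__ == "__main__":
    held = []

    def leaky():
        held.append(open(os.devnull))  # opened and never closed

    def clean():
        with open(os.devnull) as handle:
            handle.read(0)  # closed automatically on exit

    print("clean growth:", fd_growth(clean, 10))
    print("leaky growth:", fd_growth(leaky, 10))
    for handle in held:
        handle.close()
```

A handler whose descriptor count grows with each call is a candidate for the kind of leak that would eventually produce a "too many open files" error.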
The forum ‘Cloverleaf’ is closed to new topics and replies.