Here is the sequence of events:
1. Start the site; everything works fine.
2. About a day passes, then Orders and Cancel Orders start going to the error database. All processes on the site start logging database errors.
3. Nothing will shut down cleanly at this point. We kill everything, clean up, and restart, and all is fine again.
The messages that go to the error database show a Tcl callout error. The thread that does the routing shows this in the error log:
10/09/2005 20:24:31 [msi :msi :ERR /0: sendafile] msiSectionLock: Can’t lock semaphore for thread sendafile: Too many open files
10/09/2005 20:24:31 [msi :msi :ERR /0: sendafile] msiExportStats: Can’t lock data section for thread sendafile
Our Unix admin has looked at the box and noticed a lot of kill processes appearing and disappearing all the time.
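Since the log text is "Too many open files", which is the usual system message for a process hitting its file descriptor limit, one check we could run is to watch how many descriptors a process holds against that limit. Below is a minimal sketch of such a check, not anything supplied with the engine. It assumes the box has a /proc filesystem that exposes an fd directory per process, and the function names count_open_fds and fd_limit are made up for illustration.

```python
import os
import resource


def count_open_fds(pid="self"):
    """Count descriptors held by a process.

    Assumption: /proc/<pid>/fd exists on this system and is readable
    by the user running the check. "self" means the current process.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_limit():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = fd_limit()
    print(f"open descriptors: {count_open_fds()}  soft limit: {soft}  hard limit: {hard}")
```

Calling count_open_fds with the process ID of one of the site's processes every few minutes would show whether the count climbs steadily over the day until the errors start. Note that fd_limit reports the limit of the shell running the check, which may differ from the limit the site's processes were started with.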
On this site, we use a Tcl proc that reads in messages and either continues or kills them depending on their contents. We also use hcitpsmsgkill and kill_ob_save on the outbound threads on that site.
So, here is the question (finally):
Does executing all these kills somehow "use up" all available semaphores at the site level? This is what appears to be happening. About 33,000 messages come in to the site daily. There are three processes and eight threads running.
Any insight would be greatly appreciated.