Thread would NOT start after process panic

Clovertech Forums (Read Only Archives), Cloverleaf forum

  • #48178
    Rentian Huang
    Participant

      Greetings!!!

      – v5.2 on AIX

      One of my processes had a panic, so I ran a cleanup. But when I tried to bring the process back up, the only thread that belongs to it (set to autostart) never started! I couldn’t even bring the thread up manually.

      Then I tried to stop the process, but every time I run hcienginestop it gives me the same error and, after 5 minutes, the process stops.

      Code:


      hcienginestop -p adt_raw
      Trying hcicmd…
      No response within timeout — Assuming process is hung!
      Exiting.
      hcicmd failed!

      Now trying SIGINT…
      Now trying SIGKILL…
      Process ‘adt_raw’ is not running


      Running hciprocstatus gives me this:

      Code:

      adt_raw         dead     Terminated by signal 0 at Thu Dec  1 14:44:34 2005


      I then re-ran hcienginerun -p adt_raw and it panicked again:

      Code:

      adt_raw         dead     Abnormal exit – Cloverleaf software panic at Thu D
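
A minimal sketch of how the status lines quoted above could be checked by a script. It assumes each line has the layout "process name, state, free-text detail" separated by whitespace, which is inferred only from the two lines in this post; the function names are hypothetical and are not part of Cloverleaf.

```python
import re

# Assumed layout, inferred from the two lines quoted in this post:
#   <process name>   <state>   <free-text detail>
STATUS_RE = re.compile(r"^(\S+)\s+(\S+)\s+(.*)$")


def parse_status_line(line):
    """Split one status line into its fields, or return None if it does not fit."""
    m = STATUS_RE.match(line.strip())
    if m is None:
        return None
    name, state, detail = m.groups()
    return {
        "process": name,
        "state": state,
        "detail": detail,
        # Flag lines whose detail text mentions a panic.
        "panicked": "panic" in detail.lower(),
    }


def dead_processes(status_text):
    """Return the parsed rows whose state column is 'dead'."""
    rows = (parse_status_line(line) for line in status_text.splitlines())
    return [row for row in rows if row and row["state"] == "dead"]


# The two status lines from this post, used as sample input.
sample = """\
adt_raw         dead     Terminated by signal 0 at Thu Dec  1 14:44:34 2005
adt_raw         dead     Abnormal exit - Cloverleaf software panic at Thu D
"""

for row in dead_processes(sample):
    print(row["process"], row["state"], "panic" if row["panicked"] else "no panic")
```

In practice the text would come from the captured output of hciprocstatus rather than a hard-coded sample.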

      I have no problems with the other processes in the same site. This happened after I tried to run a whole day’s worth of data for testing.

      Can anyone give me some advice? Thanks!

      Sam

      • #57914
        James Cobane
        Participant

          Sam,

          Take a look at the process log for the adt_raw process; it may give you a better clue as to what is happening.  One quick thing that you can try initially is to run the ‘hcilmclear’ command on that process:

          hcilmclear -p adt_raw

          This does some clean-up work for the process after a panic; then try to re-start the process.
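
The sequence described here (stop, clear, restart) could be scripted roughly as follows. This is only a sketch: it uses the three commands exactly as they appear in this thread, assumes they are on the PATH in a configured Cloverleaf environment, and assumes that this order is appropriate. The exit codes of these commands are not described in the thread, so the sketch only reports them. The function names are hypothetical.

```python
import subprocess


def recovery_commands(process):
    """Build the command lines quoted in this thread for one process."""
    return [
        ["hcienginestop", "-p", process],  # make sure the process is down
        ["hcilmclear", "-p", process],     # clean-up after the panic
        ["hcienginerun", "-p", process],   # try to start it again
    ]


def recover(process, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for argv in recovery_commands(process):
        print(" ".join(argv))
        if not dry_run:
            # check=False: report the exit code instead of assuming its meaning.
            result = subprocess.run(argv, check=False)
            print("  exit code:", result.returncode)


# Dry run by default, so nothing is executed here.
recover("adt_raw")
```

Keeping the dry run as the default makes it possible to review the exact commands before anything touches the engine.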

          Hope this helps.

          Jim Cobane

          Henry Ford Health

        • #57915
          Anonymous
          Participant

            Also check the recovery database to see if there is any message causing the problem.

          • #57916
            James Cobane
            Participant

              Carlos raises a very good point.  Sometimes you may see a message with a state of 0 which will cause the associated process to crash until it is removed from the recovery database.  Not sure what will cause a message to go to state 0, but it is likely related to something going awry in a tcl proc.

              Jim Cobane

              Henry Ford Health

            • #57917
              garry r fisher
              Participant

                Hi Sam,

                I had a similar problem and asked Quovadx to dial in and look at it for me. They found a message in the recovery database in the wrong state and simply deleted it, and we were able to start everything back up again.

                Regards

                Garry

              • #57918
                Rentian Huang
                Participant

                  Thanks for all your responses!

                  Garry, Greg Day told us about the same scenario you described.

                  I did open the log and found 1000+ repetitions of the following:

                  Code:

                  [msg :Msg :INFO/0:adt_raw_xlate] [0.0.8159727] Updating the recovery database
                  [xlt :thre:INFO/1:adt_raw_xlate] [0.0.8159729] Requeuing: 1133462238.6509/xlate post

                  Since I am working on the test site, there are too many msgs in the rdb. I took a rough look at them and they all seem to be in state 7.

                  I will do a cleanup again, plus blow away all msgs in the rdb, and see what happens…
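
A rough sketch of how repeated requeue lines like the ones quoted above could be counted, to see whether a single entry accounts for most of the repetitions. It assumes the bracketed field before "Requeuing:" (for example 0.0.8159729) identifies the message; that reading is an assumption, as is the function name.

```python
import re
from collections import Counter

# Matches lines such as:
#   [xlt :thre:INFO/1:adt_raw_xlate] [0.0.8159729] Requeuing: 1133462238.6509/xlate post
# Group 1 is the bracketed field, assumed here to identify the message.
REQUEUE_RE = re.compile(r"\[([\d.]+)\]\s+Requeuing:")


def count_requeues(log_lines):
    """Count 'Requeuing:' lines per bracketed ID."""
    counts = Counter()
    for line in log_lines:
        m = REQUEUE_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Sample input built from the two log lines in this post (second one repeated).
lines = [
    "[msg :Msg :INFO/0:adt_raw_xlate] [0.0.8159727] Updating the recovery database",
    "[xlt :thre:INFO/1:adt_raw_xlate] [0.0.8159729] Requeuing: 1133462238.6509/xlate post",
    "[xlt :thre:INFO/1:adt_raw_xlate] [0.0.8159729] Requeuing: 1133462238.6509/xlate post",
]

# Show the IDs with the most requeue lines first.
for msg_id, n in count_requeues(lines).most_common(5):
    print(msg_id, n)
```

An open log file can be passed in place of the list, since the function only iterates over lines.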

                  Sam  8)

                • #57919
                  Rentian Huang
                  Participant

                    Hi all,

                    I did a hcidbinit -A and everything is back to normal!!

                    I guess the tricky thing is this: if we were in production, how would we identify the msgs that cause the problem, and how would we fix it? We can’t just blow away all msgs in the rdb… Maybe based on their states, as James said?

                    Garry, can you tell us more on the wrong state you mentioned?

                    Thanks again!

                    Sam

                  • #57920
                    Anonymous
                    Participant

                      Being very careful with production, I would sort the dbdump message list, find the top message, dump it into a file, and run it through the appropriate test tool. The problem message is probably one of the messages currently being processed. Then delete the bad one from the db; you still have it in a file to mess with later.

                  • The forum ‘Cloverleaf’ is closed to new topics and replies.