Site crashing issue


  • #122140
    Tim Jipson
    Participant

      I was wondering if anyone has had issues with sites being chronically unstable. I’ve seen this happen in a few different ways. Today, every site on the test engine looked like the attached image: all threads looked like they were running, but the processes were displayed as down, and at the command line I could see that they were not running.

      I have also seen the opposite: threads and processes appear down in the GUI, but the command line shows that everything is up and running. This has plagued us for years. It started with 6.1 running on AIX; we are now on 2022.09 running on Red Hat and are seeing the same issues. Many tickets have been opened with Infor support, but no cause has ever been pinpointed.

      • #122142
        Vince Angulo
        Participant

          No experience with the first issue, but we used to see the second issue from time to time.  Haven’t seen it in years, but here’s what we have in our troubleshooting wiki:

          Entire site:

          The hcimonitord is hung.

          Use ps -ef | grep hcimonitord to get the pid, then kill -9 <pid>.  Restart the process with hcisitectl -s m.
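
The entire-site steps above could be sketched as a small helper. This is only a sketch: pids_matching is a hypothetical function name, and it assumes the Linux `ps -ef` layout in which the command is the eighth column and the daemon shows up under its own name there. Matching on that column rather than grepping the whole line avoids picking up the grep process itself.

```shell
#!/bin/sh
# Print the PIDs of processes whose command name equals the given name.
# Reads `ps -ef`-style output on stdin, so it can be tried on captured output.
# Assumption: column 2 is the PID and column 8 is the command (Linux layout).
pids_matching() {
    awk -v name="$1" '
        NR > 1 {
            # Strip any leading directory from the command path.
            n = split($8, parts, "/")
            if (parts[n] == name) print $2
        }'
}

# Usage on a live system (kill and restart commands as given in the wiki entry):
#   pid=$(ps -ef | pids_matching hcimonitord)
#   [ -n "$pid" ] && kill -9 $pid
#   hcisitectl -s m
```

The live commands are left as comments so that nothing is killed until the PID has been checked by eye.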

          One process:

          The hcienginewatch for the process is not running, so threads in that process show as ‘dead’ and will not start.  The threads are probably still running; attempt to verify on the target system.

          Use ps -ef | grep <process name>.  It should return two results: the pid for hciengine and the pid for hcienginewatch.

          If there’s no hcienginewatch, the Network Monitor becomes unresponsive for that process: it thinks the process is stopped, but a start attempt fails because hciengine is still running.

          Resolution is to kill -9 <pid> for the hciengine process, then restart the process from the left side of the Network Monitor as usual.
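
The one-process check above (two results expected, one for hciengine and one for hcienginewatch) could be scripted roughly as follows. engine_state is a hypothetical helper, and it assumes the process name appears as its own argument on both command lines and that the command is the eighth column of `ps -ef` output; adjust if your system shows them differently.

```shell
#!/bin/sh
# Classify one Cloverleaf process from `ps -ef`-style output on stdin.
# Prints "ok" (engine and watcher both present), "no-watch" (engine only),
# "watch-only" (watcher only), or "down" (neither found).
engine_state() {
    awk -v proc="$1" '
        NR > 1 {
            n = split($8, parts, "/")
            cmd = parts[n]
            if (cmd != "hciengine" && cmd != "hcienginewatch") next
            # Assumption: the process name is a separate argument.
            for (i = 9; i <= NF; i++) {
                if ($i == proc) { seen[cmd] = 1; break }
            }
        }
        END {
            if (seen["hciengine"] && seen["hcienginewatch"]) print "ok"
            else if (seen["hciengine"]) print "no-watch"
            else if (seen["hcienginewatch"]) print "watch-only"
            else print "down"
        }'
}

# Usage on a live system:
#   ps -ef | engine_state <process name>
```

A result of "no-watch" corresponds to the case described above, where hciengine has to be killed before the process can be restarted from the Network Monitor.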

          ======================================

          A site init (db cleanup) should be scheduled when possible.

        • #122143
          Tim Jipson
          Participant

            Hi Vince,

            Sometimes that has helped. Sometimes I did a whole site db rebuild and there was no change. Sometimes I’d restart everything and see no immediate change, but after 5 minutes the GUI would start responding correctly. The randomness of the issue has made troubleshooting almost impossible.

          • #122144
            Jason Russell
            Participant

              You should also check your monitord logs ($sitedir/exec/hcimonitord/hcimonitord.log). We had a site do this: the monitor daemon would crash, causing the issues you described. The threads were still processing, but the GUI would never load correctly. Ours was caused by alerts that send emails stepping on each other’s toes and leaving behind a file that the engine couldn’t do anything with, which caused the monitord to fail.

              Sometimes restarting the monitord process would work; sometimes we’d have to kill it. Sometimes it’d come back up on its own, but the pid file in the same directory as the log was not clearing out properly, causing the ‘status’ to read incorrectly.
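
One way to audit a pid file like the one described above is to compare the PID it records against the live process table. The sketch below uses a hypothetical helper, pid_file_status, and assumes the file holds the PID as its first token; the pid file's actual name is not given in this thread, so the path is passed in as an argument.

```shell
#!/bin/sh
# Report whether the PID recorded in a pid file belongs to a live process.
# Prints "running", "stale" (no live process with that PID), or "missing".
pid_file_status() {
    pidfile=$1
    if [ ! -r "$pidfile" ]; then
        echo "missing"
        return
    fi
    # Assumption: the PID is the first whitespace-delimited token in the file.
    pid=$(awk 'NR == 1 { print $1 }' "$pidfile")
    case $pid in
        ''|*[!0-9]*) echo "stale"; return ;;
    esac
    # ps -p exits non-zero when no process with that PID exists.
    if ps -p "$pid" >/dev/null 2>&1; then
        echo "running"
    else
        echo "stale"
    fi
}

# Usage on a live system:
#   pid_file_status <path to the pid file next to hcimonitord.log>
```

A "running" result only shows that some process has that PID; confirming that it is actually hcimonitord still takes a look at the ps output.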

            • #122147
              Tim Jipson
              Participant

                Hi Jason,

                That makes a lot of sense; we do run a lot of script-based alerts. I have spot-checked the pid files, but I haven’t done a full audit while the issue was occurring. That will be my next step. Thank you!
