Site crashing issue

This topic has 6 replies, 3 voices, and was last updated 3 months, 2 weeks ago by Jason Russell.

September 3, 2025 at 11:30 am #122140
Tim Jipson
Participant
I was wondering if anyone has had issues with a site being chronically unstable. I’ve seen this happen in a few different ways. Today every site on the test engine looked like the attached image: all threads looked like they were running, but the processes were displayed as down, and at the command line I could see that they were not running.

I have also seen the opposite: threads and processes appeared down in the GUI, but the command line showed that everything was up and running. This has plagued us for years. It started with 6.1 running on AIX; we are now on 2022.09 running on Red Hat and are seeing the same issues. Many tickets have been opened with Infor support, but no cause has ever been pinpointed.

- September 3, 2025 at 11:51 am #122142
  Vince Angulo
  Participant
  No experience with the first issue, but we used to see the second issue from time to time. Haven’t seen it in years, but here’s what we have in our troubleshooting wiki:
  
  Entire site:
  
  The hcimonitord is hung.
  
  Use ps -ef | grep hcimonitord to get the pid, then kill -9 <pid>. Restart the process with hcisitectl -s m.
  
  One process:
  
  The hcienginewatch for the process is not running, so threads in that process show as ‘dead’ and will not start. The threads are probably still running; attempt to verify on the target system.
  
  Use ps -ef | grep <process name>. It should return two results: the pid for hciengine and the pid for hcienginewatch.
  
  If there’s no hcienginewatch, the Network Monitor becomes unresponsive because it thinks the process is stopped, and the process won’t start because hciengine is already running.
  
  Resolution is to kill -9 <pid> for the hciengine process, then restart the process from the left side of the Network Monitor as usual.
  
  ======================================
  
  A site init (db clean up) should be scheduled when possible
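  
  The single-process check above can be sketched as a small shell function. This is a minimal sketch and not part of the original wiki: check_engine_pair is a hypothetical helper name, and it assumes that both hciengine and hcienginewatch show the process name as a separate argument in ps -ef output, which is worth confirming on your own system. It only reports status and does not kill anything.
  
  ```shell
  # Classify a Cloverleaf process from `ps -ef` output read on stdin.
  # $1 is the process name (assumed to appear as its own word on the
  # command line of both hciengine and hcienginewatch).
  # Prints: ok | watch-missing <engine pid> | engine-missing | not-running
  check_engine_pair() {
    awk -v proc="$1" '
      {
        has_proc = 0; kind = ""
        for (i = 1; i <= NF; i++) {
          # Compare the basename of each field so a full path still matches,
          # and so "hciengine" is not confused with "hcienginewatch".
          n = split($i, parts, "/")
          if (kind == "" && parts[n] == "hciengine") kind = "engine"
          if (kind == "" && parts[n] == "hcienginewatch") kind = "watch"
          if ($i == proc) has_proc = 1
        }
        if (has_proc && kind == "engine") { engine = 1; engine_pid = $2 }
        if (has_proc && kind == "watch") watch = 1
      }
      END {
        if (engine && watch) print "ok"
        else if (engine) print "watch-missing " engine_pid
        else if (watch) print "engine-missing"
        else print "not-running"
      }'
  }
  ```
  
  Typical use would be ps -ef | check_engine_pair <process name>. A result of watch-missing corresponds to the case described above, where the hciengine pid has to be killed before the process can be restarted from the Network Monitor.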
- September 3, 2025 at 12:27 pm #122143
  Tim Jipson
  Participant
  Hi Vince,
  
  Sometimes that has helped. Sometimes I did a whole site db rebuild and there was no change. Sometimes I’d restart everything and see no immediate change, but after 5 minutes the GUI would start responding correctly. The randomness of the issue has made troubleshooting almost impossible.
- September 3, 2025 at 1:50 pm #122144
  Jason Russell
  Participant
  You should also be checking your monitord logs ($sitedir/exec/hcimonitord/hcimonitord.log). We had a site do this: the monitor daemon would crash, causing the issues you described. The threads were still processing, but the GUI would never load right. Ours was caused by alerts that send emails stepping on each other’s toes, which would leave a file that the engine couldn’t do anything with, causing the monitord to fail.
  
  Sometimes restarting the monitord process would work; sometimes we’d have to kill it. Sometimes it’d come back up on its own, but the pid file in the same directory as the log was not clearing out properly, causing the ‘status’ to read incorrectly.
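  
  A stale pid file like the one described above can be detected with a short check. This is only a sketch under stated assumptions: pidfile_state is a hypothetical helper, the pid file's name is not given in this thread so its path is passed in as an argument, and the file is assumed to contain just the pid. kill -0 sends no signal and only tests whether the pid exists, but it also fails for processes owned by another user, so it should be run as the account that owns the Cloverleaf processes.
  
  ```shell
  # Report whether the pid recorded in a pid file refers to a live process.
  # $1 is the path to the pid file (hypothetical; supply your own).
  # Prints: missing | invalid | live <pid> | stale <pid>
  pidfile_state() {
    pidfile="$1"
    [ -f "$pidfile" ] || { echo "missing"; return 0; }
    # Strip whitespace and newlines around the recorded pid.
    pid=$(tr -d '[:space:]' < "$pidfile")
    case "$pid" in
      ''|*[!0-9]*) echo "invalid"; return 0 ;;
    esac
    # Signal 0 checks for existence without affecting the process.
    if kill -0 "$pid" 2>/dev/null; then
      echo "live $pid"
    else
      echo "stale $pid"
    fi
  }
  ```
  
  A stale result while the GUI shows odd status would match the symptom above, where the leftover pid file made the status read incorrectly.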
- September 3, 2025 at 2:08 pm #122147
  Tim Jipson
  Participant
  Hi Jason,
  
  That makes a lot of sense; we do run a lot of script-based alerts. I have spot-checked the pid files, but I haven’t done a full audit while the issue was occurring. That will be my next step. Thank you!
- November 10, 2025 at 12:13 pm #122219
  Tim Jipson
  Participant
  We have tracked this issue down to a memory leak in CloverleafHostServer. I’ve seen the host server use over 5GB over the course of a month; restarting the host server drops memory usage to under 1GB. Since we started doing weekly host server restarts, we have not had any site crashing issues.
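  
  Memory growth of this kind can be watched between restarts with a small filter over ps output. This is a rough sketch, not a description of how it was tracked here: flag_high_rss is a hypothetical helper, the match string for the host server's command line is an assumption to replace with whatever ps shows on your system, and it expects Linux-style ps -eo pid,rss,args output with RSS in kilobytes.
  
  ```shell
  # Read `ps -eo pid,rss,args` output on stdin and print "<pid> <rss in MB>"
  # for each process whose command line contains $1 and whose resident
  # memory exceeds $2 kilobytes. The header line is skipped.
  flag_high_rss() {
    awk -v pat="$1" -v limit_kb="$2" '
      NR > 1 && index($0, pat) && ($2 + 0 > limit_kb + 0) {
        printf "%s %d\n", $1, $2 / 1024
      }'
  }
  ```
  
  For example, ps -eo pid,rss,args | flag_high_rss <match string> 4194304 would list matching processes above 4GB, where the threshold is an arbitrary illustration. Running it from cron and logging the output would show how quickly usage climbs after each restart.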
- November 10, 2025 at 12:28 pm #122221
  Jason Russell
  Participant
  Have you tracked it down to a process? The only time I’ve seen Cloverleaf leak memory like that was when a folder was being processed with fileset-local and it wasn’t cleaning itself out (or was looking too quickly) and was eating RAM. Clearing the folder almost immediately cleared the memory issue.
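  
  A quick way to see whether a fileset-local folder is backing up is to count the files sitting in it. This is a minimal sketch: count_files is a hypothetical helper, the directory is whatever the thread is configured to read, and -maxdepth is a GNU/BSD find option rather than strict POSIX. File names containing newlines would be overcounted.
  
  ```shell
  # Print the number of regular files directly inside directory $1
  # (subdirectories are not descended into).
  count_files() {
    # $(( ... )) strips the padding some wc implementations add.
    echo $(( $(find "$1" -maxdepth 1 -type f -print | wc -l) ))
  }
  ```
  
  A count that keeps rising while the thread is running would point at the folder not being cleaned out.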