Receiving HCIMONITORD messages

This topic has 13 replies, 7 voices, and was last updated 15 years, 9 months ago by mike brown.

September 21, 2009 at 1:39 pm #51193
mike brown
Participant
Has anyone ever seen these alerts before?

This is happening in both our TEST and PROD environments, which are on separate servers.

We are currently experiencing “hcimonitord” problems: the monitor hangs and does not display the status, and the processes and threads do not refresh in a timely manner, taking up to 2 to 3 minutes.

I found in the hcimonitord logs :

****

[icl :tcpi:ERR /0: hcimonitord:09/20/2009 02:39:13] write failed: Broken pipe

[cmd :cmd :INFO/0: hcimonitord:09/20/2009 02:39:13] Inrecoverable socket error. Closing connection.

[aler:aler:INFO/0: hcimonitord:09/20/2009 02:39:13] Removing alerts and wants for connection 0x20d29398

****
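
To see how often the broken-pipe error recurs, the log lines can be tallied by date. This is only a minimal sketch: the function name `count_broken_pipes_by_date` is made up for illustration, and it assumes every log line follows the bracketed `hcimonitord:MM/DD/YYYY HH:MM:SS]` layout shown above.

```shell
# Hypothetical helper: reads hcimonitord log lines on stdin and prints
# "DATE COUNT" for every date that has a "write failed: Broken pipe" line.
# Assumes the bracketed timestamp layout seen in the excerpt above.
count_broken_pipes_by_date() {
    sed -n 's|.*:\([0-9][0-9]/[0-9][0-9]/[0-9]\{4\}\) [0-9:]*\] write failed: Broken pipe.*|\1|p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It reads the log on standard input, for example `count_broken_pipes_by_date < hcimonitord.err`. The file name there is a placeholder; use whatever your monitor daemon log is actually called.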

I have bounced the server, cleaned up the DB, and run our bounce/cleanup scripts (which normally run once a month), but the issue is still occurring. Any help is greatly appreciated.

mike
Viewing 12 reply threads

- September 21, 2009 at 1:53 pm #69138
  James Cobane
  Participant
  Did you simply cycle the monitor daemon (is that part of your clean-up scripts)? I can’t say that I’ve seen that specific error, but generally, if you’re having an issue with the monitor refreshing, cycling the monitor daemon resolves it (hcisitectl -k m; hcisitectl -s m).
  
  Jim Cobane
  
  Henry Ford Health
- September 21, 2009 at 1:55 pm #69139
  Tom Rioux
  Participant
  Mike,
  
  We had something similar happen to us last week. I did the same things you described, to no avail. Finally, I ran a ps -ef and checked for any processes that might be listed twice. Sure enough, we had two processes that were out there twice. How or why that happened is beyond me. I killed off the duplicate processes, brought everything back up, and all seemed to function normally again.
  
  You may want to take a look at that to see if that may be your issue.
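  
  A rough way to automate that check, offered as a sketch rather than a supported tool: the helper below (the name `find_duplicate_commands` is hypothetical) reads command lines on standard input and prints any that appear more than once, each with its count. It assumes a true duplicate has an identical command line.

  ```shell
  # Hypothetical helper: reads one command line per row on stdin, keeps
  # the rows matching the pattern in $1, and prints those that occur
  # more than once, prefixed by the number of occurrences.
  find_duplicate_commands() {
      grep -- "$1" | sort | uniq -c | awk '$1 > 1'
  }
  ```

  A possible invocation is `ps -e -o args | find_duplicate_commands hciengine`, which feeds it only the command column so that differing PIDs and start times do not hide a duplicate.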
  
  Hope this helps….Tom
- September 21, 2009 at 2:02 pm #69140
  Tom Rioux
  Participant
  Hey Jim,
  
  In our case, it wouldn’t let us run the normal clean up scripts. It was simply hanging up. This is what I did here:
  
  1. Run the hcisitectl command with the -f option.
  
  2. Go to each process directory and remove the “pid”, if present
  
  3. Go to each daemon directory and remove the “pid”, if present
  
  4. Run “ps -ef” and grep for the site name to see if any processes are still listed.
  
  In our case, we still had two processes listed, even though they didn’t appear to be running and the pid was removed from the process directory. After killing off the two processes and bringing everything back up, all seemed to process okay.
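  
  Before removing any pid file, it may help to check whether the process it names is still alive. The following is a minimal sketch under stated assumptions: the function name `report_pid_files` is hypothetical, each file named `pid` is assumed to hold a single numeric process ID, and the function only reports and never deletes anything.

  ```shell
  # Hypothetical helper: finds every file named "pid" under directory $1
  # and reports whether the process ID stored in it is still running.
  # Assumes each pid file contains one numeric PID and nothing else.
  # Note: kill -0 also fails for a live process owned by another user,
  # so run this as the user that owns the engine processes.
  report_pid_files() {
      find "$1" -type f -name pid | while IFS= read -r pidfile; do
          pid=$(tr -cd '0-9' < "$pidfile")
          if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
              echo "LIVE  $pidfile (pid $pid)"
          else
              echo "STALE $pidfile (pid ${pid:-unknown})"
          fi
      done
  }
  ```

  It could be pointed at the site tree, for example `report_pid_files "$HCISITEDIR/exec"`, assuming the process and daemon directories live under that path on your install.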
  
  Thanks….Tom
- September 21, 2009 at 3:08 pm #69141
  Jim Kosloskey
  Participant
  Mike and Tom,
  
  I am just curious – do you have any TCP/IP ports in the ephemeral range assigned to any integration threads?
  
  Also what release are you both on?
  
  Thanks.
  
  email: jim.kosloskey@jim-kosloskey.com 30+ years Cloverleaf, 60 years IT – old fart.
- September 21, 2009 at 3:38 pm #69142
  Tom Rioux
  Participant
  Jim,
  
  We do have a handful of our outbound interfaces that are within the ephemeral ranges. They are port numbers that were assigned for us by the Portal system and are spread out across our various sites. None of these ports were in either of the duplicated processes I spoke of above.
  
  We are 5.6.2.
  
  Thanks…Tom
- September 24, 2009 at 4:25 pm #69143
  Troy Morton
  Participant
  Can you explain a little more what the -f option does on hcisitectl?
  
  I have always thought you should never remove any pid files or start/stop lock manager while engine processes may be running.
- September 25, 2009 at 3:01 pm #69144
  Deborah Ingram
  Participant
  CL 5.5: We are having similar issues where the GUI takes a very long time to update changes/pstarts/pstops/etc. We followed the directions above, but we do not have any duplicate processes. In addition, we tried doing the sitecleanup last night, and reinitialized all the databases with hcidbinit -AC but everything is still very slow from the GUI.
  
  Each time the monitord is started we get the following in our .err file:
  
  [icl :tcpi:ERR /0: hcimonitord:09/25/2009 02:50:30] write failed: Broken pipe
  
  The following seems to be reoccurring throughout the .err file:
  
  [aler:aler:WARN/0: hcimonitord:09/25/2009 02:50:30] Creating AlertAction: cascade
  
  [cmd :cmd :WARN/0: hcimonitord:09/25/2009 02:50:30] alerts client 0x20c53f38
  
  [aler:aler:WARN/0: hcimonitord:09/25/2009 02:50:30] Creating AlertAction: eocnotify
  
  [cmd :cmd :WARN/0: hcimonitord:09/25/2009 02:50:30] eonotify client 0x20c53f38
  
  And then sometimes we get this in the .err file:
  
  [icl :tcpi:ERR /0: hcimonitord:09/24/2009 13:35:39] write failed: Broken pipe
  
  [cmd :cmd :WARN/0: hcimonitord:09/24/2009 13:35:39] Invalid connection, tcpip = 0x0
  
  Any ideas?
  
  thx
- September 25, 2009 at 3:32 pm #69145
  Tom Rioux
  Participant
  Troy,
  
  The -f option merely forces the site daemons to stop, even though there are processes running. Also, I agree, normally, you don’t want to remove the pid’s while a process is running. In our case, both were a necessary evil.
  
  Tom
- September 25, 2009 at 5:48 pm #69146
  mike brown
  Participant
  Hi, thanks for the responses…
  
  I have tried all the suggestions mentioned in the responses, and I have set up a cron job to bounce the monitor for each of our 9 sites every 4 hours.
  
  The error still occurs. I am in an AIX 5.3 T8 environment, running the client on a Windows PC; jvm_args is set to “Xmx-512m”, debug is false, and netmon debug is false. I noticed that the Java and network traffic has increased by 1/3, which is a significant increase. This is a nightmare to research with no resolution. I have engaged Healthvision, and they are stumped as well.
  
  There are no duplicate processes in the “ps -ef | grep hcimonitord” output.
- September 25, 2009 at 6:19 pm #69147
  Tom Rioux
  Participant
  Do a “ps -ef | grep hciengine” and see if you see duplicate processes.
- September 25, 2009 at 7:23 pm #69148
  mike brown
  Participant
  Hi Thomas,
  
  I did a ps -ef | grep hciengine
  
  It returned a list that does include duplicate process names, but they are in different sites. Most of our sites, though not all, have up to 14 processes.
- September 25, 2009 at 7:24 pm #69149
  Russ Ross
  Participant
  Also do
  
  ps -ef | grep CloverleafHostServer
  
  to see if you have more than one instance of the hostserver running which will also cause extreme slowness of the IDE and sometimes make it completely hang.
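  
  One caveat with this check: the `grep` command itself usually shows up in the `ps -ef` output, so a single matching line may be nothing more than the grep. A small sketch that leaves the grep line out of the count (the function name `count_real_matches` is hypothetical):

  ```shell
  # Hypothetical helper: reads "ps -ef" style lines on stdin and prints
  # how many match the pattern in $1, skipping any line that contains
  # "grep". A genuine process with "grep" in its command line would be
  # skipped too, which is an accepted limitation of this sketch.
  count_real_matches() {
      awk -v pat="$1" '$0 ~ pat && $0 !~ /grep/ { n++ } END { print n + 0 }'
  }
  ```

  For example, `ps -ef | count_real_matches CloverleafHostServer` prints the number of matching processes; anything above 1 would point to the multiple-instance situation described here.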
  
  We think interfaces using port numbers in the ephemeral range above 32K are what is causing multiple instances of the hostserver to launch unexpectedly at our facility.
  
  One day I hope to remediate the interfaces using port numbers above 32K.
  
  This really showed up with a vengeance when we hosted Cloverleaf level III training at our facility and I gave everyone port numbers in the ephemeral range.
  
  Previously we had never seen the problem even once which helps with zeroing in on the potential underlying cause.
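  
  To audit which configured ports fall in that range, a check along these lines may be enough. It is a sketch only: the function name `port_in_ephemeral_range` is hypothetical, and the default bounds of 32768 to 65535 are an assumption chosen to match the "above 32K" figure mentioned here, so confirm the actual range on your own host (on AIX it should be visible through the `no` tunables `tcp_ephemeral_low` and `tcp_ephemeral_high`).

  ```shell
  # Hypothetical helper: succeeds (exit 0) when port $1 lies inside the
  # ephemeral range. $2 and $3 optionally override the lower and upper
  # bounds; the defaults 32768 and 65535 are an assumption, not a value
  # read from the system.
  port_in_ephemeral_range() {
      port=$1
      low=${2:-32768}
      high=${3:-65535}
      [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]
  }
  ```

  A call such as `port_in_ephemeral_range 40000 && echo "40000 is ephemeral"` can then be looped over the port numbers assigned to your interfaces.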
  
  Russ Ross
  RussRoss318@gmail.com
- September 25, 2009 at 7:29 pm #69150
  mike brown
  Participant
  There is only one instance in the output of
  
  ps -ef | grep CloverleafHostServer

The forum ‘Cloverleaf’ is closed to new topics and replies.