hcimonitord timed out

This topic has 4 replies, 3 voices, and was last updated 15 years, 7 months ago by Scherie Drewa.

Creator

Topic
November 10, 2009 at 6:16 pm #51323
Scherie Drewa
Participant
In the hcimonitord.err log, I see this error on one site:

TCL error: Unable to contact hcimonitord: timed out

I have two sites set up exactly the same way for a lastr alert:

nomsgsrecvd_restart fr_bu_team_a nomsgsrecvd “no messages received from BUMC Team Throughput for 30 minutes – restarting cloverleaf interface”

When this command is run from the command line, there is no error.

Running version 5.6

Thanks, Scherie
Viewing 3 reply threads

Author

Replies
- November 10, 2009 at 6:46 pm #69703
  Troy Morton
  Participant
  I would verify that you only have one monitor daemon running for the site. Sometimes, for various reasons, a second monitor daemon is inadvertently started for a site, which causes strange behavior.
  
  On UNIX: ps -ef | grep <site name> | grep hcimonitord
  
  On Windows: use task manager
  
  If you have more than one hcimonitord running, I suggest:
  
  1 – Stop the hcimonitord with “hcisitectl -k m”
  
  2 – Manually kill whatever hcimonitord process(es) are left for that site
  
  3 – Restart hcimonitord using “hcisitectl -s m”
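
  The check for a duplicate daemon can be sketched as a small shell helper. This is only an illustration: `count_monitord` is a hypothetical name, and it counts every hcimonitord line in the `ps` output without filtering by site, because how the site name appears in the process listing is not shown in this thread.

  ```shell
  # Count hcimonitord processes in ps-style output read from stdin.
  # Hypothetical helper; it does not filter by site.
  count_monitord() {
      # The [h] bracket keeps this awk command line from matching itself
      awk '/[h]cimonitord/ { n++ } END { print n + 0 }'
  }
  ```

  It would be used as `ps -ef | count_monitord`; with several sites on one host, anything above one per site is worth a closer look.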
  
  Hope this helps.
- November 10, 2009 at 9:48 pm #69704
  Scherie Drewa
  Participant
  Only have one hcimonitord running.
  
  This is the error:
  
  [aler:aler:WARN/0: hcimonitord:11/10/2009 13:07:40] Alert #72 triggered.
  
  alert: {VALUE lastr} {SOURCE {fr_bu_team_a }} {MODE actual} {WITH -2} {COMP {>= 180}} {FOR {nmin 1}} {WINDOW {* * * * * *}} {HOST {}} {ACTION {{tcl {nomsgsrecvd_restart fr_bu_team_a nomsgsrecvd “no messages received from BUMC Team Throughput for 30 minutes – restarting cloverleaf interface”}}}}
  
  action: tcl proc nomsgsrecvd_restart fr_bu_team_a nomsgsrecvd “no messages received from BUMC Team Throughput for 30 minutes – restarting cloverleaf interface”
  
  [aler:aler:ERR /0: hcimonitord:11/10/2009 13:08:10] Tcl error: Unable to contact hcimonitord: timed out
  
  [aler:aler:WARN/0: hcimonitord:11/10/2009 13:08:10] Completed Cascade Actions
  
  [cmd :cmd :INFO/0: hcimonitord:11/10/2009 13:08:10] Received command: ‘summary -conn fr_bu_team_a’
  
  [cmd :cmd :INFO/0: hcimonitord:11/10/2009 13:08:10] Doing ‘summary’ command with args ‘-conn fr_bu_team_a’
  
  [icl :tcpi:ERR /0: hcimonitord:11/10/2009 13:08:10] write failed: Broken pipe
  
  [cmd :cmd :INFO/0: hcimonitord:11/10/2009 13:08:10] Inrecoverable socket error. Closing connection.
  
  [aler:aler:INFO/0: hcimonitord:11/10/2009 13:08:10] Removing alerts and wants for connection 0x20d78428
- November 11, 2009 at 1:41 pm #69705
  Bob Richardson
  Participant
  Greetings,
  
  We are on AIX 5.3 TL8 SP8 (lots of alphabet soup), and we experienced monitor daemon timeouts until Healthvision support looked at our alerts (we exec ksh scripts). They determined that we needed to run the scripts in the background; otherwise the monitor daemon hangs waiting for the script to finish, which stalls monitoring for the entire site. Try adding an ampersand (&) to the exec in your Tcl action script (I'm not sure what the syntax is on Windows, if that is your platform) and see whether that clears up your problem.
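
  The effect of the ampersand can be sketched in plain shell. In Tcl, a trailing `&` argument to `exec` runs the command in the background in the same spirit. Here `run_detached` is a hypothetical helper and `sleep 5` is only a stand-in for a slow restart script, so treat this as an illustration of the idea rather than the actual alert action.

  ```shell
  # Launch a long-running command detached, so the caller returns at once.
  # Hypothetical helper; not part of Cloverleaf.
  run_detached() {
      # Redirect all standard streams so the caller is not left waiting on a pipe
      "$@" </dev/null >/dev/null 2>&1 &
  }

  start=$(date +%s)
  run_detached sleep 5     # stand-in for a slow alert script
  end=$(date +%s)
  echo "returned after $((end - start)) second(s)"
  ```

  Without the trailing `&` inside `run_detached`, the caller would block for the full duration of the script, which is the hang described above.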
  
  On another front, how long has it been since you performed maintenance on this site? That is, shutting it down, clearing shared memory, initializing your databases, and starting it back up. We do a quarterly site scrub and reboot (again, on AIX) because Healthvision support once told us that if you leave Cloverleaf running for more than about three months, strange things start to happen.
  
  I hope that this will prove useful to you.
- November 12, 2009 at 5:33 pm #69706
  Scherie Drewa
  Participant
  At first we thought about clearing the shared memory. Since the same error happened in test, that's where I tried clearing it, but the error continued.
  
  We did find the problem, though. The scripts were different in each site (the one that worked and the one that got the error). After comparing them, it appears that someone fixed the proc that errored long ago but didn't copy the fixed proc to the other sites. So this alert has not worked for a very long time.
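
  A comparison like this can be scripted so that drift between sites is caught sooner. The sketch below assumes each site keeps its procs under its own directory; `compare_site_file`, the directory arguments, and the relative file path are all hypothetical placeholders, not actual Cloverleaf paths.

  ```shell
  # Report whether one file is identical in two site directories.
  # Arguments: <site A dir> <site B dir> <file path relative to each site dir>
  # Hypothetical helper; directory layout is an assumption.
  compare_site_file() {
      if cmp -s "$1/$3" "$2/$3"; then
          echo "identical: $3"
      else
          echo "DIFFERENT: $3"
          # Show the differing lines; diff exits non-zero when files differ
          diff "$1/$3" "$2/$3" || true
      fi
  }
  ```

  Running it for each shared proc file after a fix is made in one site would show which other sites still carry the old version.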
  
  Thanks for all the suggestions.
  
  Scherie