hcimonitord timed out


  • #51323
    Scherie Drewa
    Participant

      In the hcimonitord.err log, I see this error on one site:

      TCL error: Unable to contact hcimonitord: timed out

      I have two sites set up exactly the same way, each with a lastr alert:

      nomsgsrecvd_restart fr_bu_team_a nomsgsrecvd “no messages received from BUMC Team Throughput for 30 minutes – restarting cloverleaf interface”

      When this command is run from the command line, there is no error.

      Running version 5.6

      Thanks, Scherie

      • #69703
        Troy Morton
        Participant

          I would verify that you have only one monitor daemon running for the site. Sometimes, for various reasons, a second monitor daemon is inadvertently started for a site, which causes strange behavior.

          On UNIX: ps -ef | grep <site name> | grep hcimonitord

          On Windows: use task manager

          If you have more than one hcimonitord running, I suggest:

          1 – Stop the hcimonitord with “hcisitectl -k m”

          2 – Manually kill whatever hcimonitord process(es) are left for that site

          3 – Restart hcimonitord using “hcisitectl -s m”
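The duplicate-daemon check above can be sketched as a small shell helper. This is only a sketch under assumptions: the function name count_monitor_daemons and the site name mysite are hypothetical, and it assumes the site name appears somewhere in the ps line of that site's hcimonitord, as the grep above implies. Like the grep, it matches substrings, so a site named mysite would also match mysite2.

```shell
# Count hcimonitord processes whose ps line also mentions the given site.
# Reads `ps -ef`-style text on stdin so it can be run against saved output.
# ASSUMPTION: the site name shows up in the daemon's ps line.
count_monitor_daemons() {
    awk -v site="$1" '
        # keep lines naming both the site and hcimonitord, skip grep itself
        index($0, site) && index($0, "hcimonitord") && !index($0, "grep") { n++ }
        END { print n + 0 }
    '
}

# Example use (commented out so nothing is stopped by accident):
# n=$(ps -ef | count_monitor_daemons mysite)
# if [ "$n" -gt 1 ]; then
#     hcisitectl -k m    # step 1: stop the monitor daemon
#     # step 2: kill any leftover hcimonitord PIDs for the site by hand
#     hcisitectl -s m    # step 3: start the monitor daemon again
# fi
```

A count of 1 means the site is in the expected state; anything higher is the case the three steps address.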

          Hope this helps.

        • #69704
          Scherie Drewa
          Participant

            Only have one hcimonitord running.

            This is the error:

            [aler:aler:WARN/0:  hcimonitord:11/10/2009 13:07:40] Alert #72 triggered.

            alert: {VALUE lastr} {SOURCE {fr_bu_team_a }} {MODE actual} {WITH -2} {COMP {>= 180}} {FOR {nmin 1}} {WINDOW {* * * * * *}} {HOST {}} {ACTION {{tcl {nomsgsrecvd_restart fr_bu_team_a nomsgsrecvd “no messages received from BUMC Team Throughput for 30 minutes – restarting cloverleaf interface”}}}}

            action: tcl proc nomsgsrecvd_restart fr_bu_team_a nomsgsrecvd “no messages received from BUMC Team Throughput for 30 minutes – restarting cloverleaf interface”

            [aler:aler:ERR /0:  hcimonitord:11/10/2009 13:08:10] Tcl error: Unable to contact hcimonitord: timed out

            [aler:aler:WARN/0:  hcimonitord:11/10/2009 13:08:10] Completed Cascade Actions

            [cmd :cmd :INFO/0:  hcimonitord:11/10/2009 13:08:10] Received command: ‘summary -conn fr_bu_team_a’

            [cmd :cmd :INFO/0:  hcimonitord:11/10/2009 13:08:10] Doing ‘summary’ command with args ‘-conn fr_bu_team_a’

            [icl :tcpi:ERR /0:  hcimonitord:11/10/2009 13:08:10] write failed: Broken pipe

            [cmd :cmd :INFO/0:  hcimonitord:11/10/2009 13:08:10] Inrecoverable socket error.  Closing connection.

            [aler:aler:INFO/0:  hcimonitord:11/10/2009 13:08:10] Removing alerts and wants for connection 0x20d78428

          • #69705
            Bob Richardson
            Participant

              Greetings,

              We are on AIX 5.3 TL8 SP8 (lots of alphabet soup) and experienced monitor daemon timeouts until Healthvision support took a look at our alerts, which exec ksh scripts. They figured out that we needed to run the scripts in the background; otherwise the monitor hangs waiting for the script to complete, which stalls the monitoring function for the entire site. You might try adding the ampersand (&) to the exec in your Tcl action script (I'm not sure what the syntax is on Windows, if that is your platform) and see whether that clears up your problem.
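The effect of backgrounding can be illustrated in plain shell. The sketch below is an illustration, not the script Healthvision reviewed: run_detached and restart_interface.ksh are hypothetical names. It starts a command in the background with all three standard streams redirected, so the caller gets control back immediately instead of waiting for the command to finish. In a Tcl action, a trailing & on exec plays the same role.

```shell
# Launch a command in the background and return at once, printing its PID.
# run_detached is a hypothetical helper name used only for illustration.
run_detached() {
    # nohup keeps the job alive if the caller goes away; redirecting stdin,
    # stdout and stderr means no open pipe ties the job to the caller.
    nohup "$@" >/dev/null 2>&1 </dev/null &
    echo "$!"
}

# Example (hypothetical script name):
# pid=$(run_detached ./restart_interface.ksh fr_bu_team_a)
# echo "restart running as PID $pid"
```

The redirections matter as much as the ampersand: a background job that still holds the caller's output pipe can keep the caller waiting.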

              On another front, how long has it been since you performed maintenance on this site, that is, shutting it down, clearing shared memory, initializing your databases, and starting it back up? We do a quarterly site scrub and reboot (again, AIX) because Healthvision support once told us that if you leave Cloverleaf running for more than about 3 months, strange things start to happen.

              I hope that this will prove useful to you.

            • #69706
              Scherie Drewa
              Participant

                At first we thought about clearing the shared memory. Since the same error happened in test, that’s where I tried clearing it. The error continued.

                We did find the problem, though. The scripts were different in each site (the one that worked and the one that got the error). After comparing the scripts, it appears that someone fixed the proc that errored long ago but didn’t copy the fixed proc to the other sites. So this alert has not worked for a very long time.
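A check like the comparison described above can be sketched in shell. The function name report_drift and the file paths in the example are hypothetical; the sketch compares one reference copy of a script against the copies in other sites and reports which ones differ or are missing.

```shell
# Compare a reference copy of a script against other copies.
# Usage: report_drift REFERENCE OTHER...
# report_drift is a hypothetical helper name used only for illustration.
report_drift() {
    ref="$1"
    shift
    for other in "$@"; do
        if [ ! -f "$other" ]; then
            echo "MISSING $other"
        elif cmp -s "$ref" "$other"; then
            # cmp -s is silent and exits 0 only when the files are identical
            echo "SAME    $other"
        else
            echo "DIFFERS $other"
        fi
    done
}

# Example (hypothetical paths):
# report_drift /sites/site_a/tclprocs/nomsgsrecvd.tcl \
#              /sites/site_b/tclprocs/nomsgsrecvd.tcl
```

For any copy reported as DIFFERS, running diff on the two files shows what changed.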

                Thanks for all the suggestions.

                Scherie
