Cloverleaf Crash Every Two Weeks


  • #120864
    Jared Parish
    Participant

      Hello Clovertech and Friends,

      I think I’ve got an OS issue and I need to bend the ear of those with Red Hat/Linux knowledge.

      We have Cloverleaf 19.1.2 set up in a high-availability configuration. Almost exactly every two weeks, all Cloverleaf processes panic, but only on our second node. When we run Cloverleaf on Node 1, this does not happen. The process log shows the following:
      [pti :sele:WARN/0: proc1_cmd:09/13/2023 16:41:12] Select returns -1 4: Interrupted system call
      [dbi :dbi :ERR /0:proc1_xlate:09/13/2023 16:41:12] (-925) 'RDM Embedded DB error: "LMC error: -925
      [dbi :dbi :ERR /0:proc1_xlate:--/--/---- --:--:--] Lock manager communication error
      [dbi :dbi :ERR /0:proc1_xlate:--/--/---- --:--:--] C errno = 32: Broken pipe"
      [dbi :dbi :ERR /0:proc1_xlate:--/--/---- --:--:--] '
      PANIC: "(errnum > -900) || (errnum < -976)"
      PANIC: Calling "pti" for thread proc1_cmd
      ----- Scheduler State -----
      I’ve reviewed many system logs but can’t find anything interesting. The journalctl.log doesn’t show any ‘smoking guns’. I don’t believe it’s an HA issue.

      Anyone know what else I might be able to check?

      Current Configuration:

      OS: Red Hat Enterprise Linux release 8.8 (Ootpa)

      Running on VMware.

      Cloverleaf: 19.1.2.1P High Availability.

      - Jared Parish
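Since the panics are described as recurring almost exactly every two weeks, it may help to pull the panic timestamps out of the process logs and measure the gaps between them, so the interval can be compared against scheduled jobs on Node 2. The following is a minimal sketch, not a Cloverleaf tool: it assumes each log prefix ends with a `MM/DD/YYYY HH:MM:SS]` timestamp as in the excerpt above, and that a panic shows up as one or more lines starting with `PANIC:`. The function names are made up for illustration.

```python
import re
from datetime import datetime

# Matches the timestamp at the end of a log prefix such as
# "[dbi :dbi :ERR /0:proc1_xlate:09/13/2023 16:41:12]".
# Continuation lines stamped "--/--/---- --:--:--" do not match.
STAMP = re.compile(r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\]")


def panic_times(lines):
    """Return the last timestamp seen before each run of PANIC lines."""
    times = []
    last_stamp = None
    in_panic = False
    for line in lines:
        match = STAMP.search(line)
        if match:
            last_stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
            in_panic = False
        elif line.lstrip().startswith("PANIC:"):
            # Count consecutive PANIC lines as a single event.
            if not in_panic and last_stamp is not None:
                times.append(last_stamp)
            in_panic = True
    return times


def intervals(times):
    """Return the gaps between consecutive panic timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    # Lines taken from the log excerpt in this post.
    excerpt = [
        "[pti :sele:WARN/0: proc1_cmd:09/13/2023 16:41:12] Select returns -1 4: Interrupted system call",
        "[dbi :dbi :ERR /0:proc1_xlate:--/--/---- --:--:--] Lock manager communication error",
        'PANIC: "(errnum > -900) || (errnum < -976)"',
        'PANIC: Calling "pti" for thread proc1_cmd',
    ]
    for when in panic_times(excerpt):
        print(when.isoformat())
```

Feeding the function the lines of several process logs, oldest first, and passing the result to `intervals` would show whether the gap is a fixed number of days and whether the time of day repeats. A fixed gap would point toward something scheduled (a cron job, backup, or VM-level task) rather than a random fault.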

      • #120865
        Paul Stein
        Participant

          Are you running active/active or active/passive?

          Agreed, it doesn’t look like an HA issue, except for the RDM mention in the logs. It looks like the local volume for the shared storage may be temporarily unavailable on that node for some reason.

          When this happens, I’d run ‘pcs status’ on the affected node and see if your LVM resource is running.
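Because the failure window may be short, one option is to capture `pcs status` output automatically under a timestamp instead of running it by hand after the fact. The sketch below is only an illustration of that idea: `snapshot` is a hypothetical helper name, and how it gets triggered (cron, a monitoring hook, a wrapper script) is left as an assumption.

```python
import subprocess
from datetime import datetime


def snapshot(command, timeout=30):
    """Run a diagnostic command and return its output under a timestamp header."""
    stamp = datetime.now().isoformat(timespec="seconds")
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
        body = result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Record the failure instead of raising, so a scheduled run keeps going.
        body = f"could not run command: {exc}\n"
    return f"=== {stamp} {' '.join(command)} ===\n{body}"


if __name__ == "__main__":
    # `pcs status` is the command suggested in this reply.
    print(snapshot(["pcs", "status"]), end="")
```

Redirecting the output of repeated runs to a file would leave a timestamped history that can be lined up against the panic time in the Cloverleaf process log.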

        • #120866
          Jared Parish
          Participant

            Hi Paul,

            Active/passive. pcs status showed everything as ‘online’, with no mention of issues.

            - Jared Parish

          • #120867
            Paul Stein
            Participant

              Looks like you have it covered. I’m not sure I can offer much more help. This may be a good one for Red Hat support if you have it.

              Out of curiosity, though: are the panics happening on the passive node at the time?
