Cloverleaf Restart Steps After A Cluster Failure


  • #54719
    Gina Borden
    Participant

      A couple of weeks ago we experienced a hardware failure in our network that caused my Cloverleaf cluster to lose connection.  Bringing my Cloverleaf interfaces back online took about 4 hours.  Part of that was trying to determine why I could see the cluster but couldn’t get the Cloverleaf GUI to start.  Once I stopped and started my cluster services on my AIX box, I was able to get in.  At that point, due to the hard crash, databases were corrupted and messages were hung, which had to be dropped to a file to try to minimize loss of data.

      My question today is, what steps could I have taken to minimize the downtime in this situation?

      Here is info about my server:

      26 – Sites

      121 – Processes

      AIX 6.1.0.0

      Cloverleaf 6.0.2.0

      Any help is greatly appreciated.

      Thanks,

      Gina

      • #82703
        Rob Lindsey
        Participant

          Due to some unusual circumstances I took out the “autostart” of the CL application and it is now a manual process for my team.  In the 5 years of being here at this company we have had 4 failovers due to hardware issues.  Every single time, the CL application started but with issues.  It is easier for us to go in and manually do the startup of each site after checking the databases.

          I do have an automated script that we run from the command line, and it seems to help us.  Below is part of the script.  A for loop before these lines gets the sites from the system.

             setsite $site
             hcisitecleanup
             rm $HCISITEDIR/exec/monitorShmemFile
             rm $HCISITEDIR/exec/databases/vista.taf
             rm $HCISITEDIR/lock/*
             hcimsiutil -Z
             hcidbinit -if
             keybuild rlog
             keybuild elog
             dchain rlog
             dchain elog

          I know it is not exactly what you wanted to read, but the above rebuilds the databases to try to fix the issue before we have to write the messages out to a file and resend them.

          Rob
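
The for loop that Rob mentions is not shown in the post. A minimal sketch of how such a wrapper might look is below; it is not Rob's actual script. It assumes the site names are passed as arguments (the original gets them from the system in a way the post does not describe), that the Cloverleaf environment is already loaded so `setsite` and the other commands are available, and that `$HCISITEDIR` ends in the site name after `setsite`. The function name `cleanup_sites` and the `DRY_RUN` switch are hypothetical additions, and `rm -f` is used in place of plain `rm` so a missing file does not stop the run.

```sh
#!/bin/sh
# Sketch: run the per-site cleanup commands from the post for each site.
# Assumptions (not from the original script): site names are passed as
# arguments, and the Cloverleaf environment is already loaded so that
# setsite and the other commands are available.
# DRY_RUN defaults to 1, which only prints which sites would be cleaned.

cleanup_sites() {
    for site in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would clean site: $site"
            continue
        fi

        setsite "$site"

        # Refuse to run rm unless HCISITEDIR points at this site
        # (assumes the site directory path ends in the site name).
        case "$HCISITEDIR" in
            */"$site") ;;
            *)
                echo "HCISITEDIR not set for $site, skipping" >&2
                continue
                ;;
        esac

        # Same commands, in the same order, as in the post.
        hcisitecleanup
        rm -f "$HCISITEDIR/exec/monitorShmemFile"
        rm -f "$HCISITEDIR/exec/databases/vista.taf"
        rm -f "$HCISITEDIR"/lock/*
        hcimsiutil -Z
        hcidbinit -if
        keybuild rlog
        keybuild elog
        dchain rlog
        dchain elog
    done
}
```

With the function loaded, `cleanup_sites site_a site_b` (hypothetical site names) only lists the sites, and `DRY_RUN=0 cleanup_sites site_a site_b` runs the commands. Defaulting to a dry run is a deliberate choice here, since the commands delete files and reinitialize databases.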

        • #82704

          Gina, are you using the Cloverleaf HA Scripts? These scripts are designed to cleanly bring up/down Cloverleaf in the event of a scheduled downtime or a crash.

          -- Max Drown (Infor)

        • #82705
          bill whatley
          Participant

            I can echo Rob’s sentiments, although we still let the HA startup scripts start the engine and the threads.  The one non-admin initiated failover in the last 5+ years here didn’t work because essential resources became unavailable.

          • #82706

            The HA system has improved dramatically over the years: at the operating system level, in SAN disks, in the Cloverleaf Raima database, and in the HA scripts. I’d recommend taking a look at the latest technology available.

            -- Max Drown (Infor)
