Cloverleaf Restart Steps After A Cluster Failure

This topic has 4 replies, 4 voices, and was last updated 10 years ago by Max Drown (Infor).

Creator

Topic
June 12, 2015 at 4:06 pm #54719
Gina Borden
Participant
A couple of weeks ago, a hardware failure in our network caused my Cloverleaf cluster to lose its connection. It took about 4 hours to bring my Cloverleaf interfaces back online. Part of that time went to working out why I could see the cluster but couldn’t get the Cloverleaf GUI to start. Once I stopped and restarted the cluster services on my AIX box, I was able to get in. By then, because of the hard crash, databases were corrupted and messages were hung, and the hung messages had to be dropped to a file to minimize data loss.

My question today is, what steps could I have taken to minimize the downtime in this situation?

Here is info about my server:

26 – Sites

121 – Processes

AIX 6.1.0.0

Cloverleaf 6.0.2.0

Any help is greatly appreciated.

Thanks,

Gina

Author

Replies
- June 16, 2015 at 2:46 pm #82703
  Rob Lindsey
  Participant
  Due to some unusual circumstances I took out the “autostart” of the CL application and it is now a manual process for my team. In the 5 years of being here at this company we have had 4 failovers due to hardware issues. Every single time, the CL application started but with issues. It is easier for us to go in and manually do the startup of each site after checking the databases.
  
  I do have an automated script that we run from the command line, and it seems to help us. Part of the script is below. A for loop ahead of these lines gets the sites from the system.
  
      setsite $site
      hcisitecleanup
      rm $HCISITEDIR/exec/monitorShmemFile
      rm $HCISITEDIR/exec/databases/vista.taf
      rm $HCISITEDIR/lock/*
      hcimsiutil -Z
      hcidbinit -if
      keybuild rlog
      keybuild elog
      dchain rlog
      dchain elog
  
  I know this is not exactly what you wanted to read, but the steps above rebuild the databases to try to fix the issue before we have to resort to writing the msgs out to a file and doing resends.
  
  Rob
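
The per-site steps in Rob's reply can be wrapped in a loop along the following lines. This is only a sketch, not Rob's actual script: the original for loop that discovers the sites is not shown, so here the site names are simply passed in as arguments. The names `run`, `recover_site`, `recover_all`, and `DRY_RUN` are hypothetical helpers introduced for illustration, and the sketch assumes `setsite` and the `hci*`, `keybuild`, and `dchain` commands are available in the current shell, as they are in the post. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only: loops the recovery commands from the post over a list of sites.
# Assumption: site names are given as arguments (site discovery is not shown
# in the original post). DRY_RUN=1 (the default) prints instead of executing.

DRY_RUN="${DRY_RUN:-1}"

# Print the command string in dry-run mode; otherwise evaluate it in the
# current shell so that setsite can change the environment for later steps.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'would run: %s\n' "$1"
    else
        eval "$1"
    fi
}

# Run the recovery steps for one site, in the same order as the post.
recover_site() {
    site="$1"
    echo "== site: $site =="
    run 'setsite "$site"' || return 1
    # Guard: never run the rm steps with an empty site directory.
    if [ "$DRY_RUN" != "1" ] && [ -z "${HCISITEDIR:-}" ]; then
        echo "HCISITEDIR is not set after setsite $site; skipping" >&2
        return 1
    fi
    run 'hcisitecleanup'
    run 'rm "$HCISITEDIR"/exec/monitorShmemFile'
    run 'rm "$HCISITEDIR"/exec/databases/vista.taf'
    run 'rm "$HCISITEDIR"/lock/*'
    run 'hcimsiutil -Z'
    run 'hcidbinit -if'
    for db in rlog elog; do run "keybuild $db"; done
    for db in rlog elog; do run "dchain $db"; done
}

# Recover every site named on the command line, reporting failures per site.
recover_all() {
    for s in "$@"; do
        recover_site "$s" || echo "recovery failed for $s" >&2
    done
}
```

After sourcing the file, something like `recover_all site1 site2` prints the planned commands for each site, and setting `DRY_RUN=0` first would actually execute them. Reviewing the dry-run output before a real run seems prudent, since several of the steps delete files.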
- June 17, 2015 at 4:03 pm #82704
  Max Drown (Infor)
  Keymaster
  Gina, are you using the Cloverleaf HA Scripts? These scripts are designed to cleanly bring up/down Cloverleaf in the event of a scheduled downtime or a crash.
  
  -- Max Drown (Infor)
- July 22, 2015 at 4:41 pm #82705
  bill whatley
  Participant
  I can echo Rob’s sentiments, although we still let the HA startup scripts start the engine and the threads. The one non-admin initiated failover in the last 5+ years here didn’t work because essential resources became unavailable.
- July 22, 2015 at 4:47 pm #82706
  Max Drown (Infor)
  Keymaster
  The HA stack has improved dramatically over the years at every level: the operating system, SAN disks, the Cloverleaf Raima database, and the HA scripts. I’d recommend taking a look at the latest technology available.
  
  -- Max Drown (Infor)