› Clovertech Forums › Read Only Archives › Cloverleaf › Cloverleaf › Protocol reconnect fail after HA resource group stop/start
AIX 5.3
HACMP 5.3
LPAR on each of two p550 servers, ESS disk, SDD, etc etc.
I know the netmonitor uses cmd_port.
So isn’t the script in the wrong order?
***
rm -f $HCISITEDIR/exec/processes/$pname/cmd_port
rm -f $HCISITEDIR/exec/processes/$pname/pid
echo "### Stopping process '$pname' in $HACMP_HCI_ROOT/$site ### "
echo
hcienginestop -p $pname
sleep 1
***
Should be:
***
echo "### Stopping process '$pname' in $HACMP_HCI_ROOT/$site ### "
echo
hcienginestop -p $pname
sleep 1
rm -f $HCISITEDIR/exec/processes/$pname/cmd_port
rm -f $HCISITEDIR/exec/processes/$pname/pid
***
And do you really need the rm commands?
In my humble opinion, hcienginestop is not working because it no longer knows which port to talk to or which pid is running; the rm commands have already deleted that information. Change the order and see if things clear up.
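If you do keep the rm commands, here is a hedged sketch of the reordered sequence ($HCISITEDIR, hcienginestop, and the processes layout are from the script above; the guard around rm is my own addition, not Quovadx's): stop the engine first, while cmd_port and pid still exist, then remove them only once the pid is really gone.

```shell
#!/bin/sh
# Sketch only: stop first, while cmd_port and pid still exist,
# then clean up. The guard keeps a still-running engine from
# losing its bookkeeping files.
stop_process() {
    pname="$1"
    procdir="$HCISITEDIR/exec/processes/$pname"

    echo "### Stopping process '$pname' ###"
    hcienginestop -p "$pname"
    sleep 1

    # Remove the bookkeeping files only if the engine pid is gone,
    # so a live engine is never orphaned.
    if [ ! -f "$procdir/pid" ] || ! kill -0 "$(cat "$procdir/pid")" 2>/dev/null; then
        rm -f "$procdir/cmd_port" "$procdir/pid"
    fi
}
```

This way a failed hcienginestop leaves the files in place for a retry instead of "crashing" the process unconditionally.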
-mh
Or are you just typing in the numbers?
If you are using the hosts and services files, you need to make sure they are the same between systems.
Dumb, but that got me once.
Other than that, I would have to think about it some more.
But this is an interesting problem.
Michael, the scripts (with “rm” before “enginestop”) are as delivered by Quovadx. When I got them at the end of ’05, or early ’06, I asked the same question you had several different ways… I wanted to make sure that stopping the process in that way was the correct way to do it. Quovadx assured me it would work. They indicated they are simulating a crash (as if the server node crashed). They said the system has to work in that case anyway, so they take care of cleaning things up on the startup, and take advantage of that by “crashing” the process on the way down, considerably reducing shutdown time.
Whether or not it works as intended, and as delivered, I don't know. When I attempted to implement it last year, I never got the cluster working correctly, so I couldn't test these scripts. We have been living with a cluster that won't fallover since then, knowing we would have to manually move things if something on the prod server failed.
Now we are moving to new hardware, OS, HA, and Cloverleaf, and I have the opportunity to really test things.
John,
I am using /etc/hosts, but not services. I am manually keying in the ports, and using the labels listed from the “List” button (pulled from /etc/hosts) for the client server name. I think that might be part of the problem.
New found info.
I turned up the EO config (engine output logging) for the entire process. When the process starts after an RG (resource group) start, the protocol thread attempts connection to x.x.x.62. When the process starts from the GUI, the protocol thread attempts connection to x.x.x.82.
Here’s the rub…
My current real prod/test servers use x.x.x.61 and x.x.x.62, and are named qdxprod and qdxtest in their own /etc/hosts and in DNS.
My new servers use x.x.x.81 and x.x.x.82 for the service labels qdxtest and qdxprod.
I have set my NSORDER in /etc/environment
NSORDER=local,bind,nis
so resolution SHOULD go to /etc/hosts first. Why or how the process is getting bind (DNS) over local name resolution when the HACMP scripts start the process, I don't know.
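As a quick cross-check (a sketch; qdxtest is the label from this thread, and the AIX host command honors NSORDER), you can compare what /etc/hosts says with what the resolver actually returns:

```shell
#!/bin/sh
# Print the NSORDER in effect, the /etc/hosts entry for a label, and
# the resolver's answer. If the last two disagree, bind (DNS) is
# winning over local resolution.
check_label() {
    label="$1"
    echo "NSORDER=${NSORDER:-unset}"
    awk -v l="$label" '$0 !~ /^#/ {
        for (i = 2; i <= NF; i++) if ($i == l) print "hosts file:", $1
    }' /etc/hosts
    host "$label" 2>/dev/null || echo "resolver lookup for $label failed"
}
check_label "${1:-qdxtest}"
```

Running it once from the command line and once from the HA start script would show whether the two contexts resolve the label differently.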
Things I have tried since my initial post:
1. Client thread IP address changed from qdxtest to x.x.x.82: Success
2. Modify the /etc/hosts entry for x.x.x.82 to be "x.x.x.82 qdxtest qdxtodd", change client thread IP address to qdxtodd (qdxtodd doesn't exist in DNS): Success
Things I will be trying:
1. Reboot nodes to ensure local,bind,nis is actually set (not sure when I made that change and if I rebooted since then).
2. Change the HACMP startup scripts delivered by Quovadx from
su hci -c "hcienginerun…"
to
su - hci -c "hcienginerun…"
3. Call support and see if I can make David say…. hmmmm.
Huzzah!!
I found the bandaid that works.
First off, using su - hci just breaks more things. It allowed things to work for this connection (which happens to be in the current default site), but every process in non-default sites didn't even come up. Using su - hci caused the $HCISITEDIR variable to get reset to default, and the process couldn't be found.
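The difference can be demonstrated without Cloverleaf at all (a stand-in demo; the paths are made up, and the throwaway profile plays the role of hci's .profile): an exported variable survives a plain child shell, but a login shell re-sources the user's profile, which can clobber it.

```shell
#!/bin/sh
# Stand-in for `su hci -c` vs `su - hci -c`.
HCISITEDIR=/hci/sites/site_b
export HCISITEDIR

# `su hci -c ...` behaves like this: the caller's environment survives.
sh -c 'echo "plain shell: $HCISITEDIR"'

# `su - hci -c ...` behaves like this: a profile runs first and can
# reset HCISITEDIR back to the default site.
profile=$(mktemp)
echo 'HCISITEDIR=/hci/sites/default; export HCISITEDIR' > "$profile"
PROFILE="$profile" sh -c '. "$PROFILE"; echo "login shell: $HCISITEDIR"'
rm -f "$profile"
```

That matches what I saw: the HA script exports $HCISITEDIR for the site it is starting, and the `-` wipes it back to the default site.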
That’s when the LED lit over my head =)
Set the env variable NSORDER in the hci.profile file used by the HA scripts. All works as expected now.
So, now I have to find out why AIX doesn't use my NSORDER env variable listed in /etc/environment and /etc/rc.tcpip when a "su hci -c .." is issued. It's not a huge deal now, because when we go live it should work (DNS will equal x.x.x.61 or .62 as required), and the NSORDER in hci.profile should suffice for my testing.
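For reference, the whole workaround amounts to a two-line addition to hci.profile (or whichever profile your HA scripts actually source; the value is the same ordering set system-wide above):

```shell
# In hci.profile, sourced by the HA start scripts, so the setting
# survives a plain "su hci -c ...":
NSORDER=local,bind,nis
export NSORDER
```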