› Clovertech Forums › Read Only Archives › Cloverleaf › Cloverleaf › Protocol reconnect fail after HA resource group stop/start
AIX 5.3
HACMP 5.3
LPAR on each of two p550 servers, ESS disk, SDD, etc etc.
I know the netmonitor uses cmd_port.
So isn’t the script in the wrong order?
***
rm -f $HCISITEDIR/exec/processes/$pname/cmd_port
rm -f $HCISITEDIR/exec/processes/$pname/pid
echo "### Stopping process '$pname' in $HACMP_HCI_ROOT/$site ### "
echo
hcienginestop -p $pname
sleep 1
***
Should be:
***
echo "### Stopping process '$pname' in $HACMP_HCI_ROOT/$site ### "
echo
hcienginestop -p $pname
sleep 1
rm -f $HCISITEDIR/exec/processes/$pname/cmd_port
rm -f $HCISITEDIR/exec/processes/$pname/pid
***
And do you really need the rm commands?
In my humble opinion, hcienginestop is not working because it no longer knows which port to talk to or which pid is running; the rm commands have already deleted that information. Change the order and see if things clear up.
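If you do keep the rm commands, here is a hedged sketch of the reordered sequence ($HCISITEDIR, hcienginestop, and the processes layout are from the script above; the guard around rm is my own addition, not Quovadx's): stop the engine first, while cmd_port and pid still exist, then remove them only once the pid is really gone.

```shell
#!/bin/sh
# Sketch only: stop first, while cmd_port and pid still exist,
# then clean up. The guard keeps a still-running engine from
# losing its bookkeeping files.
stop_process() {
    pname="$1"
    procdir="$HCISITEDIR/exec/processes/$pname"

    echo "### Stopping process '$pname' ###"
    hcienginestop -p "$pname"
    sleep 1

    # Remove the bookkeeping files only if the engine pid is gone,
    # so a live engine is never orphaned.
    if [ ! -f "$procdir/pid" ] || ! kill -0 "$(cat "$procdir/pid")" 2>/dev/null; then
        rm -f "$procdir/cmd_port" "$procdir/pid"
    fi
}
```

This way a failed hcienginestop leaves the files in place for a retry instead of "crashing" the process unconditionally.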
-mh
Or are you just typing in the numbers?
If you are using the hosts and services files, you need to make sure they are the same between systems.
Dumb, but that got me once.
Other than that, I would have to think about it some more.
But this is an interesting problem.
Michael, the scripts (with “rm” before “enginestop”) are as delivered by Quovadx. When I got them at the end of ’05, or early ’06, I asked the same question you had several different ways… I wanted to make sure that stopping the process in that way was the correct way to do it. Quovadx assured me it would work. They indicated they are simulating a crash (as if the server node crashed). They said the system has to work in that case anyway, so they take care of cleaning things up on the startup, and take advantage of that by “crashing” the process on the way down, considerably reducing shutdown time.
Whether or not it works as intended, and as delivered, I don't know. When I attempted to implement it last year, I never got the cluster working correctly, so I couldn't test these scripts. We have been living with a cluster that won't fallover since then, knowing we would have to manually move things if something on the prod server failed.
Now we are moving to new hardware, OS, HA, and Cloverleaf, and I have the opportunity to really test things.
John,
I am using /etc/hosts, but not services. I am manually keying in the ports, and using the labels listed from the “List” button (pulled from /etc/hosts) for the client server name. I think that might be part of the problem.
New found info.
I turned up the EO config (engine output logging) for the entire process. When the process starts after an RG (resource group) start, the protocol thread attempts connection to x.x.x.62. When the process starts from the GUI, the protocol thread attempts connection to x.x.x.82.
Here’s the rub…
My current real prod/test servers use x.x.x.61 and x.x.x.62, and are named qdxprod and qdxtest in their own /etc/hosts and in DNS.
My new servers use x.x.x.81 and x.x.x.82 for the service labels qdxtest and qdxprod.
I have set my NSORDER in /etc/environment
NSORDER=local,bind,nis
so resolution SHOULD go to /etc/hosts first. Why or how the process is getting bind (DNS) over local name resolution when the HACMP scripts start the process, I don't know.
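As a quick cross-check (a sketch; qdxtest is the label from this thread, and the AIX host command honors NSORDER), you can compare what /etc/hosts says with what the resolver actually returns:

```shell
#!/bin/sh
# Print the NSORDER in effect, the /etc/hosts entry for a label, and
# the resolver's answer. If the last two disagree, bind (DNS) is
# winning over local resolution.
check_label() {
    label="$1"
    echo "NSORDER=${NSORDER:-unset}"
    awk -v l="$label" '$0 !~ /^#/ {
        for (i = 2; i <= NF; i++) if ($i == l) print "hosts file:", $1
    }' /etc/hosts
    host "$label" 2>/dev/null || echo "resolver lookup for $label failed"
}
check_label "${1:-qdxtest}"
```

Running it once from the command line and once from the HA start script would show whether the two contexts resolve the label differently.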
Things I have tried since my initial post:
1. Client thread IP address changed from qdxtest to x.x.x.82: Success
2. Modify the /etc/hosts entry for x.x.x.82 to be "x.x.x.82 qdxtest qdxtodd", change client thread IP address to qdxtodd (qdxtodd doesn't exist in DNS): Success
Things I will be trying:
1. Reboot nodes to ensure local,bind,nis is actually set (not sure when I made that change and if I rebooted since then).
2. Change the HACMP startup scripts delivered by Quovadx from
su hci -c "hcienginerun…"
to
su - hci -c "hcienginerun…"
3. Call support and see if I can make David say…. hmmmm.
Huzzah!!
I found the bandaid that works.
First off, using su - hci just breaks more things. It allowed things to work for this connection (which happens to be in the current default site), but every process in non-default sites didn't even come up. Using su - hci caused the $HCISITEDIR variable to get reset to default, and the process couldn't be found.
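The difference can be demonstrated without Cloverleaf at all (a stand-in demo; the paths are made up, and the throwaway profile plays the role of hci's .profile): an exported variable survives a plain child shell, but a login shell re-sources the user's profile, which can clobber it.

```shell
#!/bin/sh
# Stand-in for `su hci -c` vs `su - hci -c`.
HCISITEDIR=/hci/sites/site_b
export HCISITEDIR

# `su hci -c ...` behaves like this: the caller's environment survives.
sh -c 'echo "plain shell: $HCISITEDIR"'

# `su - hci -c ...` behaves like this: a profile runs first and can
# reset HCISITEDIR back to the default site.
profile=$(mktemp)
echo 'HCISITEDIR=/hci/sites/default; export HCISITEDIR' > "$profile"
PROFILE="$profile" sh -c '. "$PROFILE"; echo "login shell: $HCISITEDIR"'
rm -f "$profile"
```

That matches what I saw: the HA script exports $HCISITEDIR for the site it is starting, and the `-` wipes it back to the default site.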
That’s when the LED lit over my head =)
Set the env variable NSORDER in the hci.profile file used by the HA scripts. All works as expected now.
So, now I have to find out why AIX doesn't use my NSORDER env variable listed in /etc/environment and /etc/rc.tcpip when a "su hci -c .." is issued. It's not a huge deal now, because when we go live it should work (DNS will equal x.x.x.61 or .62 as required), and the NSORDER in hci.profile should suffice for my testing.
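For reference, the whole workaround amounts to a two-line addition to hci.profile (or whichever profile your HA scripts actually source; the value is the same ordering set system-wide above):

```shell
# In hci.profile, sourced by the HA start scripts, so the setting
# survives a plain "su hci -c ...":
NSORDER=local,bind,nis
export NSORDER
```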