Protocol reconnect fail after HA resource group stop/start


  • #49423
    Todd Lundstedt
    Participant

      Cloverleaf 5.5 rev1

      AIX 5.3

      HACMP 5.3

      LPAR on each of two p550 servers, ESS disk, SDD, etc etc.

    Viewing 4 reply threads
      • #61891
        Michael Hertel
        Participant

          I could be wrong, but doesn’t hcienginestop use pid and/or cmd_port?

          I know the netmonitor uses cmd_port.

          So isn’t the script in the wrong order?

          ***

          rm -f $HCISITEDIR/exec/processes/$pname/cmd_port

          rm -f $HCISITEDIR/exec/processes/$pname/pid

          echo "### Stopping process '$pname' in $HACMP_HCI_ROOT/$site ### "

          echo

          hcienginestop -p $pname

          sleep 1

          ***

          Should be:

          ***

          echo "### Stopping process '$pname' in $HACMP_HCI_ROOT/$site ### "

          echo

          hcienginestop -p $pname

          sleep 1

          rm -f $HCISITEDIR/exec/processes/$pname/cmd_port

          rm -f $HCISITEDIR/exec/processes/$pname/pid

          ***

          And do you really need the rm commands?

          In my humble opinion, hcienginestop is not working because it doesn’t know which port to talk to or which pid is running. Change the order and see if things clear up.

          -mh
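A minimal sketch of the reordering suggested above: send the stop request first, then remove the leftover state files. The function name `stop_then_clean` and the `STOP_CMD` override are hypothetical, added only so the ordering can be tried without Cloverleaf installed; `hcienginestop -p` and the file paths come from the quoted script. Whether the `rm` commands are needed at all is left open in the thread.

```shell
#!/bin/sh
# Sketch only: stop a process, then clean up its state files.
# STOP_CMD is a hypothetical override; it defaults to hcienginestop.
stop_then_clean() {
    pname=$1
    procdir="$HCISITEDIR/exec/processes/$pname"

    echo "### Stopping process '$pname' ###"
    # Ask the engine to stop while cmd_port and pid still exist.
    ${STOP_CMD:-hcienginestop} -p "$pname"
    sleep 1

    # Remove the files only after the stop request has been sent.
    rm -f "$procdir/cmd_port" "$procdir/pid"
}
```

It would be called once per process, for example `stop_then_clean "$pname"` inside the loop of the HA stop script.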

        • #61892
          John Hamilton
          Participant

            Are you using the services file for the ports?

            Or are you just typing in the numbers?

            If you are using the hosts and services files, you need to make sure they are the same between systems.

            Dumb, but that got me once.

            Other than that, I would have to think about it some more.

            But this is an interesting problem.

          • #61893
            Todd Lundstedt
            Participant

              Michael and John.. thanks for replying.. here are my answers, and more info I have found…

              Michael, the scripts (with “rm” before “enginestop”) are as delivered by Quovadx.  When I got them at the end of ’05, or early ’06, I asked the same question you had several different ways… I wanted to make sure that stopping the process in that way was the correct way to do it.  Quovadx assured me it would work.  They indicated they are simulating a crash (as if the server node crashed).  They said the system has to work in that case anyway, so they take care of cleaning things up on the startup, and take advantage of that by “crashing” the process on the way down, considerably reducing shutdown time.

              Whether or not it works as intended, and as delivered, I don’t know.  When I attempted to implement last year, I never got the cluster working correctly, so I couldn’t test these scripts.  We have been living with a cluster that won’t fallover since then.. knowing we would have to manually move things if something on the prod server failed.

              Now we are moving to new hardware, OS, HA, and Cloverleaf, and I have the opportunity to really test things.

              John,

              I am using /etc/hosts, but not services.  I am manually keying in the ports, and using the labels listed from the “List” button (pulled from /etc/hosts) for the client server name.  I think that might be part of the problem.

              New found info.

              I turned up eoconfig for the entire process.  When the process starts after a RG (resource group) start, the protocol thread attempts connection to x.x.x.62.  When the process starts from the GUI, the protocol thread attempts connection to x.x.x.82.

              Here’s the rub…

              My current real prod/test servers use x.x.x.61 and x.x.x.62, and are named qdxprod and qdxtest in their own /etc/hosts and in DNS.

              My new servers use x.x.x.81 and x.x.x.82 for the service labels qdxtest and qdxprod.

              I have set my NSORDER in /etc/environment

              NSORDER=local,bind,nis

              so resolution SHOULD go to /etc/hosts first.  Why or how the process is getting bind (DNS) over local name resolution when hacmp scripts start the process, I don’t know.

              Things I have tried since my initial post:

              1.  Client thread IP address changed from qdxtest to x.x.x.82:  Success

              2.  Modify /etc/hosts entry for x.x.x.82 to be “x.x.x.82   qdxtest  qdxtodd”, change client thread IP address to qdxtodd (qdxtodd doesn’t exist in DNS): Success
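To confirm what the local hosts file alone would answer for a label or alias, independent of DNS, a lookup along these lines may help. This is a sketch under assumptions: `lookup_hosts` is a hypothetical helper, it does an exact, case-sensitive match, and it reads only the file it is given. It does not show what the resolver actually returns under a given NSORDER.

```shell
#!/bin/sh
# lookup_hosts NAME [FILE]
# Print the first address in a hosts-format file whose hostname or alias
# matches NAME exactly. FILE defaults to /etc/hosts.
lookup_hosts() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                       # drop trailing comments
        {
            for (i = 2; i <= NF; i++)            # field 1 is the address
                if ($i == name) { print $1; exit }
        }
    ' "${2:-/etc/hosts}"
}
```

For example, `lookup_hosts qdxtest` and `lookup_hosts qdxtodd` should print the same address if the alias was added to the same hosts entry; comparing that with the address seen in the engine log shows whether the process resolved the name locally or through DNS.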

              Things I will be trying:

              1.  Reboot nodes to ensure local,bind,nis is actually set (not sure when I made that change and if I rebooted since then).

              2.  Change HACMP startup scripts delivered by Quovadx

              su hci -c "hcienginerun…"

              to

              su - hci -c "hcienginerun…"

              3.  Call support and see if I can make David say…. hmmmm.

            • #61894
              Todd Lundstedt
              Participant

                Huzzah!!

                I found the bandaid that works.

                First off, using su - hci just breaks more things.  It allowed things to work for this connection (which happens to be in the current default site), but every process in non-default sites didn’t even come up.  Using su - hci caused the $HCISITEDIR variable to get reset to default, and the process couldn’t be found.

                That’s when the LED lit over my head =)

                Set the env variable NSORDER in the hci.profile file used by the HA scripts.  All works as expected now.
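A sketch of what that addition might look like, assuming hci.profile is a Bourne-style shell file sourced by the HA scripts before `su hci -c` runs. The value is the one quoted earlier in the thread; the last line is only a hypothetical check that a child process inherits it.

```shell
# Assumed addition to the hci.profile used by the HA scripts.
# Prefer local /etc/hosts lookups over DNS (bind), then NIS.
NSORDER=local,bind,nis
export NSORDER

# Optional check: a child shell should see the exported value.
sh -c 'echo "NSORDER=${NSORDER:-unset}"'
```

Exporting the variable matters here because the engine is started as a child of the script, and an unexported shell variable would not be passed down to it.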

                So, now I have to find out why AIX doesn’t use my NSORDER env variable listed in /etc/environment and /etc/rc.tcpip when a "su hci -c ..." is issued.  It’s not a huge deal now, because when we go live, it should work (DNS will equal x.x.x.61 or .62 as required), and the NSORDER in hci.profile should suffice for my testing.

              • #61895
                Michael Hertel
                Participant

                  :mrgreen:  HIGH FIVE!  :mrgreen:

              • The forum ‘Cloverleaf’ is closed to new topics and replies.