Unable to open up Gui/network monitor after failover


  • Creator
    Topic
  • #48196
    Roushanak Sedghi
    Participant

    Hi All,

    QDX 5.2.1 P , AIX 5.2

    We ran our failover test a few days ago. Our system administrator added the primary box’s IP address as a virtual address on the failover server, so that any network traffic destined for the primary server would hit the failover server. At the time of failover we took the primary server down and brought up the failover server. Everything went well, except that after migrating to the failover server we were no longer able to start the network monitor.

    I reset the host server and checked its status on the failover server (which had the primary server’s IP address). It was running, but we kept on getting the same error

Viewing 11 reply threads
  • Author
    Replies
    • #57975
      Wayne Ladewski
      Participant

      We had the same issue. We shut down the Cloverleaf host server service, changed the /etc/hosts file to match the new server IP address and name, then changed the server IP address via SMIT and rebooted. Everything works great, but we can’t get the GUI to reconnect.

    • #57976
      Peter Heggie
      Participant

      Has anyone had a similar problem? We just went through a controlled fail-over on AIX 6.1 tl2 running Cloverleaf 5.8.4. We had to fail-back without performing the maintenance we wanted to.

      Our setup is active/active, meaning that we have two physical servers, one Live, one Test, the hci software filesystem is local to each server, and it contains links pointing to the sites directories, which are located in another file system that fails over.

      The symptoms are the same (after we looked back later) – everything failed over successfully and quickly except that the monitor would not start. We tried to start it manually several times but each time we got:

      [pti :sign:WARN/0:_hcimonitord_:04/15/2013 13:03:22] Thread 0 ( _hcimonitord_ ) received signal 11

      [pti :sign:WARN/0:_hcimonitord_:04/15/2013 13:03:22] PC = 0xd41f7e44

      We were failing over the production system to the test server. The test server was slightly different than the production server in that a master site was defined in the test server, and that we had installed cis6.0 on the test server only.

      Does anyone have HACMP with one side having a master site, and the other side not having a master site? Where does the master site configuration get stored, and is that taken into consideration in the HACMP / profile scripts?

      The last fail-over that worked fine in all regards was done before we set up a master site and before we installed cis6.0.

      Peter

      Peter Heggie

    • #57977
      Richard Hart
      Participant

      Hi Peter.

      Our setup has each developer logging in with their own username, and I have seen this when one user has selected a Cloverleaf site that belongs to another user.

    • #57978
      Russ Ross
      Participant

      To get around the problem with the IDE not being able to talk to the hostserver in a fail-over setup, I added the following entries at the bottom of my $HCIROOT/server/server.ini file running on the active AIX server:

      Code:

      [firewall]
      rmi_exported_server_port=dopivhub

      Our HA setup is active/passive, and dopivhub is the HA cluster name of the virtual IP address of my production HA cluster, which consists of 2 physical nodes (dopilhub1a & dopilhub1b).

      When I launch the IDE, I always tell it to talk to dopivhub regardless of which physical node the production server is running on.

      NOTE:

      Russ Ross
      RussRoss318@gmail.com

    • #57979
      Peter Heggie
      Participant

      thank you both.

      Richard, we do not have a user_xx entry in the client.ini file. I don’t know how that works. We each start the client gui with our own AD / network userid; when on the AIX server, we all login with hci.

      I believe the problem was that the monitor daemon would not start; we could connect to any site we wanted to but there was no status of processes or threads available.

      Russ – I understand what you mean about the physical and service addresses of the cluster; all of our configuration uses the service address, but perhaps more configuration is needed to ensure the outbound traffic uses it also.

      I just don’t know why the monitor daemon kept failing with a sig11. We have successfully failed over several times in the last three years with the same configuration noted above, except that now we have the new master site on Test.

      Also, I forgot to add a second difference – on our test server, we added a new site recently, but did not locate it in the SAN file system, with a link from /integrator; instead we installed it directly in /integrator. I know a lot of the HACMP scripts in .cluster iterate through the sites and maybe having a site directory there was a wrench in the works?

      Peter Heggie

    • #57980
      Russ Ross
      Participant

      If you are just having trouble launching the net monitor after an HA fail-over, make sure any leftover pid files in each site ($HCISITEDIR/exec/hcimonitord/pid) were cleaned up before you try to start the net monitor.

      In fact, it is probably a good idea to check for leftover pid files in each site after a fail-over, before launching anything:

      cd $HCISITEDIR; find . -name pid 2>/dev/null
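The find command above only lists pid files; it does not say whether the process behind each one is still running. A minimal sketch of that extra check follows. It assumes each pid file holds a numeric process ID on its first line, and the function name check_stale_pids is hypothetical, not part of Cloverleaf.

```shell
#!/bin/sh
# check_stale_pids (hypothetical helper): list every file named "pid" under
# the given site directory and report whether its process is still running.
# Assumption: each pid file holds a numeric process ID on its first line.
check_stale_pids() {
    site_dir=$1
    find "$site_dir" -name pid -type f 2>/dev/null | while IFS= read -r pidfile; do
        pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
        case $pid in
            ''|*[!0-9]*) echo "UNREADABLE $pidfile"; continue ;;
        esac
        # kill -0 sends no signal; it only tests whether the process can be
        # signalled. It also fails for processes owned by another user.
        if kill -0 "$pid" 2>/dev/null; then
            echo "ALIVE $pid $pidfile"
        else
            echo "STALE $pid $pidfile"
        fi
    done
}
```

Called as check_stale_pids "$HCISITEDIR", it prints one line per pid file. Only the files reported as STALE would be candidates for removal before starting the monitor.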

      Russ Ross
      RussRoss318@gmail.com

    • #57981
      Peter Heggie
      Participant

      thank you, we will do that.

      Now we are still having problems, even though we failed back. On the test server, whenever someone changes site, or just clicks on Server / Change, and clicks OK when warned about screens closing, we get errors in the /integrator/server/logs/server.log:

      lclsrvr:RMI TCP Connection(127)-10.51.2.135: ERR: Unable to read ini Environ:  /hci/cis5.8/integrator/cmonlprd

      lclsrvr:RMI TCP Connection(127)-10.51.2.135: ERR: com.hie.cloverleaf.util.SiteException: /hci/cis5.8/integrator/cmonlprd/siteInfo  doesnt exist

      We get one of these messages for each site that has a link in /integrator but points to a file system that is not connected. That is, on the test server we have both prod and test links. The test links are valid, of course, and point to a file system that is mounted on the test server. The links that point to a prod file system are not valid, because the prod file system is not mounted, which is correct. But why do we get these error messages? It was not happening before we attempted the failover yesterday.

      Peter Heggie

    • #57982
      Peter Heggie
      Participant

      Now I can’t remember if we actually had prod links on the test server before this. Maybe we only had test links on the test server and prod links on the prod server. I really don’t remember. Certainly right now we only have prod links on the prod server, not both sets. So maybe the failover – failback did not delete the prod links on the test server, and it should have??

      Peter Heggie

    • #57983
      Peter Heggie
      Participant

      I’m digging through this on my own right now, but it seems that whatever sites are listed in …/server/server.ini under the keyword ‘environs=’ are checked whenever the server is changed in the GUI client. We have production sites listed in there alongside the test sites, so when it checks those prod sites it does not find the siteInfo file (because the prod sites’ file system is not mounted), and therefore we get those error messages.

      All that makes sense, but the HACMP cluster script ‘hci.node1.stop’ executes a Perl script that updates that server.ini to set the ‘environs=’ value. I expected that after that script ran, the environs variable would contain the correct sites, but it did not.

      Code:

      node2sitelist=`ls /test/cis5.8/*/NetConfig | awk -F / '{print $4}'`
      # leaving node2, remove node1 sites from server.ini on node2
      for hasite in $node2sitelist; do
          perl $SCRIPTDIR/hci.server.ini $HACMP_HCI_ROOT $hasite
      done

      The script I think is supposed to take the name of the site being passed to it and add it to the environs variable if it is not already there. The output from that script is:

      List found keep original

      List found keep original

      List found keep original

      List found keep original

      There are four test sites to pass in to this script, so the output matches what should happen. However, there seems to be nothing in the parent script that deletes the environs= entry in the first place, so it would still contain the prod sites even after a fail back. So it will always have both prod and test sites in that variable and we would always get those errors in the server.log file.

      Does that make sense?

      Is something supposed to clear out the environs variable before this starts?
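One way to see which environs entries would trigger the "siteInfo doesnt exist" errors is to walk the list and test each path. This is a rough sketch only: it assumes server.ini has a single "environs=" line holding a semicolon-separated list of site paths (the shape the site-init scripts appear to write), and list_missing_sites is a hypothetical helper name.

```shell
#!/bin/sh
# list_missing_sites (hypothetical helper): print every path listed in the
# environs= line of the given server.ini that has no siteInfo file behind it.
# Assumption: one "environs=" line, entries separated by semicolons.
list_missing_sites() {
    ini=$1
    sed -n 's/^environs=//p' "$ini" | tr ';' '\n' | while IFS= read -r site; do
        # Skip the empty entry produced by a trailing semicolon.
        [ -n "$site" ] || continue
        [ -f "$site/siteInfo" ] || echo "$site"
    done
}
```

Running list_missing_sites "$HCIROOT/server/server.ini" on the test node after a fail-back should then print the prod sites whose file system is not mounted, if the explanation above is right.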

      Peter Heggie

    • #57984
      Peter Heggie
      Participant

      We backed out the two items we thought contributed to our failover problem – we removed the master site and moved our newly created site to the correct SAN location with a link in the /integrator directory.

      We failed over and still had the same problem; everything was running except for the monitor, which immediately failed with sig 11. We failed back and will leave it there for a few days.

      We found a script in the HACMP /home/hci/.cluster directory called hasiteinit which seems to be a replacement for hcisiteinit, to be used for creating new sites when under the umbrella of HACMP. We did not use this script and it may be a cause of this problem; we don’t know. We are waiting for feedback from our favorite consultant.

      Not sure what could cause a sig 11.

      Peter Heggie

    • #57985
      Peter Heggie
      Participant

      I think it was the hcisiteinit vs. hasiteinit that caused the problem of the error messages in ../server/logs/server.log. The code is different where they update the server.ini:

      hcisiteinit:

      Code:

      $newPath = "$root/$newName";

      $search = $newPath;
      $search =~ s/\//\\\//g;
      $search =~ s/\./\\./g;
      $search =~ s/:/\\:/g;

      $inistring = &getINIFile($serverini);
      $inistring =~ s/$search;//g;
      $inistring =~ s/environs=(.*)/environs=$newPath;$1/g;
      &putINIFile($serverini, $inistring);

      hasiteinit:

      Code:

      $newPath = "$ARGV[1]/$newName";

      $search = $newPath;
      $search =~ s/\//\\\//g;
      $search =~ s/\./\\./g;
      $search =~ s/:/\\:/g;

      $inistring = &getINIFile($serverini);
      $inistring =~ s/$search;//g;
      $inistring =~ s/environs=(.*)/environs=$root\/$newName;$1/g;
      #$inistring =~ s/environs=(.*)/environs=$newPath;$1/g;
      &putINIFile($serverini, $inistring);

      Also, hasiteinit has extra code for the alert file, and I’m not sure if this could have messed up the monitor:

      Code:

      # We need to fix the default alert file to look at the right $QUOVADX_INSTALL_DIR

      local ($alertFile) = "$root/$newName/Alerts/default.alrt";

      rename("$alertFile", "$alertFile.old");
      open (OLDALERT, "<$alertFile.old");
      open (NEWALERT, ">$alertFile");
      while (<OLDALERT>) {
          chop;
          $_ =~ s/\$QUOVADX_INSTALL_DIR/$ENV{'QUOVADX_INSTALL_DIR'}/g;
          print NEWALERT "$_\n";
      }
      close (OLDALERT);
      close (NEWALERT);
      unlink ("$alertFile.old");

      If I read this right, it renames the default.alrt file to .old and then creates a new file from the old one, substituting a different install path wherever it is used. But I’m looking at our default.alrt files, and none of them specify the install directory. Does anyone know why this is necessary? Or whether any monitoring-related configuration could be corrupted by the creation of a new site as part of a failover?
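For comparison, the same rename-and-rewrite step can be sketched in shell. This assumes the intent is to replace the literal text $QUOVADX_INSTALL_DIR in the alert file with an actual directory; expand_install_dir is a hypothetical name and this is not the vendor's code.

```shell
#!/bin/sh
# expand_install_dir (hypothetical helper): replace the literal placeholder
# $QUOVADX_INSTALL_DIR in a file with the given directory, going through a
# temporary ".old" copy the way the hasiteinit fragment appears to.
expand_install_dir() {
    file=$1
    value=$2
    # Escape characters that are special in a sed replacement: \ & and the
    # | delimiter used below.
    esc=$(printf '%s' "$value" | sed 's/[\\&|]/\\&/g')
    mv "$file" "$file.old" || return 1
    # \$ in the pattern matches a literal dollar sign, not end of line.
    sed "s|\\\$QUOVADX_INSTALL_DIR|$esc|g" "$file.old" > "$file" &&
        rm -f "$file.old"
}
```

If a default.alrt file does not contain the placeholder at all, as described above, the rewrite leaves its content unchanged. A quick grep for QUOVADX_INSTALL_DIR in each site's Alerts directory would confirm that.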

      Peter Heggie

    • #57986
      Peter Heggie
      Participant

      Goutham straightened us out on site creation under HA. He also narrowed down our monitor failure to an entry in the Alerts file: there was an entry with ‘ALERT’ in the email text. There is another thread on Clovertech that mentions this problem. However, it is working fine in production. When we fail over onto an OS running AIX v6.1 TL7, it fails. What is also odd is that if we deactivate the alert preceding it, the monitor works fine, even though ALERT is in the next entry. We are going forward with CL6.0.

      Peter Heggie
