Unable to open up Gui/network monitor after failover

This topic has 12 replies, 5 voices, and was last updated 12 years, 3 months ago by Peter Heggie.

Creator

Topic
December 12, 2005 at 4:34 pm #48196
Roushanak Sedghi
Participant
Hi All,

QDX 5.2.1 P , AIX 5.2

We ran our failover test a few days ago. Our system administrator added the IP address of the primary box as a virtual address on the failover server, so that any network traffic destined for the primary server would hit the failover server. At the time of failover we took the primary server down and brought up the failover server. Everything went well, except that after migrating to the failover server we were no longer able to start the network monitor.

I reset the host server and checked its status on the failover server (which had the primary server’s IP address), and it was running, but we kept getting the same error.

Viewing 11 reply threads

Author

Replies
- January 28, 2013 at 3:40 pm #57975
  Wayne Ladewski
  Participant
  We had the same issue, but we shut down the Cloverleaf host server service and then changed the /etc/hosts file to match the changed server IP address and name. Then we changed the server IP address via SMIT and rebooted. Everything works great, but we can’t get the GUI to reconnect.
- April 15, 2013 at 5:00 pm #57976
  Peter Heggie
  Participant
  Has anyone had a similar problem? We just went through a controlled fail-over on AIX 6.1 tl2 running Cloverleaf 5.8.4. We had to fail-back without performing the maintenance we wanted to.
  
  Our setup is active/active, meaning that we have two physical servers, one Live, one Test, the hci software filesystem is local to each server, and it contains links pointing to the sites directories, which are located in another file system that fails over.
  
  The symptoms are the same (after we looked back later) – everything failed over successfully and quickly except that the monitor would not start. We tried to start it manually several times but each time we got:
  
  [pti :sign:WARN/0:_hcimonitord_:04/15/2013 13:03:22] Thread 0 ( _hcimonitord_ ) received signal 11
  
  [pti :sign:WARN/0:_hcimonitord_:04/15/2013 13:03:22] PC = 0xd41f7e44
  
  We were failing over the production system to the test server. The test server was slightly different than the production server in that a master site was defined in the test server, and that we had installed cis6.0 on the test server only.
  
  Does anyone have HACMP with one side having a master site, and the other side not having a master site? Where does the master site configuration get stored, and is that taken into consideration in the HACMP / profile scripts?
  
  The last fail-over that worked fine in all regards was done before we set up a master site and before we installed cis6.0.
  
  Peter
  
  Peter Heggie
  PeterHeggie@crouse.org
- April 16, 2013 at 12:04 am #57977
  Richard Hart
  Participant
  Hi Peter.
  
  Our setup has each developer logging in with their own username, and I have seen this when one user has selected the Cloverleaf site that belongs to another user.
- April 16, 2013 at 11:12 am #57978
  Russ Ross
  Participant
  To get around the problem with the IDE not being able to talk to the hostserver in a fail-over setup, I added the following entries at the bottom of my $HCIROOT/server/server.ini file running on the active AIX server:
  
  Code:
  [firewall]
  rmi_exported_server_port=dopivhub
  
  Our HA setup is active/passive, and dopivhub is the HA cluster name for the virtual IP address of my production HA cluster, which consists of 2 physical nodes (dopilhub1a & dopilhub1b).
  
  When I launch the IDE, I always tell it to talk to dopivhub regardless of which physical node the production server is running on.
  
  NOTE:
  
  Russ Ross
  RussRoss318@gmail.com
- April 16, 2013 at 11:44 am #57979
  Peter Heggie
  Participant
  thank you both.
  
  Richard, we do not have a user_xx entry in the client.ini file. I don’t know how that works. We each start the client gui with our own AD / network userid; when on the AIX server, we all login with hci.
  
  I believe the problem was that the monitor daemon would not start; we could connect to any site we wanted to but there was no status of processes or threads available.
  
  Russ – I understand what you mean about the physical and service addresses of the cluster; all of our configuration uses the service address, but perhaps more configuration is needed to ensure the outbound traffic uses it also.
  
  I just don’t know why the monitor daemon kept failing with a sig11. We have successfully failed over several times in the last three years with the same configuration noted above, except that now we have the new master site on Test.
  
  Also, I forgot to add a second difference – on our test server, we added a new site recently, but did not locate it in the SAN file system, with a link from /integrator; instead we installed it directly in /integrator. I know a lot of the HACMP scripts in .cluster iterate through the sites and maybe having a site directory there was a wrench in the works?
  
  Peter Heggie
  PeterHeggie@crouse.org
- April 16, 2013 at 1:50 pm #57980
  Russ Ross
  Participant
  If you are just having trouble launching the net monitor after an HA fail-over, then make sure any leftover pid files in each site ($HCISITEDIR/exec/hcimonitord/pid) got cleaned up before trying to start the net monitor.
  
  In fact, it is probably a good idea to check for leftover pid files in each site after a fail-over, before launching anything:
  
  cd $HCISITEDIR; find . -name pid 2>/dev/null
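  
  A minimal sketch of how that check could go one step further and report only the pid files whose process is gone. The function name `list_stale_pid_files` is hypothetical, and the sketch assumes each pid file holds a single numeric process ID on its first line, which is not confirmed anywhere in this thread:
  
  ```shell
  # Hypothetical helper: print every file named "pid" under a site directory
  # whose recorded process is no longer running.
  # Assumption: the first line of each pid file is a numeric process ID.
  list_stale_pid_files() {
    site_dir=$1
    find "$site_dir" -type f -name pid 2>/dev/null | while IFS= read -r pid_file; do
      pid=$(head -n 1 "$pid_file" | tr -d '[:space:]')
      case $pid in
        # Empty or non-numeric content: report the file for manual review.
        ''|*[!0-9]*) echo "$pid_file"; continue ;;
      esac
      # kill -0 sends no signal; it only tests whether the process can be signalled.
      # It also fails for a live process owned by another user, so run this as the
      # account that owns the Cloverleaf processes.
      if ! kill -0 "$pid" 2>/dev/null; then
        echo "$pid_file"
      fi
    done
  }
  ```
  
  Running `list_stale_pid_files "$HCISITEDIR"` only prints candidates; it deletes nothing, so the list can be reviewed before any file is removed.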
  
  Russ Ross
  RussRoss318@gmail.com
- April 16, 2013 at 2:43 pm #57981
  Peter Heggie
  Participant
  thank you, we will do that.
  
  Now we are still having problems, even though we failed back. On the test server, whenever someone changes site, or just clicks on Server / Change, and clicks OK when warned about screens closing, we get errors in the /integrator/server/logs/server.log:
  
  lclsrvr:RMI TCP Connection(127)-10.51.2.135: ERR: Unable to read ini Environ: /hci/cis5.8/integrator/cmonlprd
  
  lclsrvr:RMI TCP Connection(127)-10.51.2.135: ERR: com.hie.cloverleaf.util.SiteException: /hci/cis5.8/integrator/cmonlprd/siteInfo doesnt exist
  
  We get one of these messages for each site that has a link in /integrator but points to a file system that is not connected. I.E. on the test server, we have both prod and test links, and of course the test links are valid and pointing to a file system that is mounted to the test server. But the links that point to a prod file system are not valid, because the prod file system is not mounted, which is correct. But why do we get these error messages? It was not happening before we attempted the failover yesterday.
  
  Peter Heggie
  PeterHeggie@crouse.org
- April 16, 2013 at 2:48 pm #57982
  Peter Heggie
  Participant
  Now I can’t remember if we actually had prod links on the test server before this. Maybe we only had test links on the test server and prod links on the prod server. I really don’t remember. Certainly right now we only have prod links on the prod server, not both sets. So maybe the failover – failback did not delete the prod links on the test server, and it should have??
  
  Peter Heggie
  PeterHeggie@crouse.org
- April 16, 2013 at 5:38 pm #57983
  Peter Heggie
  Participant
  I’m digging through this on my own right now, but it seems that whatever sites are listed in …/server/server.ini, in the keyword ‘environs=’, are checked whenever the gui-client server is changed. We have production sites listed in there alongside the test sites, so when it checks those prod sites, it does not find the siteInfo file (because the prod sites file system is not mounted), and therefore we get those error messages.
  
  All that makes sense, but the hacmp cluster script ‘hci.node1.stop’ executes a perl script that updates that server.ini to set the ‘environs=’ value. I think that after that script ran, the value of the environs variable would contain the correct sites, but it did not.
  
  Code:
  node2sitelist=`ls /test/cis5.8/*/NetConfig | awk -F / '{print $4}'`
  # leaving node2, remove node1 sites from server.ini on node2
  for hasite in $node2sitelist; do
    perl $SCRIPTDIR/hci.server.ini $HACMP_HCI_ROOT $hasite
  done
  
  The script I think is supposed to take the name of the site being passed to it and add it to the environs variable if it is not already there. The output from that script is:
  
  List found keep original
  
  List found keep original
  
  List found keep original
  
  List found keep original
  
  There are four test sites to pass in to this script, so the output matches what should happen. However, there seems to be nothing in the parent script that deletes the environs= entry in the first place, so it would still contain the prod sites even after a fail back. So it will always have both prod and test sites in that variable and we would always get those errors in the server.log file.
  
  Does that make sense?
  
  Is something supposed to clear out the environs variable before this starts?
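  
  To see which entries are behind those server.log errors, something like the following could list the environs paths that have no siteInfo file. This is only a sketch: the function name `list_unmounted_environs` is made up, and it assumes server.ini has a single `environs=` line holding semicolon-separated site paths, which is what the hcisiteinit substitution suggests but is not verified here:
  
  ```shell
  # Hypothetical helper: print each path listed in the environs= line of
  # server.ini that has no siteInfo file (for example, an unmounted prod site).
  # Assumption: one "environs=" line with semicolon-separated site paths.
  list_unmounted_environs() {
    server_ini=$1
    sed -n 's/^environs=//p' "$server_ini" | tr ';' '\n' | while IFS= read -r site; do
      # Skip the empty field left by a trailing semicolon.
      [ -n "$site" ] || continue
      [ -f "$site/siteInfo" ] || echo "$site"
    done
  }
  ```
  
  Called as `list_unmounted_environs "$HCIROOT/server/server.ini"`, it only reads the file and changes nothing, so it is safe to run on either node before deciding whether the environs value needs to be rebuilt.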
  
  Peter Heggie
  PeterHeggie@crouse.org
- April 17, 2013 at 5:14 pm #57984
  Peter Heggie
  Participant
  We backed out the two items we thought contributed to our failover problem – we removed the master site and moved our newly created site to the correct SAN location with a link in the /integrator directory.
  
  We failed over and still had the same problem; everything was running except for the monitor, which immediately failed with sig 11. We failed back and will leave it there for a few days.
  
  We found a script in the HACMP /home/hci/.cluster directory called hasiteinit which seems to be a replacement for hcisiteinit, to be used for creating new sites when under the umbrella of HACMP. We did not use this script and it may be a cause of this problem; we don’t know. We are waiting for feedback from our favorite consultant.
  
  Not sure what could cause a sig 11.
  
  Peter Heggie
  PeterHeggie@crouse.org
- April 18, 2013 at 11:59 am #57985
  Peter Heggie
  Participant
  I think it was the hcisiteinit vs. hasiteinit that caused the problem of the error messages in ../server/logs/server.log. The code is different where they update the server.ini:
  
  hcisiteinit:
  
  Code:
  $newPath = "$root/$newName";
  $search = $newPath;
  $search =~ s//////g;
  $search =~ s/./\./g;
  $search =~ s/:/\:/g;
  $inistring = &getINIFile($serverini);
  $inistring =~ s/$search;//g;
  $inistring =~ s/environs=(.*)/environs=$newPath;$1/g;
  &putINIFile($serverini, $inistring);
  
  hasiteinit:
  
  Code:
  $newPath = "$ARGV[1]/$newName";
  $search = $newPath;
  $search =~ s//////g;
  $search =~ s/./\./g;
  $search =~ s/:/\:/g;
  $inistring = &getINIFile($serverini);
  $inistring =~ s/$search;//g;
  $inistring =~ s/environs=(.*)/environs=$root/$newName;$1/g;
  #$inistring =~ s/environs=(.*)/environs=$newPath;$1/g;
  &putINIFile($serverini, $inistring);
  
  Also, hasiteinit has extra code for the alert file, and I’m not sure if this could have messed up the monitor:
  
  Code:
  # We need to fix the default alert file to look at the right $QUOVADX_INSTALL_DIR
  local ($alertFile) = "$root/$newName/Alerts/default.alrt";
  rename("$alertFile", "$alertFile.old");
  open (OLDALERT, "$alertFile");
  while(<OLDALERT>) {
    chop;
    $_ =~ s/$QUOVADX_INSTALL_DIR/$ENV{'QUOVADX_INSTALL_DIR'}/g;
    print NEWALERT "$_\n";
  }
  close (OLDALERT);
  close (NEWALERT);
  unlink ("$alertFile.old");
  
  If I read this right, it renames the default.alrt file to .old, and then creates a new file from the old one, substituting a different install path wherever it is used. But I’m looking at our default.alrt files, and none of them specify the install directory..?? Does anyone know why this is necessary? Or whether any configuration related to monitoring could be corrupted by the creation of a new site that is part of a failover?
  
  Peter Heggie
  PeterHeggie@crouse.org
- May 13, 2013 at 5:54 pm #57986
  Peter Heggie
  Participant
  Goutham straightened us out on site creation under HA. He also narrowed down our monitor failure to an entry in the Alerts file: there was an entry with ‘ALERT’ in the email text. There is another thread on Clovertech that mentions this problem. However, the same entry works fine in production; it only fails when we fail over onto an OS running AIX v6.1 TL7. What is also odd is that if we inactivate the alert preceding it, the monitor works fine, even though ALERT is in the next entry. We are going forward with CL6.0.
  
  Peter Heggie
  PeterHeggie@crouse.org

The forum ‘Cloverleaf’ is closed to new topics and replies.