We have about 70 production sites and have set up our (AIX HACMP) environment to auto stop and auto start all sites as appropriate. Our start scripts do some database checks, start the site and then report on any issues in the log files.
Somehow, the SAN disc was unmounted with the Cloverleaf sites running! I was called up to ‘help’ Tech Services and logged in to find that our disc was gone!
1. A ‘problem’. After a server bounce and a Cloverleaf start we had a number of ‘database’ issues.
siteLogChk.sh: [eds_prod_qe] Checking for occurrences of ‘:dbi :ERR’ 2
[Log file entry …. Db_Vista database error -902: ‘SYSTEM/OS error: -902
page fault
C errno = 0
C errno = 0’]
These didn’t seem to cause any issues, but weren’t fixed by the normal database ‘fix’ scripts. They required a site stop, a drain of messages and a database re-init.
2. A ‘funny’ error was a site that would not start because the engine watch process was reported as active. This turned out to be a ‘simple’ case of a stale enginewatch pid file (wpid) containing the pid of some other active process!
3. The ‘best’ was a site that continually crashed on startup because there were two (2) State 14 messages for the same port! I tracked one of the messages down and found it had been successfully ACKed previously, so in the end it was not an issue.
4. A ‘not so funny’ error where a message appears to have been ‘corrupted’, so it contained information from other messages!
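
For the pid file problem in item 2, a start script can guard against a recycled pid by checking what the pid actually belongs to, not just whether it is alive. The following is only a rough sketch of that idea, not what our scripts or Cloverleaf itself do: the function name pid_is_stale, the pid file path and the process name passed to it are all made up for illustration, and it assumes a POSIX-style ps that supports -p and -o comm=.

```shell
#!/bin/sh
# pid_is_stale PIDFILE EXPECTED_NAME   (hypothetical helper, not part of Cloverleaf)
# Returns 0 (stale) when the pid file is missing or unreadable, holds no valid pid,
# the pid is not running, or the pid belongs to a process whose command
# name does not contain EXPECTED_NAME.
# Returns 1 when the pid looks like a live process with the expected name.
pid_is_stale() {
    pidfile=$1
    expected=$2

    # No readable pid file: nothing is holding the lock.
    [ -r "$pidfile" ] || return 0

    # Take the first line and strip whitespace.
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')

    # Empty or non-numeric content is treated as stale.
    case $pid in
        ''|*[!0-9]*) return 0 ;;
    esac

    # Ask ps for the command name of that pid; empty output means no such process.
    comm=$(ps -p "$pid" -o comm= 2>/dev/null)
    [ -n "$comm" ] || return 0

    # The pid is alive: only trust it if it is the process we expect.
    case $comm in
        *"$expected"*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example use (path and process name are placeholders):
#   if pid_is_stale /path/to/site/wpid some_watch_process; then
#       rm -f /path/to/site/wpid
#   fi
```

Comparing the command name is a cheap heuristic only. It will still be fooled if the recycled pid happens to land on another process with a matching name, so treat it as a first check and not a guarantee.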