We are on AIX 5.1L, using a SAN in a mirrored config (one local drive, one SAN drive).
We *have* had losses of the SAN copy, including one unplanned, due to SAN maintenance. AIX resyncs the mirror in the background when the SAN is restored. This eliminates the concern that SAN MTTF (mean time to failure) and MTTR (mean time to recover) might be worse than that of local hardware.
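For anyone curious, the resync after a SAN outage is just standard AIX LVM; a sketch (the volume group and hdisk names here are assumptions, substitute your own):

```shell
# After the SAN path returns, stale mirror copies show up in LVM status
lsvg -l appvg        # LV state reads "stale" for the SAN-side copy
lspv hdisk2          # STALE PARTITIONS count is non-zero

# Kick off (or just let AIX finish) the background resync of stale copies
varyonvg appvg       # re-varyon picks the returned disk back up
syncvg -v appvg      # synchronize all stale partitions in the VG
```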
The database corruption point is a good one, although I still see some interest locally in using the shared SAN disk as a quick recovery method.
What we have done instead is:
(1) acquire identical HW platforms (beefed-up p615's, incl. a total of 4 disks);
(2) mirror rootvg locally;
(3) mirror a separate appvg, one copy local, one on the SAN;
(4) retain the 4th drive for future OS upgrades using alt_disk_install;
(5) acquire dual HBAs for the primary platform and load an auto-failover driver (Hitachi HDLM).
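For reference, steps (2) through (4) are plain AIX LVM and alt_disk_install work. A rough sketch, with an assumed disk layout (hdisk0/hdisk1/hdisk2 local, hdisk3 the spare, hdisk4 the SAN LUN) that may not match yours:

```shell
# (2) Mirror rootvg across the two local drives
extendvg rootvg hdisk1
mirrorvg rootvg hdisk1
bosboot -ad /dev/hdisk1          # make the second copy bootable
bootlist -m normal hdisk0 hdisk1

# (3) appvg mirrored one local, one SAN
mkvg -y appvg hdisk2
extendvg appvg hdisk4
mirrorvg appvg hdisk4

# (4) keep the 4th local drive free; at upgrade time, clone rootvg
# onto it and upgrade the clone, leaving a bootable fallback
alt_disk_install -C hdisk3
```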
(We have license keys, tied to systemID already in place on both platforms.)
In addition, we replicate our $CLROOT/production directory from our primary platform to our secondary platform nightly. This does not conflict with the secondary platform’s role as a development/test machine, as that work is done in $CLROOT/test.
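The nightly replication is nothing exotic; a sketch of the sort of cron job we mean, assuming rsync over the network and a hypothetical host name `secondary` (our actual paths and tooling differ):

```shell
# root crontab entry on the primary:
# 0 2 * * * /usr/local/bin/sync_clroot.sh

#!/bin/sh
# sync_clroot.sh -- replicate $CLROOT/production only. The secondary's
# dev/test work lives in $CLROOT/test and is never touched by this.
CLROOT=${CLROOT:-/opt/app}       # assumed default; set to your real root
rsync -a --delete "$CLROOT/production/" \
      secondary:"$CLROOT/production/"
```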
We may in the future (*in addition to*, not in place of, our current config) add another SAN disk to our primary's mirroring, with the idea of fast-failing it, and the app, over to the secondary machine.
We expect that this will require additional software from QVDX as well, plus a proof-of-concept of the same.
Remember that moving the SAN disk from one platform to another does nothing for messages backed up in memory-only queues, so you need either to store *all* messages to disk (correct me if I'm wrong), or to have one slick retransmit capability with all your ancillaries: preferably automated, but at a minimum with automatic detection and removal of duplicated transactions.
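On the duplicate-removal point: at its simplest, filtering replayed transactions after a retransmit is just keying on a transaction ID you have already processed. A toy sketch (assumes, hypothetically, that the first field of each message line is a unique transaction ID):

```shell
# Drop any message whose transaction ID (field 1) was already seen.
# awk's seen[] array keeps the first occurrence and filters replays.
dedupe() {
    awk '!seen[$1]++'
}

printf 'TX1 pay\nTX2 ship\nTX1 pay\nTX3 bill\n' | dedupe
# prints: TX1 pay / TX2 ship / TX3 bill (the replayed TX1 is dropped)
```

The real thing would of course persist the seen-IDs across restarts; this only shows the idea.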
Not sure what all of this would buy us, given that we can autofail our production (TCP/IP, SNA, hostnaming) from the primary to the secondary (all but application restart time) in under five minutes, without a reboot of either box.
The actual runtime from command initiation is under fifteen seconds, except for the SNA, whose routing is controlled by our z/OS, which interfaces with SMS/Siemens; that piece is more on the order of five minutes.
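The sub-fifteen-second piece is essentially re-pointing the service identity at the secondary. A much-simplified sketch of just the TCP/IP half (interface name and address are hypothetical; the real script also handles the SNA side and the app):

```shell
# On the secondary, assume the production service address as an alias.
# Clients reconnect to the same IP; neither box reboots.
ifconfig en0 alias 10.1.1.50 netmask 255.255.255.0

# Then repoint the production hostname at the secondary in whatever
# name resolution is authoritative for you (local hosts file, DNS, NIS).
```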
Ancillaries with robust socket management seem to pick this up automatically, but there are always a few stragglers that need to bounce their connection…
We like it. 8) (Up 597 days since the last in-place OS and app upgrade; 3 SAN outages, one unplanned, all due to SAN maintenance; zero HW errors; zero SW outages at the OS level.)
Dan Goodman