Cloverleaf 5.3
AIX 5.2
HACMP 5.2
p650 2×1.45GHz 4GB memory
IBM LPAR
IBM ESS SAN
O/S installed on local drive
Cloverleaf installed on the ESS disks
Last week, we had a hardware problem with a Cisco Fibre Channel switch that sits between our UNIX box and the ESS disks. When the switch failed in such a way that it stopped talking to the disks, all our interface processes kept on going!?! They kept on going for hours; one kept on going for 4 hours. How is that possible if we did not have any physical connection to our ESS disks?!? The ESS disks are where our application is located (/qdx/qdx5.3/integrator) and where Cloverleaf writes all its process logs, and we also have the option “use recovery database” turned on for all of our processes!?! The processes eventually crashed, and at one point we rebooted the server. The switch was then replaced, and once we regained access to the ESS disks we saw that nothing had been written to disk for a long time before our processes crashed.
We suspected that the switch was corrupted to the point of sending bad acknowledgments to the O/S, which in turn told Cloverleaf that the writes were successful. We got the switch fixed and tried to reproduce the scenario by unplugging the Fibre Channel cables while we had processes running, and the same thing happened … the processes kept on running.
When we discussed the experience with IBM, they told us that it is our application that does not commit all of its write I/O requests to disk. CAN THAT BE? I have always trusted Cloverleaf to commit every message to the recovery database on disk at every step along the way. Even our logs are gone. 😯
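To make IBM's point concrete, here is a minimal Python sketch of the general UNIX behaviour they seem to be describing. It is not Cloverleaf code, and I do not know how Cloverleaf actually opens or flushes its files; the function names `write_buffered` and `write_committed` are made up for illustration. On a typical UNIX system a plain write returns as soon as the O/S has accepted the data into its file cache, so the application only finds out about a dead disk path if it explicitly asks for the data to be committed, for example with fsync.

```python
import os
import tempfile


def write_buffered(path, data):
    """Plain write: returns once the O/S has accepted the data.

    Success here does NOT prove the data reached the physical disk.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # For a small write to a regular file this normally writes everything.
        return os.write(fd, data)
    finally:
        os.close(fd)


def write_committed(path, data):
    """Write, then ask the O/S to push the data to the storage device.

    os.fsync() blocks until the O/S reports the flush as done and raises
    OSError if the flush fails, so a lost disk path can surface here.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = os.write(fd, data)
        os.fsync(fd)
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical file names in a scratch directory, for demonstration only.
    with tempfile.TemporaryDirectory() as scratch:
        print(write_buffered(os.path.join(scratch, "buffered.log"), b"msg 1\n"))
        print(write_committed(os.path.join(scratch, "committed.log"), b"msg 1\n"))
```

If an application only ever does the first kind of write, it could plausibly keep running for a while after the path to the disks is lost, which would match what we saw. Whether that is really what Cloverleaf does is exactly my question.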
Has anybody experienced this kind of problem? Could someone from QDX comment on the response I got from IBM?
Sorry for the long explanation!
Thank you for your time.
Andre Duguay
McGill University Health Centre