Loss of data during crash

This topic has 7 replies, 6 voices, and was last updated 20 years, 9 months ago by Dan Goodman.

Creator

Topic
May 13, 2005 at 1:49 am #47744
Andre Duguay
Participant
Hi,

Cloverleaf 5.3

AIX 5.2

HACMP 5.2

p650 2×1.45GHz 4GB memory

IBM LPAR

IBM ESS SAN

O/S installed on local drive

Cloverleaf installed on the ESS disks

Last week, we had a hardware problem with a Cisco fiber channel switch that sits between our UNIX box and the ESS disks. When the switch failed in such a way that it stopped talking to the disks, all our interface processes kept on going!?! They kept going for hours; one kept going for 4 hours. How is that possible if we did not have any physical connection to our ESS disks?!? The ESS disks are where our application is located (/qdx/qdx5.3/integrator) and where Cloverleaf writes all its process logs, and we also have the option “use recovery database” turned on for all of our processes!?! The processes eventually crashed, and at one point we rebooted the server. The switch was then replaced, and only when we regained access to the ESS disks did we see that nothing had been written to disk for a long time before our processes crashed.

We suspected that the switch was corrupted to the point of sending bad acknowledgments to the O/S, which in turn was telling Cloverleaf that the writes were successful. We got the switch fixed and tried to reproduce the scenario by unplugging the fiber channel cables while processes were running, and the same thing happened … the processes kept on running.

When we discussed the experience with IBM, they told us that it is our application that does not commit all of its write I/O requests to disk. CAN THAT BE? I have always trusted Cloverleaf to commit every message to disk in the recovery database at every step along the way. Even our logs are gone. 😯
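For readers unsure what "commit to disk" means at the system-call level, here is a minimal Python sketch of the general distinction IBM seems to be pointing at. It is not Cloverleaf code and says nothing about what the engine actually does; the function names and the file path are made up for illustration. A plain write normally returns as soon as the O/S has accepted the data into its cache, whereas an explicit fsync asks the O/S to push the data to the device before returning.

```python
import os


def write_buffered(path, data):
    """Append data with an ordinary write.

    Success here normally only means the O/S accepted the data into
    its cache; nothing forces it onto the physical disk yet.
    """
    with open(path, "ab") as f:
        f.write(data)


def write_committed(path, data):
    """Append data and explicitly ask the O/S to commit it to the device.

    flush() empties Python's own buffer; os.fsync() then asks the kernel
    to write the file's data out to storage. If the device reports a
    failure, os.fsync() raises OSError instead of returning quietly.
    """
    with open(path, "ab") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
```

An application that only does the first kind of write can keep "succeeding" for as long as the O/S has memory to cache the data, which would be consistent with processes running for hours without the disks. Whether that is what happened here is an assumption, not something this sketch can confirm.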

Has anybody experienced this kind of problem? Could someone from QDX comment on the response I got from IBM?

Sorry for the long explanation!

Thank you for your time.

Andre Duguay

McGill University Health Centre

andre.duguay@muhc.mcgill.ca

Viewing 6 reply threads

Author

Replies
- May 13, 2005 at 11:07 am #56581
  Jim Kosloskey
  Participant
  Andre,
  
  Have you contacted Cloverleaf support directly?
  
  Jim Kosloskey
  
  email: jim.kosloskey@jim-kosloskey.com 30+ years Cloverleaf, 61 years IT – old fart.
- May 13, 2005 at 11:19 am #56582
  Tom Patton
  Participant
  We are on 3.8.1 on AIX 5.1 with an EMC SAN and have seen the same thing. In fact, I ran the engine for an entire day that way 2 months ago. I don’t believe the recovery writes are committed. But both you and I have enough memory to handle messages going through without swapping to disk, which saved me some downtime.
  
  In these cases (I’ve had a couple), the SMAT files are never created, and each process directory has contained a number of 0-byte thread files that I needed to clean up once the SAN was reconnected.
  
  I was amazed, and happy, that the engine continued to run throughout the weekday without disk – that, to me, seemed to be a benefit rather than a drawback.
- May 13, 2005 at 12:54 pm #56583
  Jeff Thomas
  Participant
  I
- May 13, 2005 at 1:43 pm #56584
  Anonymous
  Participant
  Hi,
  
  We use EMC with both AIX and Sun. We had the wrong configuration in the Sun box and experienced something similar. We stopped writing to the disk but the processes continued working with the RAM.
  
  In some interfaces, we noticed that since the box was unable to write to the disk, the message was accepted but not ACK’ed. The sending application kept sending the same message again and again. Once the path to the disk was restored, all the messages in memory were saved and ACK’ed. We noticed this because we found duplicate charges.
  
  We also found that when the EMC team updates drivers or does some maintenance, our Sun box is disconnected from the SAN for several seconds (sometimes up to 15 seconds). This is not a problem unless you have an interface that expects an ACK back in less than 15 seconds.
  
  We monitor the disconnects with a script that “touches” a file on the SAN every 10 seconds. The touch command is timed, and if it takes too long (more than 2 seconds) a page is issued. We were able to catch some instances where the paths to the disk were failing, which helped the support group diagnose the problem. Now we have the right configuration and the EMC admin team knows that we are notified… The number of short outages has dropped drastically too 😉 .
- May 13, 2005 at 5:46 pm #56585
  Andre Duguay
  Participant
  Do we have redundant paths? We have 2 fibre channel adapters in each server, but they both go to the same single switch. This is not our final configuration; we will eventually have 2 switches.
  
  Have I contacted Quovadx? Yes, they say that the writes are committed to disk 😕
  
  ❓ There is a parameter at the process level called “Disk-based Queueing” … we’re not using that. Would someone know what that would change in this scenario?
  
  Thanks
  
  Andre Duguay
- May 13, 2005 at 6:54 pm #56586
  Andre Duguay
  Participant
  Hi,
  
  Sorry, I should have searched the forum for disk-based queueing before asking the question … I just found a very clear description by Rob Abbott in the topic called “Disk based queuing 5.2 – set by default” initiated by Jim Kosloskey.
  
  We will schedule another test with this on. We will then have to determine the trade-off between performance and committed writes.
  
  😕 From the comments I have seen, I do not seem to have the same concerns as the majority … is there something I don’t understand?
  
  For me, it is important to have on disk an exact picture of all messages in their current state at every moment, so that when I have a problem with my primary server (a crash with loss of memory), I can fail over to my backup and resume processing from what I see in the recovery database. If what is in the recovery database is an image from hours ago, I will resend messages that I have already sent … readmitting patients, re-ordering tests, etc., which would lead to chaos. I know performance is important, but it should come second to the integrity of the delivery.
  
  Thanks,
  
  Andre Duguay
- June 1, 2005 at 5:52 pm #56587
  Dan Goodman
  Participant
  Sort of tangential to this discussion, but we elected to mirror a local physical drive and a SAN drive. When a SAN outage occurs, AIX issues an error report (errpt command), and the SAN side of the mirror goes stale, but the data is captured locally.
  
  When the SAN connection is restored, AIX (5.2) resyncs the SAN disk from the up-to-date local disk.
  
  Sometimes a belt and suspenders isn’t enough, and a piece of twine is all you have left to hang on to. Overall, the SAN is more robust when operational, but is subject to more frequent maintenance outages than local disk.
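
The timed "touch" check described in reply #56584 above (touch a file on the SAN every 10 seconds, page if it takes more than 2 seconds) could be sketched roughly as follows in Python. This is only an illustration under assumptions: the original script was not posted, the function names are hypothetical, and the `alert` callback stands in for whatever paging mechanism a site really uses. The touch runs in a worker thread so that a hung SAN path cannot hang the monitor itself.

```python
import os
import threading
import time


def timed_touch(path, timeout=2.0):
    """Touch `path` in a worker thread.

    Returns the elapsed seconds if the touch finished within `timeout`,
    or None if it failed or was still blocked when the timeout expired.
    """
    result = {}

    def worker():
        start = time.monotonic()
        try:
            with open(path, "a"):      # create the file if it is missing
                os.utime(path, None)   # update its timestamps, like touch
            result["elapsed"] = time.monotonic() - start
        except OSError as exc:
            result["error"] = exc

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive() or "error" in result:
        return None
    return result["elapsed"]


def monitor(path, interval=10.0, timeout=2.0, alert=print, cycles=None):
    """Run the check every `interval` seconds; call `alert` on a slow or failed touch.

    `cycles=None` means run until interrupted (an assumption made for this sketch).
    """
    n = 0
    while cycles is None or n < cycles:
        if timed_touch(path, timeout) is None:
            alert(f"SAN check failed or exceeded {timeout}s for {path}")
        time.sleep(interval)
        n += 1
```

A call such as `monitor("/san/heartbeat")` (the path is made up) would mirror the 10-second / 2-second settings mentioned in the reply. Given what this thread describes, a plain touch might be satisfied from the O/S cache while the SAN is unreachable, so adding an `os.fsync()` on the file inside `worker` may be worth testing; that is a guess, not something the original poster reported doing.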