Loss of data during crash


  • Creator
    Topic
  • #47744
    Andre Duguay
    Participant

      Hi,

      Cloverleaf 5.3

      AIX 5.2

      HACMP 5.2

      p650 2×1.45GHz 4GB memory

      IBM LPAR

      IBM ESS SAN

      O/S installed on local drive

      Cloverleaf installed on the ESS disks

      Last week, we had a hardware problem with a Cisco fibre channel switch that sits between our UNIX box and the ESS disks. When the switch failed in a way that stopped it talking to the disks, all our interface processes kept on going!?! They kept going for hours; one ran for 4 hours. How is that possible if we did not have any physical connection to our ESS disks? The ESS disks are where our application lives (/qdx/qdx5.3/integrator), where Cloverleaf writes all its process logs, and we also have the option “use recovery database” turned on for all of our processes.

      The processes eventually crashed, and at one point we rebooted the server. The switch was then replaced, and it was only when we regained access to the ESS disks that we saw that writing to disk had stopped a long time before our processes crashed.

      We suspected that the switch had failed in a way that it was sending bad acknowledgments to the O/S, which in turn was telling Cloverleaf that the writes were successful. We got the switch fixed and tried to mimic the same scenario by unplugging the fibre channel cables while processes were running, and the same thing happened … the processes kept on running.

      When we discussed the experience with IBM, they told us it is our application that does not commit its write I/O requests to disk. CAN THAT BE? I have always trusted Cloverleaf to commit every message to disk in the recovery database at every step along the way. Even our logs are gone.  😯
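      As I understand IBM’s point, on UNIX a successful write() only means the data reached the kernel’s buffer cache; the data is guaranteed on media only once fsync() returns (or the file was opened with O_SYNC). A minimal sketch of the distinction, in Python for brevity (the file path is illustrative):

        import os

        # A plain write() can "succeed" even when the storage path is broken:
        # success only means the data reached the kernel's buffer cache.
        fd = os.open("/tmp/example.dat", os.O_WRONLY | os.O_CREAT, 0o644)
        os.write(fd, b"message payload\n")

        # fsync() is the actual commit to disk: it blocks until the device
        # has the data and raises an error if the data cannot be written out.
        try:
            os.fsync(fd)
        except OSError as e:
            print("write was never committed:", e)
        finally:
            os.close(fd)

        # Opening with os.O_SYNC instead makes every write() synchronous.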

      Has anybody experienced that kind of problem? Could someone from QDX comment on the response I got from IBM?

      Sorry for the long explanation!

      Thank you for your time.

      Andre Duguay

      Mcgill University Health Centre

      andre.duguay@muhc.mcgill.ca

      • #56581
        Jim Kosloskey
        Participant

          Andre,

          Have you contacted Cloverleaf support directly?

          Jim Kosloskey

          email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

        • #56582
          Tom Patton
          Participant

            We are on 3.8.1, AIX 5.1, with an EMC SAN and have seen the same thing.  In fact, I ran the engine for an entire day that way 2 months ago.   I don’t believe the recovery writes are committed.  But both you and I have enough memory to handle messages going through without swapping to disk – which saved me some downtime.

            In these cases (I’ve had a couple), the SMAT files are never created, and each process directory has contained a number of 0-byte thread files that I needed to clean up once the SAN was reconnected.

            I was amazed, and happy, that the engine continued to run throughout the weekday without disk – that, to me, seemed to be a benefit rather than a drawback.

          • #56583
            Jeff Thomas
            Participant

              I

            • #56584
              Anonymous
              Participant

                Hi,

                We use EMC with both AIX and Sun. We had the wrong configuration in the Sun box and experienced something similar: the box stopped writing to the disk but the processes continued working from RAM.

                In some interfaces, we noticed that since the box was unable to write to the disk, the message was accepted but not ACK’ed. The sending application kept sending the same message again and again. Once the path to the disk was restored, all the messages in memory were saved and ACK’ed. We noticed this because we realized we had duplicate charges.
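                In hindsight, that is the classic at-least-once delivery trade-off: the sender retries until it gets an ACK, so any retry that is eventually committed shows up twice unless the receiver de-duplicates. A minimal sketch of commit-then-ACK with de-duplication (Python; the store path and message IDs are illustrative, not Cloverleaf internals):

                  import os

                  STORE = "/var/tmp/recovery.log"   # illustrative stand-in for a recovery database
                  seen_ids = set()                  # de-duplicate retries by message control ID

                  def handle(msg_id, payload):
                      """Durably commit the message, then ACK; drop duplicates."""
                      if msg_id in seen_ids:
                          return "ACK"              # already committed: ACK but do not re-process
                      fd = os.open(STORE, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
                      try:
                          os.write(fd, ("%s|%s\n" % (msg_id, payload)).encode())
                          os.fsync(fd)              # only now is the message really on disk
                      except OSError:
                          return None               # no ACK: the sender will retry
                      finally:
                          os.close(fd)
                      seen_ids.add(msg_id)
                      return "ACK"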

                We also found that when the EMC team updates drivers or does some maintenance, our Sun box is disconnected from the SAN for several seconds (up to 15 seconds at times). This is not a problem unless you have an interface that expects an ACK back in less than 15 seconds.

                We monitor the disconnects with a script that “touches” a file on the SAN every 10 seconds. The touch command is timed, and if it takes too long (more than 2 seconds) a page is issued. We were able to catch some instances where the paths to the disk were failing, which helped the support group diagnose the problem. Now we have the right configuration and the EMC admin team knows that we are notified… The number of short outages was reduced drastically too 😉 .
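                A minimal sketch of that heartbeat idea (Python here rather than the original shell script; the path, interval, threshold, and paging hook are all illustrative):

                  import os, time

                  SAN_FILE = "/qdx/heartbeat"   # illustrative file on the SAN-backed mount
                  INTERVAL = 10                 # seconds between probes
                  THRESHOLD = 2.0               # page if the touch takes longer than this

                  def send_page(msg):           # placeholder: hook in your paging system
                      print("PAGE:", msg)

                  while True:
                      start = time.monotonic()
                      try:
                          # Equivalent of `touch`: create if needed, update the mtime.
                          with open(SAN_FILE, "a"):
                              pass
                          os.utime(SAN_FILE, None)
                          elapsed = time.monotonic() - start
                          if elapsed > THRESHOLD:
                              send_page("SAN touch took %.1fs (threshold %.1fs)" % (elapsed, THRESHOLD))
                      except OSError as e:
                          send_page("SAN touch failed: %s" % e)
                      time.sleep(INTERVAL)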

              • #56585
                Andre Duguay
                Participant

                  Do we have redundant paths? We have 2 fibre channel adapters in each server, but both go to the same switch. This is not our final configuration; we will eventually have 2 switches.

                  Have I contacted Quovadx? Yes, they say that the writes are committed to disk  😕

                  There is a parameter at the process level called “Disk-based Queueing” … we’re not using that. Would someone know what that would change in this scenario?

                  Thanks

                  Andre Duguay

                • #56586
                  Andre Duguay
                  Participant

                    Hi,

                    Sorry, I should have searched the Forum on Disk-based queueing before asking the question … I just found a very clear description by Rob Abbott in the topic called “Disk based queuing 5.2 – set by default” initiated by Jim Kosloskey.

                    We will schedule another test with this option on. We will then have to determine the trade-off between performance and committed writes.

                    😕 From the comments I have seen, I do not seem to share the same concerns as the majority … is there something I don’t understand?

                    For me, it is important to have on disk an exact picture of all messages, in the state they are in at every moment, so that when I have a problem with my primary server (a crash with loss of memory), I can fail over to my backup and resume processing from what I see in the recovery database. If what is in the recovery database is an image from hours ago, I will resend messages that have already been sent … re-admitting patients, re-ordering tests, etc., which would lead to chaos. I know performance is important, but it should be second to the integrity of the delivery.

                    Thanks,

                    Andre Duguay

                  • #56587
                    Dan Goodman
                    Participant

                      Sort of tangential to this discussion, but we elected to mirror a local physical drive and a SAN drive. When a SAN outage occurs, AIX logs an error (visible with the errpt command) and the SAN side of the mirror goes stale, but the data is captured locally.

                      When the SAN connection is restored, AIX (5.2) resyncs the SAN disk to the up-to-date local disk.

                      Sometimes a belt and suspenders isn’t enough, and a piece of twine is all you have left to hang on to. Overall, the SAN is more robust when operational, but is subject to more frequent maintenance outages than local disk.
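                      If you would rather be paged when a mirror copy goes stale than notice it later, something like this could run from cron (a Python sketch around AIX’s lsvg; the volume group name is illustrative, and syncvg is normally not needed since AIX resyncs on its own):

                        import subprocess

                        VG = "datavg"   # illustrative volume group holding the mirrored LVs

                        # `lsvg -l <vg>` lists logical volumes; a mirror copy that has
                        # missed writes shows an LV STATE such as "open/stale".
                        out = subprocess.run(["lsvg", "-l", VG],
                                             capture_output=True, text=True).stdout
                        if "stale" in out:
                            print("stale mirror copies in %s; resync with: syncvg -v %s" % (VG, VG))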
