Cloverleaf downtime metrics

  • #55696
    Peter Heggie
    Participant

      What do you consider downtime? When there are pending messages but there should not be? How do you eliminate external causes (like network or ancillary system failure)?

      Peter Heggie

      • #86190
        Jim Kosloskey
        Participant

          I would think there are really tiers of downtime and some accommodation for the perspective probably should be made.

          For example, if the network or SAN is down, Cloverleaf is probably not going to be functioning. But is that a Cloverleaf or integration downtime? To the end users (either sending or receiving), I would think they would say the integration (Cloverleaf) is down. Technically that is incorrect.

          If a given trading partner is not functioning (let’s say a receiving system has an issue and is not responding with acknowledgments), is Cloverleaf down? No, but as far as the trading partner is concerned, the integration (and thus by association Cloverleaf) is down.

          So I think the term ‘downtime’ probably can have many connotations.

          And I think that definition or definitions can vary from one shop to another.

          email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

        • #86191
          Peter Heggie
          Participant

            Yes, and that goes along with the current philosophy of “guilty until proven innocent” (or ‘blame the middleware’ or ‘you lost my message’).

            I’m trying to find something measurable that shows Cloverleaf downtime – I guess I’m thinking of processes unable to manage messages, possibly due to a panic or bad tcl code or bad Xlate logic or licenses expiring, etc.

            So already I’m leaning towards individual processes. I’m going to put a lot of effort into marketing this and educating recipients about what is and what is not a Cloverleaf downtime. If an ancillary is down, it’s not our problem – it could be the ancillary system or it could be the network in between.

            Maybe pending messages is the key but only for Inbound threads, representing an internal issue (for instance, from the above list of possible problems).

            But I guess that does not include the time when the process is running fine, but all the messages are going to the Error DB because of faulty tcl code. Not sure how to count that as ‘minutes of downtime’. Maybe I need two measurements – minutes of downtime and number of Error DB messages generated by bad interface logic.
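A minimal sketch of keeping those two measurements side by side. Everything here is hypothetical: the `Snapshot` fields, the `two_measures` name, and the idea that a periodic collector can supply pending counts and new Error DB counts per thread are assumptions for illustration, not anything Cloverleaf provides in this form. Downtime is counted in thread-minutes, and only for inbound threads with pending messages above a threshold.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One periodic observation of one thread (hypothetical shape)."""
    thread: str
    inbound: bool            # True for inbound threads
    pending: int             # messages pending at sample time
    new_error_db: int        # Error DB messages added since the previous sample
    interval_minutes: float  # minutes since the previous sample


def two_measures(snapshots, pending_threshold=0):
    """Return (downtime thread-minutes, Error DB message count).

    An interval counts as downtime only when an inbound thread has more
    pending messages than pending_threshold. Error DB messages are
    summed over every thread, whether or not it was 'down'.
    """
    downtime = 0.0
    errors = 0
    for s in snapshots:
        if s.inbound and s.pending > pending_threshold:
            downtime += s.interval_minutes
        errors += s.new_error_db
    return downtime, errors
```

Keeping the two numbers separate means a process that is running but sending everything to the Error DB shows up in the second measure without inflating the first.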

            Peter Heggie

          • #86192
            Jim Kosloskey
            Participant

              Well, you could take the simplistic approach (which many times is better) and simply run hciprocstatus periodically on all your sites.

              Then evaluate the report for any processes which are not up.

              Any process not up then would be reported as having downtime.
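The periodic-sampling idea can be sketched roughly as below. This assumes the hciprocstatus report has already been parsed into `(timestamp, process, state)` tuples by some collector; the report format is not reproduced here, and the `"up"`/`"down"` state labels and the `downtime_minutes` function are placeholders, not Cloverleaf names. A process seen as not up at one sample is charged with downtime until its next sample.

```python
from datetime import datetime, timedelta


def downtime_minutes(samples, up_state="up"):
    """Total downtime per process from periodic status samples.

    samples: iterable of (timestamp, process, state), sorted by time.
    up_state: placeholder label for a healthy process (an assumption).
    """
    last = {}    # process -> (timestamp, state) of its previous sample
    totals = {}  # process -> accumulated downtime in minutes
    for ts, proc, state in samples:
        totals.setdefault(proc, 0.0)
        prev = last.get(proc)
        if prev is not None and prev[1] != up_state:
            # The process was not up at the previous sample: charge the gap.
            totals[proc] += (ts - prev[0]).total_seconds() / 60.0
        last[proc] = (ts, state)
    return totals
```

The resolution is only as good as the sampling interval: an outage shorter than the gap between two samples can be missed entirely, and any detected outage is rounded to whole intervals.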

              Once you get into the gradations of potential states considered to be down, I think measuring and reporting becomes much more difficult. But you do need a clear definition of what will be considered the different levels of ‘downtime’ or ‘lack of service’.

              For example, if your shop’s definition of downtime is any time messages are not flowing in a normal manner, consider the source ADT being down. That could be tens or hundreds of integrations being considered down. But Cloverleaf is most likely still processing other messages, so is Cloverleaf down?

              I think it is essential you negotiate a clear and accurate definition of the various levels of ‘lack of service’ tiers and conditions before attempting to automate the recognition and reporting of the states.
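Once tiers are agreed, reporting against them can stay simple. The sketch below is only an illustration: the three tier names are taken from the examples earlier in this thread (network or SAN, trading partner, the engine itself), and a real shop would substitute whatever definitions it negotiates. `Tier` and `minutes_by_tier` are hypothetical names.

```python
from enum import Enum


class Tier(Enum):
    """Illustrative 'lack of service' tiers; real ones must be negotiated."""
    INFRASTRUCTURE = "network or SAN outage"
    TRADING_PARTNER = "sending or receiving system not responding"
    ENGINE = "Cloverleaf process not handling messages"


def minutes_by_tier(incidents):
    """Sum incident minutes per tier.

    incidents: iterable of (Tier, minutes) pairs, already classified
    by a person or by some upstream rule.
    """
    totals = {tier: 0.0 for tier in Tier}
    for tier, minutes in incidents:
        totals[tier] += minutes
    return totals
```

Reporting every tier, including the ones with zero minutes, makes it visible which outages were attributed to the engine and which to something outside it.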

              Many moons ago I consulted in, among other things, Problem Management. The most onerous but essential task was negotiating the clear and accepted definition of what was a ‘problem’ including the definition of downtime/lack of service and the tracking steps towards problem resolution. A lot of time was spent in negotiating those definitions but when that was not done, the Problem Management solution was unsatisfactory.

              email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

            • #86193
              Peter Heggie
              Participant

                I agree 100% – the definitions have to be accepted and clearly defined.

                Taking what you said about high level definitions – I would want to say that downtime is when messages are received but not processed, or messages are not received because the input process is not accepting/picking them up. The latter could be hard to measure reliably and consistently.

                Basically – Cloverleaf is not doing its job of transforming and delivering messages. And yes, I guess we need to define categories. This is bigger than a bread box.

                Peter Heggie
