Well, you could take the simple approach (which is often the better one) and periodically run hciprocstatus on all your sites.
Then evaluate the report for any processes that are not up.
Any process that is not up would be reported as having downtime.
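As a rough illustration of that polling idea, here is a minimal Python sketch. It is not anything shipped with Cloverleaf: the report layout (process name in the first column, state in the second), the state text "running", and the function names parse_report and downtime are all assumptions of mine, so adjust them to what hciprocstatus actually prints at your site. It takes report text you have already captured on a schedule and adds up the time each process was seen as not up.

```python
from datetime import datetime, timedelta

# Assumption: the state text that counts as "up" in the captured report.
UP_STATES = {"running"}


def parse_report(text):
    """Return {process: state} from one captured status report.

    Assumes (hypothetically) one process per line, with the process name
    in the first whitespace-separated column and the state in the second.
    Header, separator, and blank lines are skipped.
    """
    states = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0] == "Process" or set(parts[0]) == {"-"}:
            continue
        states[parts[0]] = parts[1].lower()
    return states


def downtime(samples):
    """Sum downtime per process from time-ordered (timestamp, states) samples.

    A process that is not up at one poll is charged with the whole interval
    until the next poll, so the result is only as precise as the polling period.
    """
    totals = {}
    for (t0, states), (t1, _) in zip(samples, samples[1:]):
        for proc, state in states.items():
            if state not in UP_STATES:
                totals[proc] = totals.get(proc, timedelta()) + (t1 - t0)
    return totals
```

The polling period is the design choice that matters here: a five-minute poll cannot see an outage shorter than five minutes, and it rounds every outage up to a multiple of five minutes.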
Once you get into gradations of states that might be considered down, I think measuring and reporting become much more difficult. But you do need a clear definition of what counts as each level of ‘downtime’ or ‘lack of service’.
For example, if your shop’s definition of downtime is any time messages are not flowing in a normal manner, consider the case where the source ADT is down. That could mean tens or hundreds of integrations are considered down. But Cloverleaf is most likely still processing other messages, so is Cloverleaf down?
I think it is essential that you negotiate a clear and accurate definition of the various ‘lack of service’ tiers and their conditions before attempting to automate the recognition and reporting of those states.
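To make that concrete, here is one way negotiated tiers could be written down so that automation has something unambiguous to apply. This is only a sketch: the tier names, the rules, and the classify function are placeholders I made up for illustration, not Cloverleaf terminology, and your shop's agreed definitions would replace them.

```python
from enum import Enum


class Tier(Enum):
    # Hypothetical tier names; substitute whatever your shop negotiates.
    NORMAL = "normal"
    UPSTREAM_OUTAGE = "upstream outage"  # a source system is not sending
    PARTIAL_OUTAGE = "partial outage"    # some engine processes are not up
    ENGINE_DOWN = "engine down"          # no engine process is up


def classify(process_up, source_sending):
    """Map observed conditions to one agreed tier.

    process_up:     {process name: True if the process is up}
    source_sending: {source system: True if it is sending messages}
    """
    if process_up and not any(process_up.values()):
        return Tier.ENGINE_DOWN
    if not all(process_up.values()):
        return Tier.PARTIAL_OUTAGE
    if not all(source_sending.values()):
        return Tier.UPSTREAM_OUTAGE
    return Tier.NORMAL
```

Under these placeholder rules, the ADT example (every process up, but the ADT source silent) comes out as an upstream outage rather than engine down, which is exactly the kind of distinction that has to be agreed on before it is reported.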
Many moons ago I consulted in, among other things, Problem Management. The most onerous but essential task was negotiating a clear and accepted definition of what a ‘problem’ was, including the definition of downtime/lack of service and the tracking steps towards problem resolution. A lot of time was spent negotiating those definitions, but when that was not done, the Problem Management solution was unsatisfactory.
email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.