re: Alerts for Recovery DB
Mark Gathers
WVUH Hospitals
Assuming that you are utilizing the alert tool within the QDX IDE, this can be set up by adding a new alert row to the existing ones.
Click the Append button.
In the Append dialog box, for Alert type, select forward_count in the drop-down list.
Select the thread_name for the source list.
In the field adjacent to the == comparison, type the number of messages for the recovery database.
In the Action tab, select the alert method you want, either exec or notify.
I would prefer using exec and putting the following in the command field.
mailx -s '%A'
I wanted something that checks the whole Recovery Database rather than each thread individually. I decided to create a UNIX script that checks the message count in the Recovery DB using the following command:
mshcnt=`hcidbdump -r | wc -l | tr -d " "`
If mshcnt is over 1000, the script sends a warning alert message to us via paging and email. Works pretty well.
I am doing this because we neglected to cycle a process after deleting a route. Since the destination thread had been deleted, the messages continued to build up in the Recovery DB without anyone knowing.
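A minimal sketch of what such a check might look like. Only the hcidbdump pipeline comes from the post; the function names, the THRESHOLD variable, and the echo used in place of the mail/paging step are placeholders for illustration. Note that wc -l counts lines of dump output, so the figure is only a rough proxy for the number of messages.

```shell
#!/bin/sh
# Sketch of a Recovery DB size check. THRESHOLD, count_lines and check_count
# are illustrative names; only the hcidbdump pipeline comes from the post.
THRESHOLD=${THRESHOLD:-1000}

# Count lines on stdin, stripping the padding some wc implementations add.
count_lines() {
    wc -l | tr -d ' '
}

# Print "alert" when the count ($1) exceeds the limit ($2), otherwise "ok".
check_count() {
    if [ "$1" -gt "$2" ]; then
        echo alert
    else
        echo ok
    fi
}

# Only query the database when hcidbdump is actually on the PATH.
if command -v hcidbdump >/dev/null 2>&1; then
    mshcnt=$(hcidbdump -r | count_lines)
    if [ "$(check_count "$mshcnt" "$THRESHOLD")" = "alert" ]; then
        # Placeholder: replace this echo with your own mail or paging command.
        echo "WARNING: Recovery DB dump has $mshcnt lines (limit $THRESHOLD)"
    fi
fi
```

Run from cron at whatever interval suits the site; keeping the threshold in a variable makes it easy to tune per environment.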
Mark
I agree with you.
Here is the problem with using the hcidbdump command:
It talks to the daemon and also checks the database.
If for some reason the command is not executed correctly, there is a higher chance of the database becoming corrupted.
That's the reason I use the alert tool for monitoring it.
For instance: hcidbdump -f | wc -l
Similarly: hcidbdump -d
In the alert tool you can add alerts for each thread. In my case, I alert at a count of 50 for each thread. If a count reaches 50, I know that the destination thread is not processing the data.
Thanks
Reggie
We modified the 'hciconnstatus' script a few years ago to display the message count for outbound threads. It grabs the queued-message information from shared memory.
We actually use this script within our monitoring to alert on thread down/disconnect and queue information.
e.g.
Process      Connection           State  Proto Status  Count  Started
top_prod_rp  ahs_prod_rp_adt_out  up     up            0      27/04/06 15:39:06
top_prod_rp  ccm_prod_rp_adt_out  up     up            0      27/04/06 15:39:07
top_prod_rp  cdc_prod_rp_adt_out  up     up            0      27/04/06 15:51:10
top_prod_rp  cdr_prod_rp_adt_out  up     up            0      27/04/06 15:39:08
top_prod_rp  cwb_prod_rp_adt_out  up     up            0      27/04/06 15:39:09
top_prod_rp  eds_prod_rp_adt_out  up     up            0      27/04/06 15:39:11
top_prod_rp  har_prod_rp_adt_out  up     up            0      27/04/06 15:39:12
top_prod_rp  hmh_prod_rp_adt_out  up     up            0      14/06/06 15:50:00
top_prod_rp  ris_prod_rp_adt_out  up     up            0      27/04/06 15:39:15
top_prod_rp  sol_prod_rp_adt_out  up     up            0      27/04/06 15:39:16
top_prod_rp  sud_prod_rp_adt_out  up     up            0      27/04/06 15:39:21
top_prod_rp  top_prod_rp_adt_rcv  up     up            0      27/04/06 15:39:17
top_prod_rp  top_prod_rp_adt_snd  up     up            0      27/04/06 15:39:19
top_prod_rp  ult_prod_rp_adt_out  up     up            0      25/05/06 12:52:52
Since the last update on this topic was posted in 2006, I thought I'd bump this to see if 5.8 has a solution for this.
I'm trying to figure out if I can put some type of alert on the recovery DB so that an alert is triggered when the DB exceeds a given count of messages. Please note, I do not want an alert for each thread; I just want one for the whole recovery DB, which would indicate that our whole engine is slowing down (or extremely busy).
There is the "outbound queue depth" alert available that you could set up for each of the interfaces you want to monitor. Of course, this could get to be a huge job depending on how many sites and threads you have.
Since we monitor almost 800 threads, I wrote a shell program that loops through each site and calls a Tcl program, which goes through the MSI area of the site, pulls out the number of messages waiting, and creates a report that is emailed. Of course, I only report those interfaces that have data waiting that is more than 2 hours old.
I'm looking more for a collective count across all threads instead of individual threads. We do not have an operations staff monitoring the interfaces, so we rely on the alerts. I do not want to get a page in the middle of the night if I have one interface with 100 messages backed up, but I do want to get a page if the total of all messages in the recovery database is over 2000. To me this would be the equivalent of an operator calling me in the night to say "the NetMonitor is lit up like a Christmas Tree, please help". 🙂 I've thought about writing a script, but after researching it, it seems unsafe to write a script that runs a dbdump on the recovery database.
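One way to get a collective figure without touching the recovery database would be to add up per-thread queue counts from a status listing. The sketch below assumes a listing in the column layout shown earlier in this thread (Process, Connection, State, Proto Status, Count, Started), which came from a locally modified hciconnstatus that is not posted here; the function names and the 2000 limit are illustrative only.

```shell
#!/bin/sh
# Sum the Count column of a per-thread status listing read from stdin.
# Assumes whitespace-separated fields with the queue count in field 5;
# the header row is skipped because its fifth field is not numeric.
total_queued() {
    awk '$5 ~ /^[0-9]+$/ { sum += $5 } END { print sum + 0 }'
}

# Succeed (exit 0) when the total ($1) is above the paging limit ($2).
should_page() {
    [ "$1" -gt "$2" ]
}

# Hypothetical usage, assuming a status command that prints the listing:
#   total=$(your_status_command | total_queued)
#   should_page "$total" 2000 && echo "Total queued: $total" | your_pager_command
```

Because the limit applies to the sum, a single interface with 100 messages queued stays quiet, while a broad backlog across many threads trips the page.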
I had written a script to pull counts from the recovery database using hcidbdump. I never saw any corruption, but it did cause an I/O bottleneck whenever the number of messages in the recovery database got high.
As a solution, I ended up using msiAttach to pull the statistics of pending messages. Here is a slightly modified version of that script that should meet your needs: https://gist.github.com/2322016
Let me know what you think.
Thanks,
Eric