Recovery database question

  • Creator
    Topic
  • #49687
    Mark McDaid
    Participant

    Hello,

    This is my first post on Clovertech. I just completed the Level 1 class and the Tcl fundamentals class in November. My question is about the recovery database. My understanding was that once a message has been successfully sent, it is removed from the recovery db.

    Here is our situation. We have a thread that had connection problems to its destination, so the thread was eventually shut down with about 50,000 messages pending when you viewed the status (I’m assuming those 50,000 messages were in the recovery database at that point). The thread was down for a week or so, which was not a big deal because we are not live yet and this was for testing purposes. Last night we started the thread again, and all of those 50,000 messages went through. Today when I look at the recovery database, it has 86,135 messages in it.

    Are some of these the ones that were sent through, and if so, shouldn’t they have been removed from the recovery db? How can I tell what time and date a message was written to the db? If I do hcidbdump -r, I get a time stamp next to each entry, but no date. And when I dump to a file, I get all of the messages, but no date or time stamp. I appreciate any help anyone can offer. Thanks.

Viewing 12 reply threads
  • Author
    Replies
    • #63085
      John Hamilton
      Participant

      Use the -l option; it gives you a date and time you can use. I would also dump them in time order to make them easier to follow. That is “-O i”.

      You can get a listing of the options with -?.

    • #63086
      Mark McDaid
      Participant

      Thank you for the quick response.  I’ll give that a try.

    • #63087
      Mark McDaid
      Participant

      OK, I tried that but mistakenly did the -L instead of -l.  With 86,135 messages, it sat there scrolling through them so fast you can’t read them, but slow enough that after 20 minutes it was still going.  Shortly after I started this, every thread in the engine went down.  I had to abort the dump, which I know you’re not supposed to do, and we had to reboot the server and restart all of the threads.  Our interface is back up and running now.  Is this normal behavior?  Is there a way to view the recovery database without crashing our engine?  Thanks.

    • #63088
      Gary Atkinson
      Participant

      Try dumping all those messages into a file and then delete them.

    • #63089
      Mark McDaid
      Participant

      I know it depends on many factors, but on average, how many messages should be in the recovery database at any given moment during business hours?

    • #63090
      Jim Kosloskey
      Participant

      Mark,

      I don’t think a number can be set that fits everyone – or even close.

      It not only depends on the volume of messages but also the number of destinations, the power of the Cloverleaf(R) hardware platform, and how efficiently receiving systems accept messages.

      For example, if you have one receiving system that averages 30 seconds between receipt of a message and its acknowledgment, and you have a peak arrival rate of 1 message every 0.33 seconds, your pending count (and the number of Recovery DB messages) for that one thread alone will increase by about 90 messages during a single 30-second wait for a reply. Depending on how long the peak period lasts, you can see that it will grow very quickly.
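
      The arithmetic in that example can be checked with a short Python sketch. The function name and parameters are made up for illustration, and the model is deliberately simple: one message outstanding at a time and a steady arrival rate.

```python
def backlog_growth(ack_wait_s: float, interarrival_s: float) -> int:
    """Whole messages that arrive while one message waits for its acknowledgment."""
    return int(ack_wait_s / interarrival_s)


# Figures from the example: 30 s to acknowledge, one arrival every 0.33 s.
print(backlog_growth(ack_wait_s=30.0, interarrival_s=0.33))  # about 90
```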

      If your overall arrival rate exceeds 1 message every 30 seconds, then a thread that has gotten behind will probably stay behind for longer than just the peak demand period – maybe all day!!
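
      To see why a thread can stay behind all day, here is a rough sketch of how long a backlog takes to drain once the peak is over. It assumes steady rates and one acknowledgment per message; the function and parameter names are hypothetical, not anything from Cloverleaf.

```python
import math


def drain_time_s(backlog: int, service_interval_s: float, arrival_interval_s: float) -> float:
    """Seconds needed to clear a backlog, or math.inf if it can never clear."""
    service_rate = 1.0 / service_interval_s   # messages acknowledged per second
    arrival_rate = 1.0 / arrival_interval_s   # messages arriving per second
    net_rate = service_rate - arrival_rate    # how fast the backlog shrinks
    if net_rate <= 0:
        return math.inf                       # arrivals keep pace with or outrun the receiver
    return backlog / net_rate


# 90 messages behind, receiver takes 30 s per message:
print(drain_time_s(90, 30.0, 60.0))  # arrivals every 60 s: roughly 5400 s (90 minutes)
print(drain_time_s(90, 30.0, 20.0))  # arrivals every 20 s: inf, never catches up
```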

      What if there are other threads which cannot keep up with the arrival rate? Obviously additional messages will reside in the Recovery DB.

      I think you need to determine your peak demand period (typically measured in either 1 hour or 15 minute time frame) for ALL inbound threads combined as well as for each inbound thread individually. You can do this quite simply with the SMAT .idx files. If you can narrow down one period on one day each week to watch that would be best.
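
      Once the arrival timestamps have been extracted, finding the peak window is a simple counting exercise. The sketch below assumes you already have the timestamps as Python datetime objects (how to read them out of the SMAT .idx files is not covered here) and uses made-up sample data.

```python
from collections import Counter
from datetime import datetime


def peak_window(arrivals, window_minutes=15):
    """Return (window_start, count) for the busiest window.

    window_minutes should divide 60 evenly (for example 15 or 60).
    """
    if not arrivals:
        raise ValueError("no arrival timestamps supplied")
    counts = Counter()
    for ts in arrivals:
        # Round each timestamp down to the start of its window.
        start_minute = (ts.minute // window_minutes) * window_minutes
        counts[ts.replace(minute=start_minute, second=0, microsecond=0)] += 1
    return counts.most_common(1)[0]


# Made-up sample timestamps, for illustration only.
sample = [
    datetime(2024, 1, 1, 9, 1),
    datetime(2024, 1, 1, 9, 14),
    datetime(2024, 1, 1, 9, 16),
    datetime(2024, 1, 1, 10, 0),
]
start, count = peak_window(sample)
print(start, count)
```

      Running it once over all inbound threads combined and once per inbound thread gives both views described above.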

      And – you need to determine the performance of your receiving threads. Sometimes casual observation during the determined peak demand period is accurate enough. However, again using the SMAT files, you can analyze the average delay between sending a message and receiving the appropriate acknowledgment. You can also determine if there are consistent resends occurring for each thread (that might mean you need to tweak the wait times; you may even determine you can reduce the wait times, but that is your choice).

      A quick way to tell if resends are likely occurring is to zero out the stats for the thread in question just as a peak period is about to begin and simply check the Thread Status in the NetMonitor display as the peak period ends. If there are more messages going out than coming in – you likely have that many resends. It is then valuable to find out if that resend activity was evenly spread during the time period or concentrated. Again the SMAT files (in and out) can assist.

      Armed with the above intelligence, you should be able to estimate a general high water mark for the Recovery Database. Obviously if that mark is exceeded during your peak period it does not necessarily mean anything is amiss but warrants observation.

      If the high water mark is exceeded in any other period – either you now have a new candidate for the peak demand period, or the arrival rate peak observed is an anomaly, or there is a problem with one or more of the threads. In any case exceeding the high water mark outside of the determined peak demand period probably warrants review.
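
      The high water mark rule described in the last two paragraphs can be written down as a small decision function. This is only a sketch; the names and the three labels are invented for illustration.

```python
def assess_recovery_db(pending: int, high_water_mark: int, in_peak_period: bool) -> str:
    """Classify a Recovery DB message count against an estimated high water mark."""
    if pending <= high_water_mark:
        return "normal"
    if in_peak_period:
        # Above the mark during the peak: not necessarily a problem, but keep watching.
        return "observe"
    # Above the mark outside the peak: new peak period, anomaly, or a thread problem.
    return "review"


print(assess_recovery_db(pending=1200, high_water_mark=1000, in_peak_period=False))  # review
```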

      Of course, if one or more of the receiving system threads are down during any period, the number of messages in the Recovery DB will grow. During the peak period they will grow really fast.

      If you have your integrations split into multiple sites, you have one Recovery DB per site. That may or may not be an option should you find you have an issue and cannot resolve any of the other variables.

      In order to set the appropriate performance and capacity numbers, you need to spend some time gathering intelligence. The determination of the workload characteristics (peak demand period, individual thread efficiency, etc.) needs to be repeated periodically. This is because the pattern of demand can change due to the hospital business changing, adding additional inbound threads, increasing the number of Message/Event Types being sent, adding additional receiving threads (some of which may be poor performers), or other considerations.

      A generally accepted cycle for validating workload characteristics is every 6 months. That is only a reference point; constraints may cause you to do this more or less often.

      Jim Kosloskey

      email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

    • #63091
      Mark McDaid
      Participant

      Jim,

      I really appreciate the detailed response you provided.  That really helps.  I am very new to Cloverleaf and obviously there is only so much that can be learned in a 4 day class.  Now I have a good starting point from which I can proceed.  Thank you.   😀

    • #63092
      Kevin Scantlan
      Participant

      If you do a hcidbdump -r -L  you can put the output into a file or you can just pipe it into more or less:

      hcidbdump -r -L | more

    • #63093
      Mark McDaid
      Participant

      Thanks for everyone’s responses. One mistake I made was doing ‘hcidbdump -r testfile.txt’, where testfile.txt is a parameter for the hcidbdump command which dumps the actual messages to a file. What I wanted to do, and finally figured out, was pipe the output from the command to a file with ‘hcidbdump -r -l > output.txt’, so that I could see the dates that the messages were written to the database. This way we were able to determine that all of the 86,135 messages were written to the recovery database on Sept. 21 – Sept. 25. I still don’t know why those messages were never removed from the db, but we were able to determine that those messages were not needed and initialized the recovery database to clear out all of the messages.

    • #63094
      Charlie Bursell
      Participant

      WARNING  WARNING

      Be *VERY* careful when piping the output of hcidbdump to less or more.  The problem arises when there are many entries and you get tired of scrolling them so you just Control-C out.  

      THIS CAN CORRUPT THE DATABASE!

      I always redirect the output to a junk file like this:

         hcidbdump -r -L > foo

      Then I can scroll through the junk file at my leisure and simply remove the junk file when finished.
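
      If the dump is launched from a script instead of the shell, the same "redirect to a file, read it afterwards" pattern might look like the sketch below. The helper name is hypothetical; the only hcidbdump options mentioned are the ones already used in this thread, and the real command appears only in a comment so the sketch does not touch a live database.

```python
import subprocess
from pathlib import Path


def dump_to_file(cmd, out_path):
    """Run cmd to completion with stdout redirected to out_path; return the line count."""
    out = Path(out_path)
    with out.open("wb") as fh:
        # The command writes straight to the file, so there is no pager to break out of.
        subprocess.run(cmd, stdout=fh, check=True)
    with out.open("rb") as fh:
        return sum(1 for _ in fh)


# Assumed usage on a Cloverleaf host, mirroring the shell command above:
# dump_to_file(["hcidbdump", "-r", "-L"], "foo")
```

      The file can then be read, searched, and deleted without the dump command still holding the database.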

      My motto when using the database is: “Get in and get out quickly”

    • #63095
      Michael Hertel
      Participant

      I like to use a different user id too.

      hcidbdump -r -U MIKE -L > foo

      Just in case…

    • #63096
      Mark McDaid
      Participant

      Mike,

      Does that prevent potential conflicts with the engine accessing the database at the same time?  If so, I will definitely do that from now on.  Thanks for the tip.

      Mark.

    • #63097
      Michael Hertel
      Participant

      Not necessarily.

      I use this so that in the event I do control-c by mistake or something blows up, I haven’t affected the default user id of TEST.

      Also, say another support person is looking at the database at the same time and I don’t know it, we won’t walk on each other.

      I’m sure there are other benefits to using another user id, I’m just not aware of them.

      -mh
