Recovery database question

This topic has 13 replies, 7 voices, and was last updated 17 years, 7 months ago by Michael Hertel.

Creator

Topic
December 7, 2007 at 8:04 pm #49687
Mark McDaid
Participant
Hello,

This is my first post on Clovertech. I just completed the Level 1 class and the Tcl fundamentals class in November. My question is about the recovery database. My understanding was that once a message has been successfully sent, the message is removed from the recovery db. Here is the situation we have. We have a thread that had connection problems to its destination, so the thread was eventually shut down with about 50,000 messages pending when you viewed the status. (I’m assuming those 50,000 messages were in the recovery database at that point.) The thread was down for a week or so, which was not a big deal because we are not live yet and this was for testing purposes. Last night we started the thread again, and all of those 50,000 messages went through. Today when I look at the recovery database, it has 86,135 messages in it. Are some of these the ones that were sent through, and if so, shouldn’t they have been removed from the recovery db? How can I tell what time and date a message was written to the db? If I do hcidbdump -r, I get a time stamp next to each entry, but no date. And when I dump to a file, I get all of the messages, but no date or time stamp. I appreciate any help anyone can offer. Thanks.
Author

Replies
- December 7, 2007 at 8:13 pm #63085
  John Hamilton
  Participant
  Use the -l option; it gives you a date and time you can use. I would also dump them in time order, just to make them easy to follow. That is “-O i”.
  
  You can get a listing of the options with -?.
- December 7, 2007 at 8:31 pm #63086
  Mark McDaid
  Participant
  Thank you for the quick response. I’ll give that a try.
- December 7, 2007 at 9:23 pm #63087
  Mark McDaid
  Participant
  OK, I tried that but mistakenly did the -L instead of -l. With 86,135 messages, it sat there scrolling through them so fast you can’t read them, but slow enough that after 20 minutes it was still going. Shortly after I started this, every thread in the engine went down. I had to abort the dump, which I know you’re not supposed to do, and we had to reboot the server and restart all of the threads. Our interface is back up and running now. Is this normal behavior? Is there a way to view the recovery database without crashing our engine? Thanks.
- December 8, 2007 at 1:36 am #63088
  Gary Atkinson
  Participant
  Try dumping all those messages into a file and then delete them.
- December 9, 2007 at 2:42 am #63089
  Mark McDaid
  Participant
  I know it depends on many factors, but on average, how many messages should be in the recovery database at any given moment during business hours?
- December 9, 2007 at 4:45 pm #63090
  Jim Kosloskey
  Participant
  Mark,
  
  I don’t think a number can be set that fits everyone – or even close.
  
  It not only depends on the volume of messages but also the number of destinations, the power of the Cloverleaf(R) hardware platform, and how efficiently receiving systems accept messages.
  
  For example, if you have one receiving system that averages 30 seconds between receipt of message and acknowledgment and you have peak arrival rate of 1 message every .33 seconds, your pending count (and the number of Recovery DB messages) for that one thread alone in one 30 second wait for reply will increase by about 90 messages. Depending on how long the peak period lasts, you can see that will grow very quickly.
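  
  A rough sketch of that arithmetic in Python, assuming a steady arrival rate and a thread that sends one message at a time and waits for each acknowledgment before sending the next. The function names are made up for illustration and are not part of Cloverleaf:

  ```python
  import math

  def arrivals_during_wait(ack_delay_s, inter_arrival_s):
      """Messages that arrive while the thread waits for one acknowledgment."""
      return math.floor(ack_delay_s / inter_arrival_s)

  def backlog_growth(period_s, ack_delay_s, inter_arrival_s):
      """Approximate growth in pending messages over a period.

      Assumes a steady arrival rate and one message delivered per ack delay.
      """
      arrival_rate = 1.0 / inter_arrival_s   # messages per second coming in
      delivery_rate = 1.0 / ack_delay_s      # messages per second going out
      return max(0.0, period_s * (arrival_rate - delivery_rate))

  # The example above: 30 second acknowledgment, 1 message every .33 seconds.
  print(arrivals_during_wait(30, 0.33))            # 90
  print(round(backlog_growth(15 * 60, 30, 0.33)))  # growth over a 15 minute peak
  ```

  Real traffic is bursty, so treat the result as a ballpark figure rather than a prediction.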
  
  If your overall arrival rate exceeds 1 message every 30 seconds, then that thread that has gotten behind will probably stay behind for a longer time than just the peak demand period – maybe all day!!
  
  What if there are other threads which cannot keep up with the arrival rate? Obviously additional messages will reside in the Recovery DB.
  
  I think you need to determine your peak demand period (typically measured in either 1 hour or 15 minute time frame) for ALL inbound threads combined as well as for each inbound thread individually. You can do this quite simply with the SMAT .idx files. If you can narrow down one period on one day each week to watch that would be best.
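  
  One way to find the busiest period, sketched in Python. This assumes the arrival times have already been pulled out of the SMAT .idx files into `datetime` objects; how to read the .idx files is not covered here, and `busiest_window` is a hypothetical helper name:

  ```python
  from collections import Counter
  from datetime import datetime

  def busiest_window(timestamps, minutes=15):
      """Return (window_start, count) for the fixed window with the most arrivals.

      `minutes` should divide 60 evenly (for example 15 or 60).
      Returns None when there are no timestamps.
      """
      counts = Counter()
      for ts in timestamps:
          # Floor each timestamp to the start of its window.
          floored_minute = ts.minute - ts.minute % minutes
          window_start = ts.replace(minute=floored_minute, second=0, microsecond=0)
          counts[window_start] += 1
      if not counts:
          return None
      return counts.most_common(1)[0]

  # Made-up arrival times for one inbound thread.
  arrivals = [
      datetime(2007, 12, 7, 9, 1),
      datetime(2007, 12, 7, 9, 5),
      datetime(2007, 12, 7, 9, 14),
      datetime(2007, 12, 7, 9, 16),
      datetime(2007, 12, 7, 10, 0),
  ]
  print(busiest_window(arrivals))              # busiest 15 minute window
  print(busiest_window(arrivals, minutes=60))  # busiest 1 hour window
  ```

  Run it once per inbound thread and once on the combined list to get both views.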
  
  And – you need to determine the performance of your receiving threads. Sometimes casual observation during the determined peak demand period is accurate enough. However, again using the SMAT files, you can analyze the average delay between sending a message and receiving the appropriate acknowledgment. You can also determine if there are consistent resends occurring for each thread. That might mean you need to tweak the wait times; you may even determine you can reduce them, but that is your choice.
  
  A quick way to tell if resends are likely occurring is to zero out the stats for the thread in question just as a peak period is about to begin and simply check the Thread Status in the NetMonitor display as the peak period ends. If there are more messages going out than coming in – you likely have that many resends. It is then valuable to find out if that resend activity was evenly spread during the time period or concentrated. Again the SMAT files (in and out) can assist.
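  
  The comparison is simple enough to script once you have the two counts. A minimal Python sketch, assuming the stats were zeroed at the start of the period and that each inbound message is expected to produce exactly one outbound message on this thread (the function names are hypothetical):

  ```python
  def estimate_resends(msgs_in, msgs_out):
      """Estimate resends as outbound messages in excess of inbound ones."""
      return max(0, msgs_out - msgs_in)

  def resend_ratio(msgs_in, msgs_out):
      """Estimated resends per inbound message; 0.0 when nothing came in."""
      if msgs_in == 0:
          return 0.0
      return estimate_resends(msgs_in, msgs_out) / msgs_in

  # Made-up counts read from the thread status at the end of a peak period.
  print(estimate_resends(1000, 1040))  # 40
  ```

  If the thread filters or fans out messages, the one-to-one assumption does not hold and the counts need adjusting first.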
  
  Armed with the above intelligence, you should be able to estimate a general high water mark for the Recovery Database. Obviously if that mark is exceeded during your peak period it does not necessarily mean anything is amiss but warrants observation.
  
  If the high water mark is exceeded in any other period – either you now have a new candidate for the peak demand period, or the arrival rate peak observed is an anomaly, or there is a problem with one or more of the threads. In any case exceeding the high water mark outside of the determined peak demand period probably warrants review.
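  
  Those rules of thumb can be written down as a small check. This is only a sketch in Python; the function name and the three labels are made up, and the high water mark itself is whatever estimate you arrived at from the analysis above:

  ```python
  def review_recovery_db(count, high_water_mark, in_peak_period):
      """Classify a Recovery DB message count against the estimated high water mark."""
      if count <= high_water_mark:
          return "ok"
      if in_peak_period:
          # Over the mark during the peak: not necessarily a problem, keep watching.
          return "observe"
      # Over the mark outside the peak: new peak, anomaly, or a thread problem.
      return "review"

  print(review_recovery_db(500, 2000, in_peak_period=True))    # ok
  print(review_recovery_db(2500, 2000, in_peak_period=True))   # observe
  print(review_recovery_db(2500, 2000, in_peak_period=False))  # review
  ```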
  
  Of course, if one or more of the receiving system threads are down during any period, the number of messages in the Recovery DB will grow. During the peak period they will grow really fast.
  
  If you have your integrations split into multiple sites, you have one Recovery DB per site. That may or may not be an option should you find you have an issue and cannot resolve any of the other variables.
  
  In order to set the appropriate performance and capacity numbers, you need to spend some time gathering intelligence. The determination of the workload characteristics (peak demand period, individual thread efficiency, etc.) needs to be repeated periodically. This is because the pattern of demand can change due to the hospital business changing, additional inbound threads, an increase in the number of Message/Event Types being sent, additional receiving threads (some of which may be poor performers), or other considerations.
  
  A generally accepted cycle for validating workload characteristics is every 6 months. That is only a reference point; constraints may cause you to do this more or less often.
  
  Jim Kosloskey
  
  email: jim.kosloskey@jim-kosloskey.com 30+ years Cloverleaf, 60 years IT – old fart.
- December 10, 2007 at 12:27 am #63091
  Mark McDaid
  Participant
  Jim,
  
  I really appreciate the detailed response you provided. That really helps. I am very new to Cloverleaf and obviously there is only so much that can be learned in a 4 day class. Now I have a good starting point from which I can proceed. Thank you. 😀
- December 10, 2007 at 3:27 pm #63092
  Kevin Scantlan
  Participant
  If you do a hcidbdump -r -L you can put the output into a file or you can just pipe it into more or less:
  
  hcidbdump -r -L | more
- December 10, 2007 at 3:37 pm #63093
  Mark McDaid
  Participant
  Thanks for everyone’s responses. One mistake I made was doing ‘hcidbdump -r testfile.txt’, where testfile.txt is a parameter for the hcidbdump command which dumps the actual messages to a file. What I wanted to do, and finally figured out, was redirect the output from the command to a file with ‘hcidbdump -r -l > output.txt’, so that I could see the dates that the messages were written to the database. This way we were able to determine that all of the 86,135 messages were written to the recovery database on Sept. 21 – Sept. 25. I still don’t know why those messages were never removed from the db, but we were able to determine that those messages were not needed and initialized the recovery database to clear out all of the messages.
- December 10, 2007 at 7:30 pm #63094
  Charlie Bursell
  Participant
  WARNING WARNING ❗
  
  Be *VERY* careful when piping the output of hcidbdump to less or more. The problem arises when there are many entries and you get tired of scrolling them so you just Control-C out.
  
  THIS CAN CORRUPT THE DATABASE!
  
  I always redirect the output to a junk file like this:
  
  hcidbdump -r -L > foo
  
  Then I can scroll through the junk file at my leisure and simply remove the junk file when finished.
  
  My motto when using the database is: “Get in and get out quickly”
- December 10, 2007 at 9:50 pm #63095
  Michael Hertel
  Participant
  I like to use a different user id too.
  
  hcidbdump -r -U MIKE -L > foo
  
  Just in case…
- December 11, 2007 at 1:31 pm #63096
  Mark McDaid
  Participant
  Mike,
  
  Does that prevent potential conflicts with the engine accessing the database at the same time? If so, I will definitely do that from now on. Thanks for the tip.
  
  Mark.
- December 11, 2007 at 3:03 pm #63097
  Michael Hertel
  Participant
  Not necessarily.
  
  I use this so that in the event I do control-c by mistake or something blows up, I haven’t affected the default user id of TEST.
  
  Also, say another support person is looking at the database at the same time and I don’t know it, we won’t walk on each other.
  
  I’m sure there are other benefits to using another user id, I’m just not aware of them.
  
  -mh
The forum ‘Cloverleaf’ is closed to new topics and replies.