High Volume Strategies

Tagged: highvolume

This topic has 11 replies, 5 voices, and was last updated 5 years, 1 month ago by Ryan Boone.

May 26, 2020 at 1:43 pm #116883
Ryan Boone
Participant
Does anyone have a workable high-volume strategy they could share for when multiple high-volume external systems are down (due to system or VPN issues)? Several interfaces amass 100k messages in a matter of a few hours, so it becomes a heavy burden on the recovery databases fairly quickly. Are the best options to write to file (where FIFO can apparently be an issue), or to isolate heavy-load interfaces into their own sites?

Replies
- May 26, 2020 at 3:48 pm #116884
  Keith McLeod
  Participant
  I make use of a series of alerts.
  1. If the prexqd > 4000 once, then hold the reply.
  2. If the prexqd > 2900 for 2 min, send a notification email.
  3. If the prexqd < 3000 once, release the reply.
  This allows for 1000 messages to be processed while holding the reply and then allows 1000 more from the source before pausing the flow again. Kind of an alert throttle to help protect from the flood. We have had scripts run on source systems that would generate 100s of thousands of messages and bury the engines….
  
  Hope this helps….
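
The hold/release pair in Keith's alerts is a hysteresis loop: replies are held once the queue depth passes the upper threshold and are released only after it falls below the lower one. A minimal sketch of that decision logic, assuming the queue depth is sampled periodically, might look like the following. `ReplyThrottle` and its field names are hypothetical, this is not Cloverleaf alert configuration, and the notification alert (rule 2) is left out.

```python
from dataclasses import dataclass


@dataclass
class ReplyThrottle:
    """Hold/release decision with hysteresis on a queue-depth reading."""

    hold_above: int = 4000      # start holding replies once depth exceeds this
    release_below: int = 3000   # stop holding once depth drops under this
    holding: bool = False

    def update(self, queue_depth: int) -> bool:
        """Feed one queue-depth sample; return True while replies are held."""
        if not self.holding and queue_depth > self.hold_above:
            self.holding = True
        elif self.holding and queue_depth < self.release_below:
            self.holding = False
        return self.holding


throttle = ReplyThrottle()
samples = [3500, 4001, 3600, 3000, 2999, 3500, 4100]
states = [throttle.update(depth) for depth in samples]
```

The gap between the two thresholds is what keeps the flow from flapping: once held, roughly 1000 messages have to drain before the source is allowed to send again.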
- May 26, 2020 at 6:33 pm #116889
  Jim Kosloskey
  Participant
  Are you referring to inbound or outbound threads being an issue?
  
  I have used multiple sites and that reduces the issue but extreme situations can still cause problems.
  
  An issue I have seen with ‘throttling’ the inbound is some source systems cannot manage or tolerate queues on their side.
  
  I have also used temporary routing to files. There are potential issues with that model as well, as you have indicated.
  
  If I were to have many systems which do not stay up and they will not resolve the issue, I would lump them together in their own site and inform the system owners their interfaces will be slow and potentially backed up and it is not the fault of Cloverleaf. If they resolve their system such that it is reliable then they can be moved to a site with the ‘good kids’. This does not resolve the issue with the Recovery DB but it focuses the pain on those systems that do not play right.
  
  email: jim.kosloskey@jim-kosloskey.com 30+ years Cloverleaf, 60 years IT – old fart.
- May 27, 2020 at 9:50 am #116891
  Ryan Boone
  Participant
  Hi Jim,
  
  “If I were to have many systems which do not stay up and they will not resolve the issue, I would lump them together in their own site and inform the system owners their interfaces will be slow and potentially backed up and it is not the fault of Cloverleaf. If they resolve their system such that it is reliable then they can be moved to a site with the ‘good kids’. This does not resolve the issue with the Recovery DB but it focuses the pain on those systems that do not play right.”
  
  Yes, in theory this would be good. But the larger issue is that we have such high volume that it’s not entirely the fault of the OB systems that the interfaces queue up so quickly. And then, if we have VPN issues, that also is not the fault of the external system. So I’m wondering if there is a strategy (preferably automated) to handle such volume without causing a process and/or site crash that would then proliferate the backups to other sites or to the sending system — precluding writing the files to disk to be picked up by another thread since that does not guarantee FIFO.
  - May 28, 2020 at 9:56 am #116902
    Jim Kosloskey
    Participant
    So the issue is that some receiving system cannot keep up with the pace, so that an interruption in the connection takes you over the cliff.
    
    Has it been determined the receiving system has done everything that can be done to improve their performance so they can keep up? Do they need to receive all the messages (maybe if you can reduce the load they can keep up)?
    
    If so, I suppose one way is to send their messages to a numfile. Then set up a Fileset Local thread to pick up the numfiles, using the throttling capabilities built in for that protocol, and send those messages on to the OB thread.
    
    So there should not be a queue in the Recovery DB.
    
    Obviously the messages will not be delivered anywhere close to real time (but probably aren’t now if the system can’t keep up).
    
    This will consume disk space as the numfile messages build up so you need to make sure you have that capacity.
    
    Also tracing and auditing the flow of messages will be more challenging.
    
    If you need to use the Message Metadata, know it will disappear when the message goes to the numfile unless you embed the metadata in the message (then place it back in the metadata upon Fileset Local Read).
    
    All of the above is just off the top of my head.
    
    The real proper solution in my mind is for the receiving system to be altered as necessary so it can keep up with the arrival rate.
    
    This reply was modified 5 years, 1 month ago by Jim Kosloskey.
    
    email: jim.kosloskey@jim-kosloskey.com 30+ years Cloverleaf, 60 years IT – old fart.
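
Jim's point about metadata being lost when a message is written to a file can be illustrated with a small envelope that carries the metadata alongside the message and restores both on read. This is only a sketch under assumptions: the function names and the metadata keys (`source`, `received_at`) are hypothetical, and it does not use any Cloverleaf API.

```python
import base64
import json
from typing import Tuple


def wrap_for_file(message: bytes, metadata: dict) -> bytes:
    """Pack a message and its metadata into one newline-terminated record."""
    envelope = {
        "metadata": metadata,
        # Base64 keeps HL7 carriage returns from breaking the one-line record.
        "message_b64": base64.b64encode(message).decode("ascii"),
    }
    return json.dumps(envelope, sort_keys=True).encode("utf-8") + b"\n"


def unwrap_from_file(record: bytes) -> Tuple[bytes, dict]:
    """Recover the original message bytes and metadata from one record."""
    envelope = json.loads(record.decode("utf-8"))
    return base64.b64decode(envelope["message_b64"]), envelope["metadata"]


original = b"MSH|^~\\&|SRC|HOSP|DST|HOSP|20200528095600||ADT^A08|123|P|2.3\rPID|1\r"
meta = {"source": "adt_in", "received_at": "2020-05-28T09:56:00"}
record = wrap_for_file(original, meta)
restored_message, restored_meta = unwrap_from_file(record)
```

The writing thread would call something like `wrap_for_file` before the file is created, and the reading thread would call `unwrap_from_file` and copy the values back into the message metadata before routing.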
- May 28, 2020 at 2:19 am #116899
  Charlie Bursell
  Participant
  Do you have High Availability (HA) with auto fail over? That along with a good UPS system should solve most of your problems.
- May 29, 2020 at 4:58 pm #116931
  Ryan Boone
  Participant
  Jim Kosloskey wrote:
  
  So the issue is that some receiving system cannot keep up with the pace, so that an interruption in the connection takes you over the cliff. Has it been determined the receiving system has done everything that can be done to improve their performance so they can keep up? Do they need to receive all the messages (maybe if you can reduce the load they can keep up)?
  
  Thanks, Jim.
  
  Yes, we have several “send everything” interfaces, so ADT messages for multiple hospitals queue up quickly. One receiving system is having difficulty keeping up. But the larger issue is that it only takes a few hours for messages to queue to rdb-busting volumes if an external system is down or if we experience VPN issues. It would be nice to have true “disk-based queuing” capability once the threshold reaches a certain level, then back to normal under that level. The solution may be to load-balance using dedicated AIPs.
- June 1, 2020 at 9:43 am #116941
  Jeff Dinsmore
  Participant
  We have used a few techniques in situations similar to this.
  
  The first was with an interface that was augmenting data in messages with database queries to our Paragon DB (back when we were using Paragon…).
  
  If the Paragon DB was unavailable for some period of time – during an upgrade, for example – we needed to wait until it was available again to resume processing.
  
  The solution we developed was to write all inbound messages to a SQLite DB. Then we would read available messages from that DB – sorted in order received – and send them if the Paragon DB was available.
  
  Under normal circumstances, the received message would be written into the SQLite DB, then immediately re-read from the DB, processed and then sent outbound – one in, one out.
  
  During a Paragon downtime, the messages would be queued until Paragon’s DB came back online. The queued messages would then be read from SQLite, processed and sent outbound.
  
  Another similar approach I’ve used in the past was to write all inbound messages to files – one per message – with appropriate naming so that they would sort in the order received.
  
  The files were named something like CCYYMMDDHHMMSSiii.txt – a datestamp with an iii index that accommodated multiple messages received in the same second.
  
  The outbound would then read the file names, sort them oldest to newest, then process and send them outbound.
  
  Both of these were done with Tcl.
  
  Jeff Dinsmore
  Chesapeake Regional Healthcare
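
The SQLite approach Jeff describes (write every inbound message to a table, then read it back in the order received and send it only when the downstream system is available) can be sketched roughly as below. His implementation was in Tcl; this Python version is an illustration under assumptions, and the table name, function names, and the `send` callback are all hypothetical.

```python
import sqlite3
from typing import Callable


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the queue database; the id column preserves arrival order."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " body BLOB NOT NULL)"
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, body: bytes) -> None:
    """Store one inbound message; the transaction commits on leaving the block."""
    with conn:
        conn.execute("INSERT INTO queue (body) VALUES (?)", (body,))


def drain(conn: sqlite3.Connection, send: Callable[[bytes], bool]) -> int:
    """Send queued messages oldest first; stop at the first failed send."""
    sent = 0
    while True:
        row = conn.execute(
            "SELECT id, body FROM queue ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            break
        msg_id, body = row
        if not send(body):
            break  # downstream unavailable: leave the message queued
        with conn:
            conn.execute("DELETE FROM queue WHERE id = ?", (msg_id,))
        sent += 1
    return sent
```

In normal operation each `enqueue` would be followed immediately by a `drain`, giving the one-in, one-out behavior; during a downtime `send` returns `False`, rows accumulate on disk, and the next successful `drain` empties them in order. A message is deleted only after `send` reports success, so a crash between the two steps could resend one message rather than lose it.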
- June 1, 2020 at 12:58 pm #116946
  Ryan Boone
  Participant
  Thanks, Jeff.
  
  How many messages did you queue in the SQLite database at the most? That is something we are considering but would need to volume test it.
  
  We did write individual messages to file, but at the pace the messages were being written, the order was not always guaranteed, regardless of whether they were written/retrieved by alphanumeric name or timestamp. That did allow us to regulate the volume successfully, and was generally okay for ADT, but for SIU message pairs sent from Epic we discovered they could be retrieved by the fileset-local thread out of order.
- June 1, 2020 at 1:32 pm #116947
  Jeff Dinsmore
  Participant
  I don’t recall the message volume, but SQLite performance has not been a problem for us.
  
  One of our current SQLite databases is running around 700 MB with nearly four million rows in its largest table.
  
  What kind of message volume do you need to support?
  
  Since SQLite DBs are a single file, you’d be constrained by the size supported by your OS. You could, of course, build a solution that supports multiple SQLite DB files, but that’s a bit more complex.
  
  Again, my file-based solution was completely Tcl based, so could control exactly how the listed files were sorted.
  
  I’ve not used fileset-local, but I’d expect it may allow running of custom code that returns the list of files found. If so, that would allow you to control the order.
  
  I would recommend starting with the SQLite solution. I think that’ll be your best bet.
  
  Jeff Dinsmore
  Chesapeake Regional Healthcare
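
For the file-per-message variant, the ordering guarantee comes from the names themselves: a fixed-width timestamp plus a zero-padded index sorts lexically in the same order the messages arrived. A rough sketch of that naming scheme follows; the class and function names are hypothetical, and it assumes a single writer and a clock that does not step backwards.

```python
from datetime import datetime
from typing import Iterable, List


class FileNamer:
    """Produce CCYYMMDDHHMMSSiii.txt names that sort in the order generated."""

    def __init__(self) -> None:
        self._last_stamp = ""
        self._index = 0

    def next_name(self, now: datetime) -> str:
        stamp = now.strftime("%Y%m%d%H%M%S")
        if stamp == self._last_stamp:
            self._index += 1          # another message in the same second
        else:
            self._last_stamp, self._index = stamp, 0
        if self._index > 999:
            raise OverflowError("more than 1000 messages in one second")
        return f"{stamp}{self._index:03d}.txt"


def in_receive_order(names: Iterable[str]) -> List[str]:
    """Fixed-width names make a plain lexical sort chronological."""
    return sorted(names)


namer = FileNamer()
first = namer.next_name(datetime(2020, 6, 1, 9, 43, 0))
second = namer.next_name(datetime(2020, 6, 1, 9, 43, 0))
third = namer.next_name(datetime(2020, 6, 1, 9, 43, 1))
ordered = in_receive_order([third, first, second])
```

A three-digit index caps the scheme at 1000 messages per second per writer, which is why the sketch raises an error rather than silently producing a name that would sort out of order.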
- June 1, 2020 at 2:41 pm #116948
  Ryan Boone
  Participant
  Thanks again, Jeff. 4 million records would definitely be enough and then some (running on AIX). Some of our ADTs are quite large but it sounds like there’s a lot of room there for a large volume. I can always test to get a message count / size comparison. We do have TCL-based SQL interfaces as well, so that’s where we’ll start.
  
  Regards,
  
  Ryan
- June 9, 2020 at 9:08 am #117055
  Ryan Boone
  Participant
  Using the directory parse TCL in fileset-local works. However, when there are over 100k messages in the inbound directory, it can take a while for the thread to scan (list) the directory contents, and the sheer number of filenames is a lot to read into the script and sort (especially if you have longer filenames). To remedy this, configure the directory parse script to read 500 kbytes (or whatever is reasonable), sort the filenames, then remove the last one in the list (because it won’t be a full filename after reading in a specific number of kbytes). The thread will pick up that file on the next scan. Then set the scan interval to a lower value (like 5) — as the thread will not scan the directory while it’s reading data anyway, but once it’s finished processing you don’t want to waste time before the next scan is executed.
  
  Adding a sequence number to MSH-13 and validating messages as they are read into the fileset-local inbound thread indicates the messages are in order, and none are missed.
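
The two ideas in Ryan's last post (cap how much of the directory listing is read per scan, and verify order with a sequence number) can be sketched as below. The function names are hypothetical and this is not the fileset-local directory parse API. One deliberate difference from the description above: the sketch drops the possibly cut-off trailing name before sorting, on the assumption that the fragment is the last entry in the raw listing rather than the last entry after the sort.

```python
from typing import List, Optional, Tuple


def names_from_partial_listing(raw: bytes, limit: int) -> List[str]:
    """Read at most `limit` bytes of a newline-separated listing and sort it."""
    chunk = raw[:limit]
    parts = chunk.split(b"\n")
    truncated = len(raw) > limit
    if truncated and not chunk.endswith(b"\n"):
        # The final piece may be cut off mid-name; leave it for the next scan.
        parts = parts[:-1]
    return sorted(p.decode("ascii") for p in parts if p)


def first_gap(sequence_numbers: List[int]) -> Optional[Tuple[int, int]]:
    """Return (expected, found) at the first break in the sequence, else None."""
    for prev, cur in zip(sequence_numbers, sequence_numbers[1:]):
        if cur != prev + 1:
            return (prev + 1, cur)
    return None


listing = b"003.txt\n001.txt\n002.txt\n004.txt\n"
partial = names_from_partial_listing(listing, 27)
complete = names_from_partial_listing(listing, 100)
```

`first_gap` stands in for the MSH-13 check: feeding it the sequence numbers in the order the thread read the messages shows whether anything arrived out of order or went missing.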