Background Info
Let’s use the following as an example set-up:
site 1:
adt_in —> js3_adt
site 3:
(process_hub)
jr1_adt —> hs_app1_adt
—> hs_app2_adt
(process app1)
hr_app1_adt —> app1_adt_out
(process app2)
hr_app2_adt —> app2_adt_out
This all started because an ill-advised backload of data in the Production feeds (requested by app1’s users) sent a flood of HL7 messages to app2, which cannot handle ADT messages with discharge dates earlier than the current date. All of these messages errored, which slowed processing of that queue and eventually backed it up into the previous process, one of the hubs on the site. That hub process was in turn so affected that its queues began backing up into the process before it.
In an attempt to stop the madness, we decided to compare the discharge date with today’s date and suppress the messages with old dates, at least temporarily, until the surge was over. A call to a tclproc called getToday was put into the pre-proc. However, it was incorrectly coded:
set dtToday getToday( )
instead of
set dtToday [getToday]
…and this caused tcl call out errors in the process hub.
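To make the difference concrete, here is a minimal sketch of the date check in plain Tcl. The body of getToday is not shown in this post, so the version below is a hypothetical stand-in, and isStaleDischarge is an illustrative helper name, not something from the actual pre-proc. It assumes the discharge date arrives as an HL7 timestamp beginning with YYYYMMDD.

```tcl
# Hypothetical stand-in for the getToday tclproc (the real body is not
# shown in this post). Returns today's date as YYYYMMDD.
proc getToday {} {
    return [clock format [clock seconds] -format %Y%m%d]
}

# Illustrative helper (assumed name): returns 1 when the discharge
# timestamp falls on a day before $today, otherwise 0. An empty or
# malformed value is treated as "not stale" so the message passes through.
proc isStaleDischarge {dischargeTs today} {
    set day [string range $dischargeTs 0 7]
    if {![regexp {^\d{8}$} $day]} {
        return 0
    }
    # Fixed-width YYYYMMDD strings sort in chronological order.
    return [expr {[string compare $day $today] < 0}]
}

# Command substitution in Tcl uses square brackets. Writing
#   set dtToday getToday( )
# hands three arguments to set, which Tcl rejects as "wrong # args".
set dtToday [getToday]
```

With these definitions, isStaleDischarge 20240101120000 20240102 returns 1, while a same-day or empty discharge value returns 0. How the pre-proc then suppresses the message is left out here, since that depends on the engine's own disposition handling.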
Eventually, the recovery and error dbs were pretty much full, the site was beginning to hang, and there was a db vista error. So we went through the steps of bringing everything in the site down, dumping the dbs to files, etc. However, when we tried to bring the site up, it didn’t want to come back up. The first two times, it went straight to a db vista -921.
The Issue:
The third time, everything looked good in the GUI: all the icons were green and said “up”, and when I looked in the logs, data “appeared” to be processing.
However, it was not. Messages were writing inbound into the processes but never writing outbound. In the log files, you could see each message as it arrived inbound and as it was processed through all the pre-procs (b/c I insist that all of the tclprocs write to the log to indicate success or failure), but none of the outbound procs (save_ob_msg, validate_reply, or resend_ob_msg) were being called on the outbound threads.
This occurred in multiple ADT processes on site 3 for longer than 24 hours, and the nightly cyclesave “bounce” of the threads did not resolve the issue. The entire processes had to be reloaded/“bounced” to resolve it. All of the messages had to be manually resent from the SMAT in site 1. They were not in the recovery database, nor were they appropriately stored in the SMAT in the site 3 processes during that time, since they were re-init.
More bizarrely, this behavior was not consistent. The 8 ftp processes on the site continued to work correctly, and all non-ADT processes continued to work fine.