Site A has processes A1 & A2.
Process A1 has thread 1 that sends messages to thread 2 in Process A2.
Site B has processes B1 & B2.
Process B1 has thread 3 that sends messages to thread 4 in Process B2.
1. Saturday morning: Process A2 panicked and shut down; the log file shows no reason. A2 was not started back up until Monday morning.
2. For the next 24-30 hours, about 80 messages queued up in the recovery DB in Process A1, waiting for A2 to come back up.
3. Sunday afternoon: thread 4 in Process B2 received those 80 messages. They never went through Process B1 or thread 3. The messages were Xlated with the route in Process A1, but went through the OB TPS of thread 4.
4. Those 80 messages stayed in Site A’s recovery DB until Monday morning, when Process A2 was restarted and A1 was cycled.
Basically, it looks like the messages went through A1’s Xlate but were then sent to the wrong OB Pre-TPS queue: that of an interface at a completely different site.
The log files do not show any “resend” commands, and no one was working who could have resent the messages with SMAT, or dumped them from the RDB and sent them to the other thread.
The log file of Process B2 shows the messages being sent out, and their metadata looks as if they came directly from Process A1: the source and destination threads are 1 and 2, NOT 3 or 4.
Process B2 logs this error for each message, I assume because it’s trying to delete from the RDB a message that was never there:
11/25/2007 17:46:36
[dbi :dbi :ERR /0:23788_ob_23res] [0.0.27222925] dbiWriteLogMsg: mid doesn’t exist
11/25/2007 17:46:36
[dbi :dbi :ERR /0:23788_ob_23res] [0.0.27222925] dbiWriteLogMsg: mid doesn’t exist
11/25/2007 17:46:36
[dbi :dbi :WARN/0:23788_ob_23res] [0.0.27222925] Requested to delete non-existent mid
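To tally these entries across the whole B2 log rather than reading them by eye, something like the rough Python sketch below could help. It assumes every entry has the same shape as the three lines above (a bracketed module/level/thread header followed by a bracketed mid); the regular expression and the `summarize` helper are my own illustration, not part of any engine tooling.

```python
import re
from collections import Counter

# Matches entries shaped like the excerpt above, e.g.
# [dbi :dbi :ERR /0:23788_ob_23res] [0.0.27222925] dbiWriteLogMsg: ...
# The layout is inferred from those three lines only (assumption).
LINE_RE = re.compile(
    r"^\[\s*(?P<module>\w+)\s*:\s*(?P<sub>\w+)\s*:\s*(?P<level>\w+)"
    r"\s*/\s*\d+\s*:\s*(?P<thread>[^\]]+?)\s*\]"
    r"\s*\[(?P<mid>[\d.]+)\]\s*(?P<text>.*)$"
)

def summarize(log_text):
    """Count ERR/WARN entries per (thread, mid, level)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group("level") in ("ERR", "WARN"):
            counts[(m.group("thread"), m.group("mid"), m.group("level"))] += 1
    return counts

# The excerpt from this post, used as sample input.
SAMPLE = """11/25/2007 17:46:36
[dbi :dbi :ERR /0:23788_ob_23res] [0.0.27222925] dbiWriteLogMsg: mid doesn’t exist
11/25/2007 17:46:36
[dbi :dbi :ERR /0:23788_ob_23res] [0.0.27222925] dbiWriteLogMsg: mid doesn’t exist
11/25/2007 17:46:36
[dbi :dbi :WARN/0:23788_ob_23res] [0.0.27222925] Requested to delete non-existent mid
"""

if __name__ == "__main__":
    for (thread, mid, level), n in sorted(summarize(SAMPLE).items()):
        print(thread, mid, level, n)
```

Run over the full log, the number of distinct mids should come out close to the 80 misrouted messages if the assumption about the delete failing once per message is right.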
Here’s where it gets interesting: the message ID is from the MID number wheel of Site B, but the OriginalMID is from the number wheel range of Site A.
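The same check can be scripted for all 80 messages. The sketch below is only an illustration: the two ranges are made-up placeholders (substitute each site's real number wheel range), the OriginalMID value in the example is invented, and I am assuming the counter is the last dotted field of the mid, as in `0.0.27222925`.

```python
# HYPOTHETICAL ranges: replace with each site's configured number wheel range.
SITE_RANGES = {
    "Site A": (10_000_000, 19_999_999),
    "Site B": (20_000_000, 29_999_999),
}

def mid_counter(mid):
    """Return the numeric counter (last dotted field) of a mid like '0.0.27222925'."""
    return int(mid.strip("[]").split(".")[-1])

def site_for(mid, ranges=SITE_RANGES):
    """Return the site whose range contains this mid, or None if no range matches."""
    n = mid_counter(mid)
    for site, (low, high) in ranges.items():
        if low <= n <= high:
            return site
    return None

def crossed_sites(mid, original_mid, ranges=SITE_RANGES):
    """True when the mid and the OriginalMID map to two different known sites."""
    a, b = site_for(mid, ranges), site_for(original_mid, ranges)
    return a is not None and b is not None and a != b

if __name__ == "__main__":
    # The first value is the mid from the B2 log; the second is an invented OriginalMID.
    print(crossed_sites("0.0.27222925", "0.0.12345678"))
```

Any message for which `crossed_sites` is true was assigned a new mid at one site while still carrying an OriginalMID from the other, which is the pattern described above.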
Has anyone ever seen anything like this? All signs point to the ICL thread as the culprit, but there’s no real way to diagnose that. Any other suggestions of where to look?
EDIT: I discovered that the panic on Saturday was caused by a different thread in the process, unrelated to any of these routes or threads.