Garry,
OK this is what I understand:
Messages are flowing…
Virtually no delay between sending of message and receipt of reply…
The thread is configured to time out at 10 minutes (600 seconds) while waiting for a reply and when the timeout does occur, just send the next message (do NOT resend the message for which a reply is pending). The thread is also configured to try to reconnect every 5 seconds if it becomes disconnected.
Suddenly the error expressed earlier occurs…
It now appears a reply does not arrive within the 10 minute wait period (or more)…
With the above in mind,
Does the log indicate the engine has attempted to reconnect (or do you not have the ‘noise’ level up that high)?
Do you have SMAT for both outbound and inbound on the outbound thread? If yes, do you have a way to match up the replies with the messages sent (hopefully a unique Control ID in the MSH or something like that)?
Can you verify that every message you have sent actually was logged on the receiving system?
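To make the matching question concrete, here is a rough Python sketch of pairing replies with sent messages. It assumes the messages and replies have already been exported from SMAT as plain HL7 v2 text (it does not read SMAT files itself), that each sent message carries a unique Control ID in MSH-10, and that each reply echoes that ID in MSA-2. The function names are made up for illustration.

```python
def segments(message):
    """Split an HL7 v2 message into segments (CR-delimited; LF tolerated)."""
    return [s for s in message.replace("\n", "\r").split("\r") if s]


def field(segment, index):
    """Return the pipe-delimited piece at position index, or '' if absent."""
    parts = segment.split("|")
    return parts[index] if len(parts) > index else ""


def control_id(message):
    """MSH-10 of a sent message. MSH-1 is the separator itself, so MSH-10 is piece 9."""
    for seg in segments(message):
        if seg.startswith("MSH"):
            return field(seg, 9)
    return ""


def acknowledged_id(reply):
    """MSA-2 of a reply: the Control ID of the message being acknowledged."""
    for seg in segments(reply):
        if seg.startswith("MSA"):
            return field(seg, 2)
    return ""


def unanswered(sent_messages, replies):
    """Control IDs of sent messages for which no reply was found, in send order."""
    acked = {acknowledged_id(r) for r in replies}
    return [control_id(m) for m in sent_messages if control_id(m) not in acked]
```

Any Control IDs that come back from `unanswered` are the ones worth looking for on the receiving system.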
What I am thinking might be happening:
Messages flowing…
Connection severed by receiving system…
Thread goes down…
After 5 seconds, engine attempts reconnect…
Reconnection occurs – but – the receiving application is not alive…
Since waiting for a reply, wait for up to 10 minutes…
Since receiving application is not alive, no reply for 10 minutes (meanwhile more messages arriving and getting queued)…
Next message sent (means potentially the previous message is really ‘lost’)…
Wait another potential 10 minutes for reply (this could be the 20 minutes total observed)…
Time out occurs and next message is sent…
and so on until receiving application revives.
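The sequence above can be sketched as a small Python simulation. This is only a model of my guess, not of how the engine actually works: it assumes the socket reconnects right away, the receiving application stays dead for `outage_seconds`, replies are instant once it is alive, and a message whose reply times out is skipped rather than resent. All names are hypothetical.

```python
def simulate(num_messages, outage_seconds, reply_timeout=600):
    """Return (message_number, send_time_seconds, got_reply) for each queued message."""
    clock = 0
    timeline = []
    for number in range(1, num_messages + 1):
        app_alive = clock >= outage_seconds
        timeline.append((number, clock, app_alive))
        if not app_alive:
            # No reply arrives, so the thread waits out the full timeout
            # and then moves on to the next message without resending.
            clock += reply_timeout
    return timeline


for number, sent_at, got_reply in simulate(4, outage_seconds=1200):
    status = "reply received" if got_reply else "no reply (possibly lost)"
    print(f"message {number} sent at {sent_at:>5}s: {status}")
```

With a 20 minute outage, the first two messages each burn a full 10 minute timeout and go unanswered, and traffic only resumes with the third.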
Of course that would mean that you should see the engine attempting to reconnect every 5 seconds after the error has occurred, provided the engine noise level is set high enough.
If you happen to be physically monitoring when the error occurs, you should also see the thread temporarily cycle down and then eventually back up. However, you have indicated the thread just stays up.
I would also expect to see the count of messages out exceed the count of messages in. Again, with the SMAT for both messages and replies and a mechanism for tying the messages together, the pattern could be analyzed further.
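As one way of analyzing the pattern, the Python sketch below flags sent messages that never got a reply and were followed by a gap of at least the reply timeout before the next send. It assumes you can pull a Control ID and a send time (in seconds) for each outbound message, plus the set of Control IDs that were acknowledged; how you extract those is up to you, and the names here are hypothetical.

```python
def timeout_pattern(sent, acked_ids, reply_timeout=600):
    """sent: list of (control_id, send_time_seconds) in send order.
    acked_ids: set of Control IDs for which a reply was seen."""
    report = []
    following = sent[1:] + [None]  # the message sent next, if any
    for (cid, sent_at), nxt in zip(sent, following):
        gap = None if nxt is None else nxt[1] - sent_at
        answered = cid in acked_ids
        report.append({
            "control_id": cid,
            "answered": answered,
            "gap_to_next_send": gap,
            # Unanswered and followed by a full timeout-length pause.
            "looks_like_timeout": (not answered and gap is not None
                                   and gap >= reply_timeout),
        })
    return report


def count_difference(sent, acked_ids):
    """How many more messages went out than replies came back."""
    return len(sent) - sum(1 for cid, _ in sent if cid in acked_ids)
```

A run of entries with `looks_like_timeout` set, spaced about 600 seconds apart, would fit the scenario I described.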
If you want to email me directly, maybe we can get more specific.
Thanks,
Jim Kosloskey
email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.