Duplicate State 14 Messages

This topic has 19 replies, 8 voices, and was last updated 17 years, 3 months ago by Rob Abbott.

Creator

Topic
February 27, 2008 at 7:06 pm #49864
Mary Kobis
Participant
We just upgraded the last of our sites to Version 5.6 (AIX 5.3). We hit a fixable snag: two duplicate state 14 messages were in the recovery DB at the same time. We never worried about this in past versions because there were never any problems that made us look at the recovery DB for state 14 messages. I’m pretty confident that we have recover_33 in place correctly, and I wouldn’t think it is normal to have two duplicate state 14 messages for a thread in the recovery DB. Or am I wrong?

I found this situation when I tried to cycle a thread that was connected but had multiple pending messages that were not processing. The engine died when the thread cycled back up. The panic error was:

PANIC: Thread panic—engine going down

PANIC: assertion ‘(ptd)->ptd_msg_delivered_ok == ((Message *) 0)’ failed at protocolThread.cpp/825

I sent the panic off to Support (thanks, Dave) and they confirmed what I did: remove the state 14s from the DB and restart. We are doing OK today.

Thanks in advance,

Mary.

Viewing 18 reply threads

Author

Replies
- February 27, 2008 at 7:16 pm #63909
  Todd Lundstedt
  Participant
  Aaaah.. the old multi-state-14 problem. If things are working correctly, there should only be one state 14 message in the recovery DB for a particular thread at a time, and not for very dang long, either.
  
  We fought that at our site for YEARS… had procedures to check for it prior to process startup and everything.
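
A pre-startup check of the kind Todd describes could be sketched roughly as follows. This is a minimal Python illustration, not Cloverleaf tooling: the function name `duplicate_state14_threads` is hypothetical, and the row layout (`<count> <status> <source> <target>`) is an assumption based on the recovery database summary pasted later in this thread.

```python
from collections import Counter


def duplicate_state14_threads(summary_lines):
    """Return {target_thread: count} for threads holding more than one state 14 message.

    Assumed row layout (not an official format): <count> <status words...> <source> <target>,
    for example "1 14-OB delivered test_in test_out".
    """
    per_thread = Counter()
    for line in summary_lines:
        parts = line.split()
        # Data rows start with a message count; skip headers, rulers and totals.
        if len(parts) < 4 or not parts[0].isdigit():
            continue
        count, status, target = int(parts[0]), parts[1], parts[-1]
        if status.startswith("14-"):
            per_thread[target] += count
    return {thread: n for thread, n in per_thread.items() if n > 1}


sample = [
    "Messages  Status  Source  Target",
    "1 14-OB delivered test_in test_out",
    "1 14-OB delivered test_out test_out",
    "87 11-OB post-SMS test_in test_out",
]
print(duplicate_state14_threads(sample))
```

A non-empty result would be the signal to clean up the extra state 14 messages before starting the process, which is the condition that led to the panic in the original post.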
- February 27, 2008 at 7:40 pm #63910
  Mary Kobis
  Participant
  I’m only seeing this during the day when message throughput is at its highest. It calms down after hours and you don’t see the duplicates. These are outbound threads being fed by an SNA connection from the mainframe. Thanks for the info. Mary… oh, and we are on AIX 5.2, not 5.3.
- February 27, 2008 at 8:22 pm #63911
  Tom Rioux
  Participant
  We are having a similar issue with a tcp/ip connection to Ichart. There are multiple state 14 messages that are lingering around in the RDB. It doesn’t create a problem unless we have to bounce the process and it panics on restart. We do have the recovery procs in place and so far this is the only thread that is having this issue.
  
  Another issue we are having on several threads is that even with the recover procs in place, somehow ACK messages are getting through and are treated as inbound data and not reply. Yes, we do have await replies on and the recovery procs in place. Not sure what else to check. It’s not causing a problem other than filling up the error database.
  
  Any suggestions for things to check?
- February 27, 2008 at 9:11 pm #63912
  Todd Lundstedt
  Participant
  Mary and Thomas,
  
  When we upgraded to 5.5, we had some discussions with a few folks in Quovadx support and they delivered a newer set of SNA procs than we had. It’s been long enough now that I don’t recall the names of the old procs, but the new procs are all contained in one .tcl file, use parms to help identify connections, etc.
  
  I am reluctant to post the entire proc, because I don’t know if it is a charged-for item or not. But here are the upper comments from the proc. If you don’t have this proc set, you might want to contact your rep and/or support to see if you can get them.
  
      # RTIF – a series of procedures to send to mainframe using RTIF protocol
      # Normally SMS
      #
      # Required procedures:
      # StartSMS     Protocol Startup Procedure
      # checkSMS     Inbound Reply Procedure
      # resendSMS    Reply Generation Procedure
      # sendokSMS    Send DATA OK Procedure
      # writeSMS     Pre-Write Procedure
      #
      #
      # Optional procedures:
      # deallocSMS   TPS OB Procedure
  
  The comments go on for a LONG time.
  
  Good luck!
- March 13, 2008 at 2:01 pm #63913
  Max Drown (Infor)
  Keymaster
  Thomas Rioux wrote:
  
  Another issue we are having on several threads is that even with the recover procs in place, somehow ACK messages are getting through and are treated as inbound data and not reply. Yes, we do have await replies on and the recovery procs in place. Not sure what else to check. It’s not causing a problem other than filling up the error database.
  
  Any suggestions for things to check?
  
  I’m seeing this as well.
  
  -- Max Drown (Infor)
- March 13, 2008 at 2:55 pm #63914
  Jim Kosloskey
  Participant
  Tom and Max,
  
  The few times I experienced this the issue was with the receiving system.
  
  In the scenario I experienced, the receiving system was sometimes sending 2 acks in a row. The first ack is treated as expected since the ‘Await Reply’ switch was thrown by Cloverleaf(R). When the switch is thrown and a message is received (the first ack), that message is labelled ‘REPLY’ and the switch is thrown off. That is how Cloverleaf(R) knows a message is a reply: the ‘Await Reply’ switch is thrown. When the second ack arrives the switch is off, so any message inbound on this outbound thread is labelled ‘DATA’.
  
  Since the second ack is ‘DATA’, Cloverleaf(R) attempts to route it and I am betting you are getting routing errors – at least that is what I recall seeing.
  
  Since this does not happen all the time it is difficult to prove to the receiving system what they are doing. SMAT for the inbound messages can assist in troubleshooting.
  
  Obviously getting the receiving system to fix their problem (if it is their problem) is the best way to address this.
  
  However, I suspect you could eliminate the errors if they are routing errors by routing all inbound messages on the outbound thread back to the outbound thread and killing them. But that is attempting to cure the symptom not the disease and would not be my preferred way of proceeding.
  
  Jim Kosloskey
  
  email: jim.kosloskey@jim-kosloskey.com 30+ years Cloverleaf, 60 years IT – old fart.
- March 14, 2008 at 1:08 pm #63915
  Michael Hertel
  Participant
  Isn’t checking “outbound only” supposed to treat all incoming data on an outbound thread as reply data?
- March 14, 2008 at 1:42 pm #63916
  Jim Kosloskey
  Participant
  Michael,
  
  In the scenario I described I had the ‘Outbound only’ box checked.
  
  It has been a while since I experienced the scenario I described but as I recall the ‘Await Reply’ switch being thrown was the determining factor based on what I saw in the log with the noise level all the way up.
  
  Of course, there could have been a Cloverleaf(R) bug in the release when I experienced the double acks. The double acks were the receiving system’s problem, and when the receiving system corrected it, all was well.
  
  I think the key is if it is a routing error one gets on the ack – that is a pretty good indication the ack was treated as ‘DATA’.
  
  I think that might be shown on the full-length display of the message from the Error DB which includes the metadata.
  
  Jim Kosloskey
  
  email: jim.kosloskey@jim-kosloskey.com 30+ years Cloverleaf, 60 years IT – old fart.
- March 14, 2008 at 2:59 pm #63917
  Rob Abbott
  Keymaster
  Michael Hertel wrote:
  
  Isn’t checking “outbound only” supposed to treat all incoming data on an outbound thread as reply data?
  
  Only if the engine is in “await reply” state. If the engine is not waiting for a reply, any messages coming in will be discarded if “outbound only” is checked.
  
  Rob Abbott
  Cloverleaf Emeritus
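
The labelling rules described by Jim and Rob above can be expressed as a toy model. This is only an illustration of the behavior as stated in this thread, not the engine's implementation; the names `classify_inbound` and `simulate_acks` and the label strings are hypothetical.

```python
def classify_inbound(awaiting_reply, outbound_only):
    """Label one inbound message on an outbound thread.

    Returns (label, awaiting_reply_after); label is 'REPLY', 'DATA' or 'DISCARD'.
    """
    if awaiting_reply:
        # The first message to arrive while waiting is the reply; the switch resets.
        return "REPLY", False
    if outbound_only:
        # Not waiting for a reply and 'outbound only' is checked: drop the message.
        return "DISCARD", False
    # Not waiting and 'outbound only' is unchecked: treated as data and routed.
    return "DATA", False


def simulate_acks(ack_count, outbound_only):
    """Send one message, then receive ack_count acks back to back."""
    awaiting = True  # set when the outbound message is sent
    labels = []
    for _ in range(ack_count):
        label, awaiting = classify_inbound(awaiting, outbound_only)
        labels.append(label)
    return labels


print(simulate_acks(2, outbound_only=False))  # second ack becomes DATA
print(simulate_acks(2, outbound_only=True))   # second ack is discarded
```

In this model a receiving system that sends two acks for one message produces a routing error only when ‘Outbound only’ is unchecked, which matches the symptom of acks filling up the error database.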
- March 14, 2008 at 3:39 pm #63918
  Jim Kosloskey
  Participant
  Rob,
  
  Thanks – that jogged my memory.
  
  The situation I experienced was when there was a bug in Cloverleaf(R) wherein the timing of the ‘Await Reply’ switch being reset in relation to an inbound message left a sufficient window of opportunity that, if the receiving system sent back-to-back acks, sometimes the second one got through and was treated as ‘DATA’.
  
  I recall now that bug was fixed a long time ago.
  
  So if the acks are getting errored for routing issues, I would make sure the ‘Outbound only’ box was checked on the thread definition.
  
  Of course, if it is not checked, you at least can see that the receiving system may have what I would consider a problem if indeed it is sending 2 acks for one message periodically.
  
  Jim Kosloskey
  
  email: jim.kosloskey@jim-kosloskey.com 30+ years Cloverleaf, 60 years IT – old fart.
- March 25, 2008 at 3:08 pm #63919
  Max Drown (Infor)
  Keymaster
  Work around for the duplicate state 14 messages: https://usspvlclovertch2.infor.com/viewtopic.php?t=2640
  
  -- Max Drown (Infor)
- March 25, 2008 at 4:03 pm #63920
  Mary Kobis
  Participant
  We have it in place and working… Mary.
- March 25, 2008 at 4:07 pm #63921
  Max Drown (Infor)
  Keymaster
  Any thoughts on how to test the fix? In my case, the receiving app has already fixed their problem, so I’d have to find a way to recreate the duplicate state 14 on my own.
  
  -- Max Drown (Infor)
- March 25, 2008 at 4:58 pm #63922
  Rob Abbott
  Keymaster
  Create a thread with whatever recovery configuration you want to test (built-in, recover_56, recover_33). Have the thread send a message to a test port (hcitcptest listener, maybe).
  
  Shut the thread down while it’s waiting for a reply.
  
  If things are configured correctly, you will have 1 message in state 14 for that thread. If you’re using recover_33, you will see 2 messages in state 14 for the thread.
  
  Hope this helps.
  
  Rob Abbott
  Cloverleaf Emeritus
- March 26, 2008 at 2:14 pm #63923
  Bob Richardson
  Participant
  Greetings,
  
  Just for planning purposes and future compatibility: if we deploy the new recover_56 procs to replace our current recover_33 implementation and then apply the future REV1 patch to the CIS5.6 engine, will we need to de-install the recover_56 procs? It appears that we should not have to do this, but I am interested in avoiding unnecessary work as we have about 150+ outbound threads with recover_33 now in our existing 5.3 engine.
  
  [We have planned to move to 5.6 this year].
  
  Please confirm and thanks!
- March 26, 2008 at 3:21 pm #63924
  Max Drown (Infor)
  Keymaster
  Rob Abbott wrote:
  
  Create a thread with whatever recovery configuration you want to test (built-in, recover_56, recover_33). Have the thread send a message to a test port (hcitcptest listener, maybe).
  
  Shut the thread down while it’s waiting for a reply.
  
  If things are configured correctly, you will have 1 message in state 14 for that thread. If you’re using recover_33, you will see 2 messages in state 14 for the thread.
  
  Hope this helps.
  
  Here’s how I simulated the test.
  
  01) I created 4 mlp tcp/ip threads, test_send –> [test_in-raw->test_out] –> test_recv
  
  [Image: http://www.planetdrown.com/images/cloverleaf_dup14_01.jpg]
  
  02) I configured test_out for Resend OB Data and check_ack from recover_56. Other than check_ack, I didn’t use any other recover_56 proc.
  
  03) I configured test_recv to not send any ACKs.
  
  04) I sent 87 hl7 messages to ob_pre_tps on test_send. Here is a snap shot of the database.
  
  Code:

      Total messages pending in the site’s Queue: 89
      Messages  Status                     Source                  Target
      ========= ========================== ======================= =======================
      1         16-PR2229 unbacked queue   test_in                 test_out
      1         14-OB delivered            test_in                 test_out
      87        11-OB post-SMS             test_in                 test_out

      Total messages in the error database: 3
      Messages  Status                               Source                  Target
      ========= ==================================== ======================= =======================
      3         101-Unsupported Trxid                test_recv
  
  05) I brought down test_out. There was no change in the database snapshot.
  
  06) I brought up test_out. As expected, the process did not panic (as there were no duplicate state 14 messages). There was no change in the database snapshot.
  
  07) I then configured test_out for sendOK_save and resend_ob_date from recover_56 and conducted the same test. Here is the database snapshot.
  
  Code:

      Messages  Status                     Source                  Target
      ========= ========================== ======================= =======================
      1         14-OB delivered            test_out                test_out
      87        11-OB post-SMS             test_in                 test_out

      Total messages in the error database: 2
      Messages  Status                               Source                  Target
      ========= ==================================== ======================= =======================
      2         101-Unsupported Trxid                test_recv
  
  08) I observed the same results with recover_56 scripts. No panic. No duplicate state 14 messages.
  
  Did I conduct the test properly? Is there a better way to do it?
  
  -- Max Drown (Infor)
- March 26, 2008 at 4:06 pm #63925
  Rob Abbott
  Keymaster
  Robert H Richardson wrote:
  
  Greetings,
  
  Just for planning purposes and future compatibility: if we deploy the new recover_56 procs to replace our current recover_33 implementation and then apply the future REV1 patch to the CIS5.6 engine, will we need to de-install the recover_56 procs? It appears that we should not have to do this, but I am interested in avoiding unnecessary work as we have about 150+ outbound threads with recover_33 now in our existing 5.3 engine.
  
  [We have planned to move to 5.6 this year].
  
  Please confirm and thanks!
  
  We will do our very best to make any fix compatible with the workarounds in the technical bulletin (including recover_56).
  
  But your best option for migration is to change your outbound threads to use the automatic resend feature and remove any sort of recover procs. This would involve changing thread configuration and any IB REPLY “check ack” procedures you have.
  
  Rob Abbott
  Cloverleaf Emeritus
- March 26, 2008 at 4:26 pm #63926
  Max Drown (Infor)
  Keymaster
  Rob Abbott wrote:
  
  This would involve changing thread configuration and any IB REPLY “check ack” procedures you have.
  
  check_ack is not needed?
  
  -- Max Drown (Infor)
- March 26, 2008 at 4:55 pm #63927
  Rob Abbott
  Keymaster
  “check_ack” type procedures are only needed if you are validating the reply and want to do things like resend the original OB message based on AR or AE.
  
  If you are simply killing the reply, you can use hcitpsmsgkill in IB TPS. No check_ack type procedure necessary.
  
  This applies to 5.6. If you are on an earlier release, you need a kill procedure that does two things: kill the reply and clean up the saved OB message.
  
  Rob Abbott
  Cloverleaf Emeritus
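
To illustrate what "validating the reply" amounts to, here is a rough Python sketch of the decision a check_ack type procedure makes. It is not the recover_56 code (which is Tcl and not shown in this thread); the function names and the action strings `KILL_REPLY`, `RESEND_ORIGINAL` and `ERROR` are hypothetical. The only HL7 facts relied on are that the acknowledgment code is the first field of the MSA segment and that AA, AE and AR mean accept, error and reject.

```python
def ack_code(hl7_ack):
    """Return MSA-1 (the acknowledgment code) from an HL7 v2 ACK, or None if absent."""
    # Segments are normally separated by carriage returns; tolerate newlines too.
    for segment in hl7_ack.replace("\n", "\r").split("\r"):
        fields = segment.split("|")
        if fields[0] == "MSA" and len(fields) > 1:
            return fields[1]
    return None


def reply_action(hl7_ack):
    """Decide what a validating reply procedure might do with an ACK (assumed policy)."""
    code = ack_code(hl7_ack)
    if code == "AA":
        return "KILL_REPLY"       # accepted: nothing more to do with the reply
    if code in ("AE", "AR"):
        return "RESEND_ORIGINAL"  # error or reject: resend the saved OB message
    return "ERROR"                # missing or unexpected code: flag for review


ack = "MSH|^~\\&|RECV|FAC|SEND|FAC|200803261655||ACK|1|P|2.3\rMSA|AR|1"
print(reply_action(ack))
```

If every branch would simply kill the reply, this logic adds nothing, which is Rob's point about using hcitpsmsgkill on 5.6 instead.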

The forum ‘Cloverleaf’ is closed to new topics and replies.