Ack help needed

Clovertech Forums Read Only Archives Cloverleaf Cloverleaf Ack help needed

  • Creator
    Topic
  • #54917
    Mike Strout
    Participant

      We have a system that we have been consistently sending results to for a while, but in the past week, it is constantly getting backed up. Restarting the thread clears the backup, but the backup starts again pretty quickly.

      I started saving the replies to SMAT and I don’t know what to make of the results. When the thread turns red, I pull up the OB SMAT for the thread and the SMAT for the reply. I searched the reply SMAT for the last message that is backed up, which is easy to identify because there are several copies of it, and could not find the matching reply.

      Then I restart the thread, get the reply SMAT again and this time the reply is in there. The time it is written to SMAT coincides with the time I bumped the thread.

      My config is pretty straightforward…

      On Inbound tab, Outbound Only is checked.

      On the Outbound tab, Await Replies is checked with a timeout of 60 seconds. Timeout handling is Resend. TPS Inbound Reply is check_ack. TrxID is HL7.

      Any thoughts?

    Viewing 10 reply threads
    • Author
      Replies
      • #83441
        David Barr
        Participant

          It sounds like the remote side is getting stuck and not sending an ack back. Could it be that the ack they are eventually sending back is because they get another connection and the original message is resent? There could also be a firewall or something interfering with the communication.

          Sometimes I run tcpdump to look at exactly what’s being sent and received for these connections. You can use Wireshark to view and analyze the packet capture files.

        • #83442
          Mike Strout
          Participant

            I am still wrestling with this lack of ack issue and haven’t had a chance to dig in with a protocol analyzer yet. As this is an AIX box, things are a bit more complicated than they would be on a Windows box.

            I do have a fundamental question about the ack process. My expectation is that the following should happen.

            1. Message sent, Cloverleaf puts message into Recovery DB and waits for reply.

            2. No ack is received within the timeout period, so resend the message, restart the timer, delete the original message from the recovery DB, and wait for a reply.

            3. Continue in this loop until ack is received or thread is restarted.

            4. If ack is received, expire timer, write message to SMAT, and remove message from recovery db

            5. If thread is restarted, all timers are expired. Upon restart, all messages in the recovery DB are sent using the same ack processing above.

            Assuming this is close to right, what I don’t understand is why I am getting multiple copies of messages written to the OB SMAT. My impression was that only messages that are successfully ack’ed are written to SMAT. If this is correct, then the problem must be in step 2 above where the previous copy of the now resent message is deleted from the recovery db.

            I heard a while back that the standard check_ack proc had some issues and that there was a new version that works better with Cloverleaf 6+. Can anyone confirm that and point me to it?

          • #83443
            Jim Kosloskey
            Participant

              Messages are written to the OB SMAT on an outbound thread when they are successfully sent, not when an ack is received.

              So if you reach the timeout and you are resending, another OB SMAT message will appear, and so on until an ack is received.

              Unless you have some code to match ack to original sent message ANY reply will be treated as the correct reply to the message sent.

              My guess is you are reaching timeout and doing a resend.
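
A minimal sketch of the matching described above, written in Python rather than Cloverleaf Tcl so it can be run standalone. It assumes HL7 v2 messages with `|` as the field separator and compares MSA-2 of the reply with MSH-10 of the sent message. The names `hl7_field` and `ack_matches` and the sample messages are made up for illustration; this is not what the stock check_ack proc does.

```python
def hl7_field(message, segment_id, index):
    """Return item `index` of the first `segment_id` segment after splitting on '|'."""
    for segment in message.replace("\n", "\r").split("\r"):
        fields = segment.split("|")
        if fields[0] == segment_id:
            return fields[index] if index < len(fields) else None
    return None


def ack_matches(sent, reply):
    """True only if the reply is an accept ACK for this exact outbound message."""
    # MSH-1 is the separator itself, so MSH-10 sits at split index 9.
    sent_id = hl7_field(sent, "MSH", 9)
    ack_code = hl7_field(reply, "MSA", 1)  # MSA-1, acknowledgment code
    ack_id = hl7_field(reply, "MSA", 2)    # MSA-2, control ID being acknowledged
    return sent_id is not None and ack_id == sent_id and ack_code in ("AA", "CA")


# Hypothetical sample data
sent = "MSH|^~\\&|LAB|HOSP|EMR|HOSP|20240101120000||ORU^R01|MSG00042|P|2.3\rPID|1||12345\r"
good_ack = "MSH|^~\\&|EMR|HOSP|LAB|HOSP|20240101120001||ACK|A991|P|2.3\rMSA|AA|MSG00042\r"
stale_ack = "MSH|^~\\&|EMR|HOSP|LAB|HOSP|20240101120001||ACK|A990|P|2.3\rMSA|AA|MSG00041\r"

print(ack_matches(sent, good_ack))   # ack for this message
print(ack_matches(sent, stale_ack))  # late ack for an earlier message
```

Without a check like this, a late ack for an earlier resend would be accepted as the reply to whatever message is currently waiting.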

              email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

            • #83444
              Mike Strout
              Participant

                Yes, it is pretty clear that I am reaching the timeout. Shouldn’t the previous version of the message be removed from the recovery database as part of the resend process?

              • #83445
                Jim Kosloskey
                Participant

                  Well, I am not sure how you are configured, but if you are controlling everything yourself with Tcl rather than using the inherent resend, then you should be making a copy, and that copy lingers until you get rid of it (probably upon receipt of an ack or on resend, where the copy is sent and another copy is made).

                  I suspect the embedded process does the same thing. So the copy will be in the recovery DB until an ack is received. And that is the way you want it, I suspect.


                • #83446
                  Mike Strout
                  Participant

                    Yes, I am using the standard processing and the check_ack script. It is just strange to me that I would be seeing multiple copies of the message sent outbound in the OB SMAT.

                  • #83447
                    Jim Kosloskey
                    Participant

                      Let’s say you have exceeded the timeout 5 times and you have resend configured, wouldn’t you want SMAT to reflect exactly how many messages Cloverleaf sent?

                      I do. Then I can easily determine when I am missing the timeout and perhaps either reset the timeout (if the receiving system just takes longer to respond) or get the receiving system to fix its acknowledgment process.

                      Many receiving systems cannot even tell they have received many attempts at the same message (they only see the one they acknowledged) and thus deny there is an issue. By having all of the resent messages in the OB SMAT I can show the receiving system the number of resends it took and the time interval between messages. Usually that then convinces them to take a closer look.

                      I find it very helpful when getting initial connectivity settled out with a receiving system.

                      It is also an accurate representation of the number of messages Cloverleaf has sent to the OB destination, giving more accurate load data.
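
The resend count and the interval between sends can be pulled out of the outbound messages with a few lines of script. The sketch below is in Python and assumes the OB SMAT entries have already been exported as `(send_time_in_seconds, message)` pairs; that export step, the function names, and the sample data are assumptions for illustration, not a Cloverleaf API.

```python
from collections import defaultdict


def control_id(message):
    """Return MSH-10 (message control ID) or None if there is no usable MSH segment."""
    for segment in message.split("\r"):
        fields = segment.split("|")
        if fields[0] == "MSH" and len(fields) > 9:
            return fields[9]
    return None


def resend_report(entries):
    """Map control ID -> (number of sends, gaps in seconds between consecutive sends)."""
    times = defaultdict(list)
    for sent_at, message in entries:
        times[control_id(message)].append(sent_at)
    report = {}
    for cid, stamps in times.items():
        stamps.sort()
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        report[cid] = (len(stamps), gaps)
    return report


# Hypothetical export: message A1 was sent three times, 60 seconds apart
msg_a = "MSH|^~\\&|LAB|HOSP|EMR|HOSP|20240101120000||ORU^R01|A1|P|2.3\rPID|1\r"
msg_b = "MSH|^~\\&|LAB|HOSP|EMR|HOSP|20240101120200||ORU^R01|B1|P|2.3\rPID|1\r"
entries = [(0, msg_a), (60, msg_a), (120, msg_a), (125, msg_b)]

for cid, (count, gaps) in resend_report(entries).items():
    print(cid, count, gaps)
```

Gaps that line up with the configured timeout (60 seconds in the original post) are a strong hint that each extra copy is a timeout resend.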


                    • #83448
                      Elisha Gould
                      Participant

                        Assuming the system is in the same network and not over a VPN:

                        I’ve had this issue with some applications before.

                        The likely issue is that they are not reading the socket correctly and are assuming everything arrives in one packet rather than split across multiple packets. The result is that the message is only half received and they get into a dodgy state. It's likely that if the system is written like this, they don't log properly either, so it just sits there pretending all is good.
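
For comparison, a receiver that reads the socket correctly keeps a buffer and only hands a message on once the whole frame has arrived, however many packets that takes. The Python sketch below assumes the interface uses MLLP framing (start block 0x0b, end block 0x1c 0x0d); the class name `MllpReader` is hypothetical and this is not the remote vendor's code.

```python
START = b"\x0b"     # MLLP start-of-block
END = b"\x1c\x0d"   # MLLP end-of-block followed by carriage return


class MllpReader:
    """Accumulates bytes from successive recv() calls and yields complete frames."""

    def __init__(self):
        self._buffer = b""

    def feed(self, chunk):
        """Add one chunk of received bytes; return a list of complete payloads."""
        self._buffer += chunk
        messages = []
        while True:
            start = self._buffer.find(START)
            if start == -1:
                # Nothing before a start block belongs to a frame; drop it.
                self._buffer = b""
                break
            end = self._buffer.find(END, start + 1)
            if end == -1:
                # Frame is incomplete: keep it and wait for more data.
                self._buffer = self._buffer[start:]
                break
            messages.append(self._buffer[start + 1:end])
            self._buffer = self._buffer[end + len(END):]
        return messages


# One message arriving in three pieces, as it might over a real network
payload = b"MSH|^~\\&|LAB|HOSP|EMR|HOSP|20240101120000||ORU^R01|MSG00042|P|2.3\rPID|1\r"
frame = START + payload + END
reader = MllpReader()
print(reader.feed(frame[:10]))    # incomplete, nothing returned yet
print(reader.feed(frame[10:-1]))  # still missing the final byte
print(reader.feed(frame[-1:]))    # complete message returned here
```

A receiver that instead treats each recv() result as a whole message would see only the first fragment and never produce an ack, which matches the symptom described.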

                        There are two options for this:

                        Get them to fix their code so that it handles the reading and erroring correctly.

                        Write up a proc that closes the connection when the timeout expires.

                        To do it, check “Use DRIVERCTRL control” in the protocol properties.

                        In the Timeout Handling Reply Generation field, add a proc that creates a close message and returns it with the PROTO disposition, along with PROTO on the OB message ID and KILL on the message ID passed to the proc.

                        Code:

                        proc gen_code_ob {args} {
                           keylget args MODE       aMode
                           keylget args MSGID      aMsgId
                           keylget args CONTEXT    aContext
                           keylget args ARGS       aArgs
                           global HciConnName

                           # Nothing to initialise at thread start
                           if {[string equal $aMode start]} {
                               return {}
                           }

                           if {[string equal $aMode run]} {
                               set myDispList {}

                               echo "$aMode $aMsgId $aContext"
                               switch $aContext {
                                   reply_gen {

                                       # Build a control message that tells the driver to close the connection
                                       set myMsgId [msgcreate]
                                       set myDriverCtl {}
                                       keylset myDriverCtl CLOSE 1
                                       keylset myDriverCtl WRITEZERO 0
                                       msgmetaset $myMsgId DRIVERCTL $myDriverCtl
                                       lappend myDispList "PROTO $myMsgId"

                                       # Discard the message handed to this proc
                                       lappend myDispList "KILL $aMsgId"
                                       # Send the original outbound message again, if one was supplied
                                       if {[keylget args OBMSGID myObMsgId]} {
                                           lappend myDispList "PROTO $myObMsgId"
                                       }
                                   }
                               }
                               return $myDispList
                           }

                           return {}
                        }

                      • #83449
                        Charlie Bursell
                        Participant

                          It seems to me your major problem is *NOT* SMAT but why you are re-sending.

                          First, check your timeout.  I have had to set it as high as 300 seconds because of systems that want to do a database post prior to sending an ACK.

                          Most important, open lines of communication with the vendor on the other side.  What is he seeing?

                          Get a SNIFFER!  It will always be a "he said, they said" until you can get definitive proof of what is really happening.

                          It is easy enough to write a script to remove duplicates from SMAT.  As Jim said, some people find them useful.
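
As a rough idea of what such a script could look like, here is a Python sketch that keeps only the first copy of each message, keyed on MSH-10. It assumes the SMAT messages have already been read into a list of strings (how they are exported is left open), and the function names and sample messages are illustrative only.

```python
def msh10(message):
    """Return the message control ID (MSH-10), or None if it cannot be found."""
    for segment in message.split("\r"):
        fields = segment.split("|")
        if fields[0] == "MSH" and len(fields) > 9:
            return fields[9]
    return None


def drop_resends(messages):
    """Keep the first occurrence of each control ID, preserving the original order."""
    seen = set()
    unique = []
    for message in messages:
        cid = msh10(message)
        if cid is not None:
            if cid in seen:
                continue  # a resend of something already kept
            seen.add(cid)
        unique.append(message)  # messages without a control ID are always kept
    return unique


# Hypothetical sample: A1 was resent twice before B1 went out
a1 = "MSH|^~\\&|LAB|HOSP|EMR|HOSP|20240101120000||ORU^R01|A1|P|2.3\rPID|1\r"
b1 = "MSH|^~\\&|LAB|HOSP|EMR|HOSP|20240101120200||ORU^R01|B1|P|2.3\rPID|1\r"
print(len(drop_resends([a1, a1, a1, b1])))
```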

                        • #83450
                          David Barr
                          Participant

                            Doing a packet capture on AIX shouldn’t be too difficult. I tried this on Linux, and I think it works the same on AIX.

                            You need to su or sudo to root from the command line, then run a command like this:

                             tcpdump -s 0 -w myfilename port XXXX

                            Replace XXXX with the port number that your interface is using. Let the command start running, then send a message through the interface. After a few seconds you can type ^C to interrupt the tcpdump command.

                            Copy the file that you captured over to a Windows PC and open it up with Wireshark. You should be able to right click on one of the packets in the display and select “follow TCP stream”. This will open a new window that will show all of the bytes that were sent and received.

                          • #83451
                            David Barr
                            Participant

                              Another simple thing to check (if you haven’t done this already) is your communication settings. We usually use pdl-tcpip protocol, and on the protocol properties page we use “mlp_tcp.pdl” as our PDL.
