CIS 20.1 PROTOCOL:tcpip mllp2 issue with large messages

  • #119807
    Jeff Dawson
    Participant

      Hello,

      In case anyone else is running CIS 20.1.1.3: we are running it on AIX 7.2 TL 4.  We are currently in the upgrade process and started validating by sending messages through our new environment.  While sending Epic PDF and RTF messages, we noticed messages getting stuck between threads that use protocol:tcpip with mllp2 encapsulation and the default timeout of 30 seconds.  We use this configuration for internal interface traffic and when we need to route messages from one site to another.  The PDF we found stuck in the client thread, holding up traffic, was 6 MB, which doesn’t seem like an unreasonable size to me.  We tried cycling the process and cycling both sides of the client/server threads, but nothing seemed to work.  Besides receiving PDFs from Epic, we have quite a few other systems that send PDFs to our engine; Epiphany, Digisonic, Varian, and Sunquest Copath reports are a few examples.

      Below is some of the logging we enabled to get a look at what was going on behind the scenes.

      [tcp :wrte:DBUG/0: TSO_trans_pb:06/13/2022 12:47:50] Start MLP v2 reply wait
      [tcp :read:ERR /0: TSO_trans_pb:06/13/2022 12:48:20] Tcp MLLP2/USER2 send timed out waiting for reply.
      [pd :pdtd:INFO/1: TSO_trans_pb:06/13/2022 12:48:20] Executing callback function for writing partial message
      [pd :pdtd:INFO/1: TSO_trans_pb:06/13/2022 12:48:20] [0.0.112195] Writing message completed
      [pd :thrd:INFO/0: TSO_trans_pb:06/13/2022 12:48:20] [0.0.112195] Requeuing undelivered message
      [msg :Msg :INFO/0: TSO_trans_pb:06/13/2022 12:48:20] [0.0.112195] Updating the recovery database
      [dbi :rlog:INFO/1: TSO_trans_pb:06/13/2022 12:48:20] [0.0.112195] Update msg in recovery db to state OB post-SMS
      [pd :thrd:INFO/1: TSO_trans_pb:06/13/2022 12:48:20] OB-Data queue has 1 msgs
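
      For anyone who wants to confirm how long the thread actually waited before giving up, here is a minimal Python sketch that pairs each "Start MLP v2 reply wait" line with the following "timed out waiting for reply" line and reports the gap.  The function name and the regular expression are my own illustration and assume the bracketed timestamp layout shown in the excerpt above; this is not a Cloverleaf tool.

```python
import re
from datetime import datetime

# Assumes the "[... :MM/DD/YYYY HH:MM:SS] text" layout seen in the excerpt above.
LINE_RE = re.compile(r"^\[[^\]]*?(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\]\s*(.*)$")


def reply_wait_seconds(lines):
    """Return the seconds between each reply-wait start and the timeout that follows it."""
    waits = []
    started = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not an engine log line
        stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
        text = match.group(2)
        if "Start MLP v2 reply wait" in text:
            started = stamp
        elif "timed out waiting for reply" in text and started is not None:
            waits.append((stamp - started).total_seconds())
            started = None
    return waits


sample = [
    "[tcp :wrte:DBUG/0: TSO_trans_pb:06/13/2022 12:47:50] Start MLP v2 reply wait",
    "[tcp :read:ERR /0: TSO_trans_pb:06/13/2022 12:48:20] Tcp MLLP2/USER2 send timed out waiting for reply.",
]
print(reply_wait_seconds(sample))  # [30.0]
```

      On the two tcp lines from the excerpt this reports 30 seconds, which matches the default timeout configured on the thread.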

      We ended up opening a ticket with Infor and have been working with support.  They had me change the timeout setting under Data Options > Encapsulated > Configure > Timeout from 30 seconds to 120.  (Note: before this we had tried bumping it up to 60 and 90 seconds with no luck; Infor then recommended 120.)  After making this change the message was finally sent, but only after a few minutes.  Also, depending on the system, the message may then be sent through another pair of threads that use this same mllp2 configuration (we do this to avoid process-to-process communication), which adds another few minutes to the delay.
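
      To illustrate what that timeout governs, here is a rough Python sketch of an MLLP v2 style sender: frame the message, send it, then wait up to a configurable number of seconds for the receiver's commit acknowledgement.  The framing bytes and the single-byte ACK/NAK reply blocks follow my understanding of the HL7 MLLP Release 2 transport specification; the function names and the 30 second default are my own, and this is not how Cloverleaf implements it internally.

```python
import socket

SB, EB, CR = b"\x0b", b"\x1c", b"\x0d"  # MLLP start block, end block, carriage return
ACK, NAK = b"\x06", b"\x15"             # MLLP v2 commit acknowledgement / negative ack


def frame(payload: bytes) -> bytes:
    """Wrap a payload in MLLP block delimiters."""
    return SB + payload + EB + CR


def send_and_wait(sock, payload, timeout_s=30.0):
    """Send one framed message and wait for the commit acknowledgement.

    Returns True on ACK and False on NAK (or any other reply).  Raises
    socket.timeout if nothing arrives within timeout_s; the timeout applies
    to each recv call, which is enough for a short reply block.
    """
    sock.settimeout(None)           # let the send itself block as long as needed
    sock.sendall(frame(payload))
    sock.settimeout(timeout_s)      # the clock starts once the message is written
    reply = b""
    while not reply.endswith(EB + CR):
        chunk = sock.recv(16)
        if not chunk:
            raise ConnectionError("peer closed the connection before replying")
        reply += chunk
    return reply == frame(ACK)


if __name__ == "__main__":
    # Local demo over a socket pair; the "receiver" queues its ACK up front.
    sender, receiver = socket.socketpair()
    receiver.sendall(frame(ACK))
    print(send_and_wait(sender, b"MSH|demo", timeout_s=1.0))
    try:
        send_and_wait(sender, b"MSH|demo", timeout_s=0.1)  # nobody answers this one
    except socket.timeout:
        print("timed out waiting for reply")
```

      In this model a receiver that is slow to read and commit a large message makes the sender time out and requeue it, which is the behavior the log above shows.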

      Back on CIS 6.0 and previous QDX versions we used PROTOCOL:pdl-tcpip with PDL: mlp2_tcp.pdl, but had switched to protocol:tcpip per Infor’s recommendation.  I tried switching back to the PDL protocol and confirmed the 6 MB PDF was sent in a few seconds.  We also tried a 44 MB PDF, which took around 20 seconds, though I figure some of that time is needed just to read the message in.

      We met with Infor and they went over another option, inter-site routing, which also appeared to work.  However, we had some issues with the setup and experienced quite a few lengthy delays when going through the configuration points in CIS 20.1.  I feel the CIS 20.1 help document for inter-site routing needs to be updated to state specifically what needs to be cycled to get this working (host server, monitor daemon, etc.); that is still unclear.  We worked through a few ICL thread errors, and I’m not sure why they occurred.  This is another option for site-to-site communication, but we would still have the issue with any tcpip threads used to avoid process-to-process communication within a site, which we use quite heavily in our interface engine.

      Infor is going to escalate this issue and test it with a few large PDF messages to see if this is an application defect.  One other note: we are currently running CIS 6.2.6.1, and this issue does not exist there even with protocol:tcpip mllp2 set to a timeout of 30 seconds, whereas in CIS 20.1 the message stays stuck in the client thread.

      Jeff

      • #119811
        Don Martin
        Participant

          Hi Jeff,

          Thanks for the post on the PDF issues you’re experiencing.  We’re about to migrate from 6.2 to 20.1.1.3 on AIX, and we handle a lot of PDFs in a similar fashion to what you describe.  Currently we use PROTOCOL:pdl-tcpip PDL: mlp2_tcp.pdl and often send these messages across different sites.  We have occasional issues with being unable to allocate memory, especially for PDFs larger than 40 MB, but overall we’re able to process PDFs without too many issues.

          I’m very interested in hearing more about Infor’s resolution for this PDF issue, and how their solution is working for you.  We’ll add updates once we get 20.1.1.3 installed on a test server, and hope you’ll continue to do the same!

          Thanks again,

          Don Martin

          Sanford Health

          • #119813
            Jeff Dawson
            Participant

              Infor released patch CIS 20.1.2.0 last week, and we went ahead and installed it as it looked like it had quite a few fixes in it.  The first thing I tested was switching our PDF connections back from the PDL protocol to the internal protocol:tcpip with a timeout of 30 seconds (the default), and from this short testing it looks like everything is working correctly once again.  No messages were timing out and hanging in recovery, and performance for both site-to-site and internal site communication looked good while trying a few different sizes of PDFs, around 11 MB.

              After this bit of testing we were going to continue testing the new CIS 20.1 version with some message volume.  I started the engine using the Perl auto-start scripts located under $HCIROOT/auto-start, and about three quarters of the way through, the script started getting forking and swap space errors, to the point that the entire system locked up and we had to hard boot it.  We are running brand new IBM Power 9 servers with 96GB of physical memory and 6GB of swap space.  Up to this point we had been testing CIS 20.1.1.3 without running into any memory constraints; these start/stop scripts had been run numerous times without any issue on CIS 20.1.1.3.  I uninstalled the 20.1.2.0 patch and reran the engine start scripts without any issue, just to have a baseline; we never ran out of physical memory during this process.

              Our AIX admins increased swap space from 6GB to 32GB.  I then reinstalled CIS 20.1.2.0 and reran the engine start script, which completed, but I noticed all physical memory was in use and up to 55% of swap was in use even after the start was complete.  Comparing hci process PIDs in a top-15 list by memory size, everything appears to be consuming about one quarter more memory, so it seems there is some major memory consumption issue going on with this patch.

              While typing this update I heard back from Infor:

              “R&D states that they are working on what may be a related issue where the installer problem was reported, according to the installer log, we suspected that installer might not fulfill the patch installation of Cloverleaf 21.1.2.0.
              That may introduce the memory problem reported in this case. Please hold on the taste of 20.1.2.0 AIX, use 20.1.1.3 instead now”

              As of right now I’m not sure what direction we are going to take.  One thing I noticed with 20.1.1.3 is that the GUI just seemed sluggish to me, e.g. when opening the testing tool and various other screens we use on a day-to-day basis.  From the limited time we had on 20.1.2.0, it felt like some of those load times were getting better.

              Jeff

              • This reply was modified 2 years, 7 months ago by Jeff Dawson.
          • #119817
            Don Martin
            Participant

              Thanks Jeff.  Again, appreciate the info.  We’ll add to this once we get an upgraded version installed.  At this point we’re looking at trying to upgrade sometime around a month from now.

              Don Martin

              Sanford Health

            • #119899
              Gotham Mullaguru
              Moderator

                Hi Jeff,

                Thanks for reporting this issue on AIX and testing/validating the fix. I’ll be posting the release of the hotfix for AIX.
