Missing report messages related to use of localhost


  • #118519
    Jeff Dawson
    Participant

      Hello,

I have an incident open with Infor, but they suggested also posting this topic to see if anyone else has ever come across this type of odd behavior. For internal engine communication we use protocol:tcp with MLLP2 encapsulation, and for the Host: setting we’ve always used localhost. Per Infor support, I’m working on changing all of our threads from localhost to the physical IP address of the server, which in our case would be the cluster or service IP address.

In our setup we have sites built out per hospital, or, if an ancillary system is big enough, it can be set up as its own site; e.g., we use Sunquest as our LIS and have a “lab” site. We use Epic as our HIS, so we have a site that houses all of our connections going to Epic. In this case we have one specific connection/endpoint going back to Epic’s Interconnect server that receives different types of base64-encoded PDFs, plus Powerscribe textual radiology reports and PACS image-available messages. For this specific process we have 17 threads: one thread is the actual connection to Epic, and the other 16 threads act as servers for internal engine communication from other CIS sites, each of which has a corresponding client thread using localhost and a unique port for each connection that needs to send data to this specific Epic connection. (Side note: we are looking at reducing this setup to just one or two threads using multi-server to cut the thread count.)

On 2/22/21 one of our analysts cycled this process using the CIS GUI, via the NetMonitor > right-clicking on the process in question > Control > Restart. A little while later our interface team started getting critical tickets from five hospitals in our WI region reporting missing Powerscribe radiology report data. When I logged in with a few other analysts, data was being received in our engine from the initial ancillary system connection (Powerscribe in this case), which we confirmed by looking in SMAT. Powerscribe was then sending data from the hospital site to the Epic site in question; however, when I looked at the two threads (client and server), both were in an “up” state, yet the last-written time on the server thread had not updated in over an hour. I confirmed hundreds of Powerscribe reports that were supposed to go to the endpoint never made it to the actual Epic endpoint connection. From what I can tell, the client/server internal engine setup was behaving like a thread set up to write to /dev/null, with messages going to the bit bucket.

I started checking other threads (that act as servers) in this process, and all the others were behaving the same way. Looking back, I wish I had contacted Infor support at that time or first turned on enable_all debugging. The knee-jerk response was to issue an hcienginestop on the process and then hcienginerun; after doing so, we immediately started receiving data from all the different CIS sites that send to our Epic site. Our team then went system by system, comparing what was received at the system level against the SMAT file on the thread that sends to Epic, to find out what was not delivered. Overall, 30+ systems were affected, resulting in hundreds of messages being resent because they had gone missing or been lost in these TCP/IP connections.

To me it looks like, when the “Restart” command was issued from the CIS GUI, something socket-related was not closed correctly, or the process was in some type of hung state. The odd part to me is that if there had been a connectivity issue, all of the messages should have queued at the client thread level and never made it to the server thread side.

I’d like to list a few things we checked:

1. Port range: all 16 of these client/server threads use ports in the 14k/15k range.
2. Ports are unique and not used by other Cloverleaf threads; each of these 16 threads has its own port (a quick verification sketch follows this list).
3. Filter procs: no filter procs are in place in our Epic site; filtering would be done at the originating site level. After issuing hcienginestop/hcienginerun on the problematic process, all the missing reports were resent at the originating site level, ruling out any filtering changes/issues on the client side.
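One more check worth running if this happens again: verify that each internal port still has a listener and look for sockets stuck half-closed. A minimal sketch using standard netstat; the two port numbers are illustrative stand-ins for the real 14k/15k ports:

#!/bin/ksh
# For each internal engine port, list the listener and any connections.
# Sockets lingering in CLOSE_WAIT or FIN_WAIT_2 would point to a socket
# that was never closed correctly. Port numbers are illustrative.
for port in 15230 15825; do
    print "== port $port =="
    netstat -an | grep "\.$port "
done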

I’ll paste in the next reply the only logging I could find at a process level. Interestingly, the process dumped this out to $HCISITEDIR/exec/errors; I’ve only seen that happen when a core dump occurs at the process level, but in this case only this log and a startup log were present and no core file was found.

I’d also like to share what Infor provided from the incident. Relating to the connection refused errors in #4 below, these appear to have occurred around the time the “Restart” command was issued. Here are a few of the log entries from the client side. All 16 of the client threads had this logging message, but I still feel like something at the socket level wasn’t right on startup, since the logging did eventually go away.


      /cis/cis6.2/integrator/imaging/exec/processes/Tepic_icrpt/Tepic_icrpt.err:[tcp :open:ERR /0: TSEt1_icrpt:02/22/2021 18:27:20] Can’t connect to localhost:15230 – Connection refused

      /cis/cis6.2/integrator/wwd/exec/processes/Tepic_rad/Tepic_rad.err:[tcp :open:ERR /0: TSEt1_icrpt:02/22/2021 18:27:05] Can’t connect to localhost:15825 – Connection refused


Looking over the logs, the only issue stated pertains to connection refused errors. Those are all network connection issues. In Support, we tend to tell our customers not to use localhost, but rather the physical IP address. Overall, Cloverleaf was stating it couldn’t see/find the ports defined.

Suggestions moving forward:
1. I have sent a note to our Service Team to see if anyone can offer suggestions on handling large PDF files using base64 encoding. I will update the incident if I get a response.
2. Instead of using localhost, use the IP address of the server.
3. Should the issue occur again, open the Network Monitor and set enable_all for the process in question. This puts the process in debug mode without having to bounce the process.
4. If you continue to see connection refused errors, I would recommend getting your Network Team involved by having them run a sniffer on the port (a sketch follows this list). They will be able to provide more information on why the port(s) are not connecting.
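For reference, a loopback capture like the one suggested in #4 might look like this with tcpdump, assuming it is available on the AIX host (iptrace is the AIX-native alternative); the interface and port values are illustrative:

# Capture traffic for one suspect internal engine port on the loopback
# interface. lo0 and 15230 are illustrative; run as root.
tcpdump -i lo0 -w /tmp/port15230.pcap port 15230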

      • #118520
        Jeff Dawson
        Participant

          Here is the log from $HCISITEDIR/exec/errors

Attachments: (log attached as a file)
        • #118526
          Jared Parish
          Participant

If I understand you correctly, I’d be interested to know what kind of ACK/reply logic you have set up between these internal threads.


            - Jared Parish

            • #118528
              Jeff Dawson
              Participant

                Hi Jared,

Back in the day we used to use protocol:pdl-tcp with tcp_acknak.pdl for our internal engine connections, but I don’t believe Infor packages those PDLs anymore; I thought Infor stated there were some possible memory leaks or performance issues. Many years ago, when protocol:tcp was introduced to Cloverleaf, we had a few analysts attending level 2 training, and they suggested we switch from the PDL to the built-in protocol:tcp with MLLP2 encapsulation. From what I understand it functions the same way, sending a low-level ACK; the message is the single ASCII ACK value (0x06), since we don’t care about AA, AE, or AR for internal communication.


I turned on enable_all to show what it looks like from a log perspective: the same start/end characters, and the hex 06 ACK is all that’s there. At this point I don’t think there’s much we can do, as recreating this issue in Test has been unsuccessful. If it happens again, I’ve let the interface team know of some debugging steps we can run through before restarting anything.

                0b 06 1c 0d |….|

                write of 0x36d80754 with length 4

                Raw batch is 4 bytes

                Length sent is 4 bytes

                Sent 4 bytes

                Raw read with length 32768

                Wanted 32768; got -1

                pdTcpip Server Thread Read Done
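For reference, those four bytes are just the MLLP2 framing around the ACK character: 0x0b start-of-block, 0x06 ACK, 0x1c end-of-block, 0x0d carriage return. A quick way to reproduce them at a shell prompt (a minimal sketch using standard printf/od):

# Emit the MLLP2-framed ACK and dump it in hex
printf '\013\006\034\015' | od -A n -t x1
# prints: 0b 06 1c 0d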


I’ve been going through old Clovertech posts looking for localhost references. When this happened, I wish I had done a netstat on a few of the problematic ports before the process was stopped and restarted. I know in our /etc/hosts file we do have loopback/localhost defined:

                127.0.0.1 loopback localhost # loopback (lo0) name/address
                ::1 loopback localhost # IPv6 loopback (lo0) name/address


Some of the localhost findings from those old posts:

                “We have LOCALHOST set up in our hosts file pointing to 127.0.0.1 as we should. However, our sys admin tells us that AIX by default first looks at the DNS to resolve names, then looks at the hosts file. This has not been a problem until someone in the network added a DNS entry that had localhost.xxx.yyy (xxx.yyy being our domain), so we were trying to connect to that IP to send the hcicmd to.

                Our solution has been to have AIX first look at the hosts file, then the DNS for name resolution. Our sys admin is looking into that. Hopefully it will not require a reboot of the server.

                DNS administrator found an entry for localhost – removed it and flushed the cache, and a few minutes later, all was well.

Thank you – the localhost definition in the network DNS was the problem; removing it and flushing the cache immediately solved our problem. None of our ports are below 8k or above 32k. And yes, when we had the problem, we had to use the command-line stop, which (eventually) used the SIGKILL.”
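Worth noting for anyone hitting the same DNS-before-hosts-file behavior: on AIX the resolution order can be forced to check /etc/hosts first. A minimal sketch, assuming the standard /etc/netsvc.conf and NSORDER controls:

# See what localhost actually resolves to right now
host localhost

# In /etc/netsvc.conf, make /etc/hosts ('local') win before DNS ('bind'):
#   hosts = local, bind

# Or per-session, via the environment:
export NSORDER=local,bind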

            • #118533
              Jeff Dawson
              Participant

Environment details I forgot to add:

CIS version 6.2.6.1 running on AIX TL2 SP6 on an IBM Power 8+ server, with plans to upgrade to CIS 20.1, hopefully in the middle of this year.

              • #118536
                Jim Kosloskey
                Participant

                  Here is my .02 worth…

I think it is possible the restart did not properly stop and start the process and all of its threads.

This could be due to the load on the process at the time. That also might be why this is not repeatable in Test and may only happen periodically in Production.

Perhaps another way of accomplishing the same end is to individually stop all of the IB threads in the site; once those have all stopped (reflected on the source sites as well), make sure the OB thread has no pending queue, then stop that thread, and when it has stopped, stop the process. Once the process is stopped (you might need to kill it), restart the process, then the OB thread, and then each of the IB threads, making sure each step is done before starting the next.

The above procedure will afford the opportunity to verify that everything is shutting down and starting back up properly. It may also help point out where the load issues might be. I think it would be safer than Restart, at least until the reason for the issue is determined and resolved.
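In that spirit, the process-level bounce can be scripted so the restart only proceeds once the engine process has actually exited, rather than trusting the GUI Restart. A minimal sketch using the hcienginestop/hcienginerun commands mentioned earlier, assuming the -p <process> form of those commands; the process name is illustrative, and the thread-level stops would still be done first:

#!/bin/ksh
# Stop the engine process and wait until it has really exited before
# starting it again. Process name Tepic_icrpt is illustrative.
PROC=Tepic_icrpt

hcienginestop -p $PROC

# Poll until no hciengine process for $PROC remains
while ps -ef | grep "hciengine.*-p $PROC" | grep -v grep > /dev/null; do
    sleep 5
done

hcienginerun -p $PROC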

                  If indeed this is load related then some time needs to be spent to determine the best way to adjust for the load. There could be a number of alternative architectures which might provide relief.

                  You just might be at or near the capacity for that site’s architecture.

I don’t think using a specific IP address will solve the issue. It might seem like it does, if this only happens under certain load and process circumstances, but I don’t think that is the real issue.

                  Just my thoughts – not necessarily steeped in any directly related experience of the same kind.

                  email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

                • #118537
                  Jeff Dawson
                  Participant

                    Thanks Jim,

Appreciate all the suggestions; the more I’ve dug into this, the more I feel the issue points toward how the application brought the process down. I have it on my list to split this specific process out to its own site, since it now only deals with sending ancillary reports that are base64-encoded PDFs to Epic’s Windows server so they can be decoded; that server then passes the HL7 message on to the corresponding Epic Bridges AIP.

Due to the size of the PDFs we get, I’ve noticed this process’s memory footprint continuously grows. We have a ksh script that monitors large CIS process memory usage, and I can confirm that the night before the process was cycled using the Restart function it was sitting at 433 MB:

                    433148 /cis/cis6.2/integrator/bin/hciengine -S epict1 -p Tepic_icrpt
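For anyone wanting the same kind of watchdog, a check like that can be as small as the following (a minimal sketch, assuming an XPG-style ps on AIX; the 400 MB threshold is illustrative):

#!/bin/ksh
# Flag hciengine processes whose virtual size (KB) exceeds a threshold.
# 409600 KB = 400 MB; adjust to taste.
THRESHOLD=409600

ps -eo vsz,args | grep '[h]ciengine' | while read vsz args; do
    if [ "$vsz" -gt "$THRESHOLD" ]; then
        print "$vsz $args"
    fi
done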

I’ll pass along the suggestions to our interface team, and once again I appreciate all the feedback provided. We do have our AIX admins looking over logs from the problematic time frame; if anything worth noting is discovered, I’ll make sure to provide an update.

                  • #118545
                    Jim Kosloskey
                    Participant

                      Perhaps periodic reclamation and reorganization of the Recovery DB is needed.

                      There have been discussions on the forum as to how this can be done in later releases (later than 6.2 perhaps) without shutting down the site.

                      I think considering moving the process to its own site is worthwhile.

I am guessing no OB queue buildup in that process was noted?

                      email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

                      • #118557
                        Jeff Dawson
                        Participant

                          Hi Jim,

I have not seen any type of outbound queue buildup in the problematic process, and I’m in the process of splitting this process out to a new site in our Test environment. We do have a monthly script that monitors the recovery/error databases and flags any DB whose size exceeds an 80 MB threshold. Since it’s 3/1, the script just ran, and the recovery DB for the site in question was flagged at 108 MB; we’ll be re-initializing it soon.
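A check like that monthly script could be as simple as a du over the site database files (a minimal sketch; the $HCISITEDIR/exec/databases path is an assumption about where the recovery/error DB files live, so adjust to your layout):

#!/bin/ksh
# Flag site databases larger than 80 MB (81920 KB).
# The path is an assumption; point it at the recovery/error DB files.
THRESHOLD=81920

du -sk $HCISITEDIR/exec/databases/* | while read kb path; do
    if [ "$kb" -gt "$THRESHOLD" ]; then
        print "FLAGGED: $path (${kb} KB)"
    fi
done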
