Hello,
I have an incident open with Infor, but they suggested I also post this topic to see if anyone else has come across this type of odd behavior. For internal engine communication we use protocol:tcp with MLLP2 encapsulation, and for the Host: setting we’ve always used localhost. Per Infor support, I’m working on changing all of our threads from localhost to the physical IP address of the server, which in our case would be the cluster or service IP address.
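As a side check while moving away from localhost, it can help to see what the name actually resolves to on the engine host, since on some systems it maps to both an IPv4 and an IPv6 loopback address. This is only a minimal sketch using Python's standard library; `resolve_addresses` is a made-up helper name, port 15230 is borrowed from the log lines later in this post, and nothing here is a Cloverleaf tool or a confirmed cause of the incident.

```python
import socket


def resolve_addresses(host, port):
    """Return the distinct IP addresses the OS resolver gives for host.

    Hypothetical helper for illustration only; not part of Cloverleaf.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    for _family, _type, _proto, _canon, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


if __name__ == "__main__":
    # Run on the engine server to compare the name against the service IP.
    print(resolve_addresses("localhost", 15230))
```

If more than one address comes back, the client and server threads are not guaranteed to agree on which one they use, which is one reason an explicit IP address is easier to reason about.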
Our setup: we have sites built out by hospital, or, if an ancillary system is big enough, it can be set up as its own site. For example, we use Sunquest as our LIS and have a “lab” site. We use Epic as our HIS, so we have a site that houses all of our connections going to Epic. In this case we have one specific connection/endpoint going back to Epic’s Interconnect server that receives different types of base64-encoded PDFs as well as Powerscribe textual radiology reports and PACS image-available messages. This process has 17 threads: 1 thread is the actual connection to Epic, and the other 16 act as servers for internal engine communication. Each of the other CIS sites that needs to send data to this Epic connection has a corresponding client thread using localhost and a unique port. (Side note: we are looking at reducing this setup and having just 1 or 2 threads use multi-server to cut the thread count.)
On 2/22/21 one of our analysts cycled this process using the CIS GUI, via NetMonitor > right-click the process in question > Control > Restart. A little while later our interface team started getting critical tickets from 5 hospitals in our WI region reporting missing Powerscribe radiology report data. When I logged in with a few other analysts, data was being received in our engine from the initial ancillary system connection (Powerscribe in this case), which we confirmed by looking in SMAT. Powerscribe data was then being sent from the hospital site to the Epic site in question. However, when I looked at the two threads (client and server), both were in an “up” state, yet the last-written time on the server thread had not updated for over an hour. I confirmed that hundreds of Powerscribe reports that were supposed to go to the endpoint never made it to the actual Epic endpoint connection. From what I can tell, the client/server internal engine setup was behaving like a thread set up to write to /dev/null, i.e. the bit bucket.
I started checking the other threads that act as servers in this process, and all of them were behaving the same way. Looking back, I wish I had contacted Infor support at this point, or at least turned on enable_all debugging first. Our knee-jerk response was to issue an hcienginestop on the process and then an hcienginerun. After doing so, we immediately started receiving data from all the different CIS sites that send to our Epic site. Our team then went system by system, comparing what was received at the system level against the SMAT file on the thread that sends to Epic, to find out what was not delivered. Overall it was 30+ systems, resulting in hundreds of messages being resent because they went missing or were lost in these TCP/IP connections.
To me it looks like when the “Restart” command was issued from the CIS GUI, something socket related was not closed correctly, or the process was left in some type of hung state. The odd part to me is that if there was a connectivity issue, all of the messages should have queued at the client thread level rather than disappearing without ever making it to the server thread side.
I’d like to list a few things we checked
In the next reply I’ll paste the only logging I could find at the process level. Interestingly, the process dumped this out to $HCISITEDIR/exec/errors. I’ve only seen that happen when a core dump occurs at the process level, but in this case there was only this log and a startup log; no core file was found.
I’d also like to share what Infor provided from the incident. Relating to #4, the connection refused errors appear to have occurred during the time the “Restart” command was issued. Here are a few of the logging entries from the client side. All 16 of these client threads had this logging message, but since the logging did go away, I still feel like something at the socket level wasn’t right on startup.
/cis/cis6.2/integrator/imaging/exec/processes/Tepic_icrpt/Tepic_icrpt.err:[tcp :open:ERR /0: TSEt1_icrpt:02/22/2021 18:27:20] Can’t connect to localhost:15230 – Connection refused
/cis/cis6.2/integrator/wwd/exec/processes/Tepic_rad/Tepic_rad.err:[tcp :open:ERR /0: TSEt1_icrpt:02/22/2021 18:27:05] Can’t connect to localhost:15825 – Connection refused
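With 16 client threads logging the same message, a quick tally of which host:port pairs were refused, and when, makes it easier to line the errors up against the restart time. The following is a rough sketch only: the regular expression is inferred from the two lines above rather than from any documented Cloverleaf log format, `summarize_refused` is a made-up name, and the sample lines are the ones above with the apostrophe and dash written in plain ASCII (the pattern accepts either form).

```python
import re
from collections import Counter

# Pattern inferred from the two sample lines in this post; it is an
# assumption, not a documented log format.
LINE_RE = re.compile(
    r"^(?P<file>[^:]+\.err):"
    r"\[tcp\s*:open:ERR\s*/\d+:\s*(?P<thread>[^:]+):"
    r"(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\]\s*"
    r"Can.t connect to (?P<host>[^:\s]+):(?P<port>\d+)\s+\S\s+Connection refused"
)


def summarize_refused(lines):
    """Count connection-refused entries per (host, port)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("host"), int(match.group("port")))] += 1
    return counts


SAMPLE = [
    "/cis/cis6.2/integrator/imaging/exec/processes/Tepic_icrpt/Tepic_icrpt.err:"
    "[tcp :open:ERR /0: TSEt1_icrpt:02/22/2021 18:27:20] "
    "Can't connect to localhost:15230 - Connection refused",
    "/cis/cis6.2/integrator/wwd/exec/processes/Tepic_rad/Tepic_rad.err:"
    "[tcp :open:ERR /0: TSEt1_icrpt:02/22/2021 18:27:05] "
    "Can't connect to localhost:15825 - Connection refused",
]

if __name__ == "__main__":
    for (host, port), count in sorted(summarize_refused(SAMPLE).items()):
        print(f"{host}:{port} refused {count} time(s)")
```

The input would typically be the output of a grep across the client processes' .err files, which is what produces the file name prefix on each line.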
Looking over the logs, the only issue stated pertained to connection refused. Those are all network connection issues. In Support, we tend to tell our customers not to use localhost, but rather the physical IP address. Overall, Cloverleaf was stating it couldn’t see/find the ports defined.
Suggestions moving forward:
1. I have sent a note to our Service Team to see if anyone can offer suggestions on handling large PDF files using base64 encoding. I will update the incident if I get a response.
2. Instead of using localhost, use the IP address of the server.
3. Should the issue occur again, open the Network Monitor and set enable_all for the process in question. This puts the process in debug mode without having to bounce the process.
4. If you continue to see connection refused errors, I would recommend getting your Network Team involved by having them run a sniffer on the port. They will be able to provide more information on why the port(s) are not connecting.
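Before pulling in the network team, a simple probe run on the engine server can show whether a given server thread's port is accepting TCP connections at all, and whether a failure is an active refusal (nothing listening) or a timeout (something dropping packets). This is a minimal sketch using only Python's standard library; `probe_port` is a hypothetical helper, and the host and port in the usage example are taken from the log lines above purely for illustration.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Try a TCP connect and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    Hypothetical helper for illustration; not a Cloverleaf utility.
    """
    try:
        # The connection is closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # Host and port are examples taken from the log entries in this post.
    print(probe_port("localhost", 15230))
```

Note that an "open" result only shows that something accepted the TCP connection. In the incident described here both threads showed as “up” while no data was flowing, so a successful connect would not by itself prove the server thread is processing messages. Also be aware that the probe briefly connects to and disconnects from the server thread, which may show up in that thread's own logging.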