Extreme Slowness with CL 6.1.2 Windows

This topic has 8 replies, 5 voices, and was last updated 8 years, 11 months ago by Mike Kim.

August 16, 2016 at 4:06 pm #55161
Mike Kim
Participant
We recently upgraded our test site from CL 5.7 on Linux to CL 6.1.2 on Windows and are having extreme slowness processing messages. Server is Windows 2012 R2, 4 processor 16GB RAM (way more horsepower than what we have for the Linux production machine). CPU and disk utilization is very low. But messages queue up and take forever to even process through an Xlate and write to a file. Tried switching back from SMAT DB to old school SMAT files. That didn’t help. We uninstalled all the anti-virus software. No improvement. Support is through McKesson and they’re stumped.

It’s so slow that we keep getting these errors from cl_check_ack on the receiving threads:

‘KILL ‘ (returned by ‘cl_check_ack ‘) does not match { }

[pd :pdtd:WARN/0: to_xxxx:08/16/2016 11:01:17] Timed out while awaiting replies on thread. Resending reserved OB Message

Any thoughts/feedback greatly appreciated!

Replies
- August 16, 2016 at 11:59 pm #84380
  Elisha Gould
  Participant
  Where are you using cl_check_ack? It can only be used in the TPS Inbound Reply.
  
  Also not 100% convinced about the code for that proc. Should the OBMSGID message be killed? I’m thinking the handling for this case changed in 5.8 (could be wrong).
  
  To accept the reply and process the next message:
  
  return "{KILLREPLY $mh}"
  
  To resend the message:
  
  return "{KILL $mh}"
- August 17, 2016 at 3:35 am #84381
  Charlie Bursell
  Participant
  The error indicates you are getting a message in the reply proc the engine says does not exist – no message handle. Do you have OB only set? You may be seeing something after the timeout. Weird!
  
  I hope you do not have Await Replies on when writing to a file! That would take forever 😀 If writing to TCP/IP have your IT guys put a sniffer on it and see what is really happening.
  
  I wrote the proc in question and it did not change. OBMSGID holds the handle of the original message sent, in case you need to resend. Of course it must be disposed of, either via a KILL or via a PROTO to resend.
  
  Elisha:
  
  You cannot issue KILL for the reply handle. To do so would lock up the engine. The difference between KILL and KILLREPLY is that in addition to disposing of the handle it clears the await reply flag so the next message can be sent.
  
  We usually store the OBMSGID in a variable, “my_mh”, so the proper paradigm for your cases would be:
  
  To accept the reply and process the next message:
  
  return "{KILLREPLY $mh} {KILL $my_mh}"
  
  To resend the message:
  
  return "{KILLREPLY $mh} {PROTO $my_mh}"
- August 18, 2016 at 12:21 am #84382
  Elisha Gould
  Participant
  Ahh yes, you're right Charlie.
  
  I forgot the reason that we use KILL is to ensure that we don’t flood the downstream system with messages and cause more issues with filled-up logs.
  
  If KILLREPLY/PROTO is used in sms_ib_reply, it will resend immediately, with no timeout before resending.
  
  If KILL is used, it will go to the Timeout Handling, so we have a proc in the Reply Generation to handle the resending.
- August 18, 2016 at 2:10 am #84383
  Russ Ross
  Participant
  Mike Kim you are welcome to call me and discuss.
  
  We saw inexplicable slowdowns and bottlenecks when upgrading from Cloverleaf 5.6 to Cloverleaf 6.0 on the same hardware and AIX 6.1 OS.
  
  We did several things that ended up being band aids but did help.
  
  One significant speed improvement was upgrading the OS from AIX 6.1 to AIX 7.1 in place, which actually worked without any observed issue.
  
  My suspicion is that one culprit might have been that our check_reply proc did not look at OBMSGID (which I think is state 16).
  
  What we had in place from before was dealing with messages in state 14.
  
  When Viken held our hand through our Epic go-live, I had him add the logic to work both the old way and the new way with OBMSGID, which also utilizes a different NetConfig setup than our old way of implementing check reply.
  
  One easy way to determine if your check_reply proc is current is to grep it for OBMSGID and see if you get any hit at all.
  
  Russ Ross
  RussRoss318@gmail.com
- August 18, 2016 at 2:21 am #84384
  Russ Ross
  Participant
  Now having said that bit about OBMSGID being part of the new improved check_reply, let me say I have seen the sort of error you mentioned often enough I recall it.
  
  I’m talking about
  
  ‘KILL ‘ (returned by ‘cl_check_ack ‘) does not match { }
  
  even when the reply handling is properly configured.
  
  The most common cause of this has been an extra new-line character after the ending message encapsulation and before the beginning message encapsulation of the next message.
  
  This situation confuses the check_reply proc, and if it results in a timeout and resend, message flow could become horribly slow, since this might happen for every message.
  
  Russ Ross
  RussRoss318@gmail.com
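  
  For readers who want to check a captured stream for the stray-newline condition described in this reply, here is a minimal sketch in plain Tcl. It is not part of any Cloverleaf proc. It assumes standard MLLP framing (a VT byte, the message, then FS followed by CR), and `find_stray_bytes` is a hypothetical helper name.
  
  ```tcl
  # Hypothetical checker: return a list of byte runs that sit between frames.
  # Assumes MLLP framing: <VT> message <FS><CR>.
  proc find_stray_bytes {stream} {
      set stray {}
      set pos 0
      set len [string length $stream]
      while {$pos < $len} {
          # Locate the next start-of-block (VT) at or after the current position.
          set start [string first \v $stream $pos]
          if {$start < 0} { set start $len }
          if {$start > $pos} {
              # Anything before the next VT belongs to no frame.
              lappend stray [string range $stream $pos [expr {$start - 1}]]
          }
          # Skip past the end-of-block pair (FS CR) that closes this frame.
          set end [string first \x1c\r $stream $start]
          if {$end < 0} { break }
          set pos [expr {$end + 2}]
      }
      return $stray
  }
  
  # Two frames with an extra newline between them.
  set stream "\vMSH|a\x1c\r\n\vMSH|b\x1c\r"
  puts "stray runs found: [llength [find_stray_bytes $stream]]"
  ```
  
  An empty result means every byte in the capture sits inside a frame. Any non-empty entry is a candidate for the kind of leftover character that can trip up the reply proc.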
- August 30, 2016 at 1:32 am #84385
  Mike Kim
  Participant
  Thanks Charlie, Elisha and Ross for thoughtful feedback. Everyone is still stumped. We are kind of in a bind because support is through McKesson and they’ll only support production, not test.
  
  cl_check_ack is standard version with a slight modification I made in April to not assume MSA segment was 2nd segment in message (we had vendor sending SFT segment in front of MSA). To find the MSA index, I split the segments into a list and do an lsearch.
  
  I misspoke about where cl_check_ack is deployed. It is configured on sending, not receiving threads. Messages take several minutes to make it through the engine with almost no activity.
  
  Any ideas greatly appreciated!
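  
  The MSA lookup described in this reply (split the segments into a list, then lsearch) could look roughly like the sketch below. This is an illustration only, not the actual modification to cl_check_ack. It assumes carriage-return segment terminators and `|` as the field separator, and `find_msa_index` is a hypothetical helper name.
  
  ```tcl
  # Hypothetical helper: return the 0-based index of the MSA segment, or -1.
  # Assumes segments are terminated by carriage returns (standard HL7 v2).
  proc find_msa_index {msg} {
      set segments [split $msg \r]
      return [lsearch -regexp $segments {^MSA\|}]
  }
  
  # Example ACK with an SFT segment ahead of the MSA (made-up sample data).
  set ack "MSH|^~\\&|RCV|FAC|SND|FAC|20160816110117||ACK|1|P|2.5\rSFT|Vendor|1.0\rMSA|AA|1\r"
  set idx [find_msa_index $ack]
  if {$idx >= 0} {
      # Field 1 of the MSA segment is the acknowledgment code.
      set fields [split [lindex [split $ack \r] $idx] |]
      puts "MSA at segment $idx, ack code [lindex $fields 1]"
  }
  ```
  
  Searching for the segment rather than assuming a fixed position keeps the proc working when a vendor inserts segments such as SFT before the MSA.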
- August 30, 2016 at 9:23 pm #84386
  David Coffey
  Participant
  In the last 90 days I set up 6.1.2 on a 4-processor 2012 server such as yours. I also experienced the extremely slow throughput.
  
  I trawled the Clovertech logs and came across an entry relating the poor performance to lower-level socket-related activity and the auto connect/reopen times. The initial issue was reported with previous versions.
  
  https://usspvlclovertch2.infor.com/viewtopic.php?t=5046&highlight=throughput
  
  I made changes to my server configuration, relaxing the auto connect/reopen times from 5 seconds (somewhat aggressive!) to 60 seconds for both inbound and outbound threads. This corrected the poor throughput issue and I was able to begin my testing.
- September 7, 2016 at 2:36 pm #84387
  Mike Kim
  Participant
  Hi David,
  
  Your advice about the auto reconnect times was spot on. I think that helped a lot. Also, the sys admins apparently still had some anti-virus software on there, which might have been a contributing factor. Much faster now. Thanks!

The forum ‘Cloverleaf’ is closed to new topics and replies.