› Clovertech Forums › Read Only Archives › Cloverleaf › Cloverleaf › Engine process core dump
[pdl :PDL :ERR /0:hie_sof_mdm_ob:12/07/2014 01:05:35] write timeout expired
[pdl :PDL :ERR /0:hie_sof_mdm_ob:12/07/2014 01:05:35] PDL signaled exception: code 1, msg write failure
[pti :sign:WARN/0:hie_sof_mdm_ob:12/07/2014 01:05:35] Thread 24 ( hie_sof_mdm_ob ) received signal 11
[pti :sign:WARN/0:hie_sof_mdm_ob:12/07/2014 01:05:35] PC = 0xffffffff
Jim
Signal 11, officially known as a “segmentation fault”, means that the program accessed a memory location that was not assigned to it.
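For anyone unfamiliar with signal 11, it can be demonstrated directly. This is a minimal Python sketch (POSIX only, not Cloverleaf-specific): it crashes a child interpreter with a null-pointer read and lets the parent inspect the termination signal safely.

```python
# Minimal sketch (POSIX only): trigger a segmentation fault in a child
# process so the parent can observe the fatal signal without crashing itself.
import signal
import subprocess
import sys

# The child dereferences address 0 via ctypes, which raises SIGSEGV (signal 11).
child = subprocess.run(
    [sys.executable, "-c", "import ctypes; ctypes.string_at(0)"],
    capture_output=True,
)

# On POSIX, a negative return code is the negated number of the fatal signal,
# so a segfaulting child reports -11 here.
print(child.returncode)
```

This mirrors what the engine log shows: the thread touched memory it did not own, the OS delivered signal 11, and the process died with a core dump.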
Is this one of the “standard” Cloverleaf PDLs or one written by you? What version of Cloverleaf? Had this thread been running successfully for a while before this started? Any recent changes?
You might try recompiling or, better still, if it is an MLP PDL, use the encapsulation mode in the standard TCP/IP protocol.
Charlie, we upgraded to 6.0.2 on 11/8 and this is the standard mlp_tcp.pdl. At first it seemed to be just our VPN connections, but then it happened again with one of our in-house connections. We have never used the encapsulation mode in the standard TCP/IP. How do we set this up correctly?
Thank you,
Jim
Jim
Forgot to add that these connections have been in place for years.
Jim
Charlie,
We tried recompiling the PDL to no avail; the process panicked about an hour later. It only seems to crash when there are over 100 messages queued.
Jim
If it is crashing with messages queued, then it does not seem to be the PDL. There is no queuing within the PDL.
Have you tried turning up full EO to see what is happening just prior to the crash?
If you want to try the encapsulated mode:
Protocol:tcpip -> Properties
Encapsulated -> Configure
Just accept defaults – should be OK for standard MLLP
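For reference, the defaults in the encapsulated configuration correspond to standard MLLP framing: each message is wrapped in a vertical tab byte (0x0B) and terminated with a file separator plus carriage return (0x1C 0x0D). The byte values below are from the HL7 MLLP standard; the function names are just illustrative.

```python
# Sketch of MLLP ("minimal lower layer protocol") framing, the wire format
# that the encapsulated TCP/IP option speaks: <VT> message <FS><CR>.
START_BLOCK = b"\x0b"    # vertical tab
END_BLOCK = b"\x1c\x0d"  # file separator + carriage return

def mllp_wrap(message: bytes) -> bytes:
    """Wrap an HL7 message in MLLP framing bytes."""
    return START_BLOCK + message + END_BLOCK

def mllp_unwrap(frame: bytes) -> bytes:
    """Strip MLLP framing; raises ValueError on a malformed frame."""
    if not (frame.startswith(START_BLOCK) and frame.endswith(END_BLOCK)):
        raise ValueError("not a valid MLLP frame")
    return frame[len(START_BLOCK):-len(END_BLOCK)]

# Round-trip check on a (truncated, illustrative) HL7 header.
hl7 = b"MSH|^~\\&|LAB|HOSP|..."
assert mllp_unwrap(mllp_wrap(hl7)) == hl7
```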
Charlie,
It seems that when a thread is in an opening or up (but not connected) status and the queue depth is greater than 100, the process panics. We have changed the engine output to enable_all and submitted the logs and core dumps to both McKesson and Infor. I have had no resolution yet, but am still hoping they come up with some idea of what is wrong.
Jim
Jim
Charlie,
I have changed 2 of the affected threads to use TCP/IP with encapsulation, and have experienced no outages since the change.
Jim
Jim
I mostly quit using straight TCP/IP connections with length encoding a good while back, even though they are more efficient.
The motivation to stop using them was the undesired side effect that the log file gets pounded when they aren’t connected.
It does not take long for the log file to grow to the max allowed size of 1 GB on our server, which could result in a process crash/hang.
We now have a best practice of configuring our processes to cycle logs at 50 MB inside the NetConfig, which also catches other runaway log output.
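The 50 MB log-cycling idea generalizes beyond the NetConfig. As an illustrative analogue only (this is not Cloverleaf’s mechanism), size-based rotation in Python looks like this:

```python
# Sketch of size-based log rotation, analogous to "cycle logs at 50 MB":
# once the active file hits maxBytes, it is rolled over and a fresh one begun.
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

log_path = os.path.join(tempfile.gettempdir(), "engine_process.log")

handler = RotatingFileHandler(
    log_path,
    maxBytes=50 * 1024 * 1024,  # rotate at 50 MB
    backupCount=5,              # keep five rotated copies, then discard oldest
)
logger = logging.getLogger("engine")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Runaway output (e.g. reconnect spam) is now capped at roughly 300 MB total
# instead of growing toward a 1 GB ceiling.
logger.info("connection retry")
```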
I don’t know if this undesired side effect lives on in current versions of Cloverleaf, because I waved goodbye to TCP/IP length-encoded connections for our internal hops a long, long time ago.
We are using the less efficient PDL/MLP encapsulated connections with a flat “ACK”; granted, more wasteful on the machine, but more robust to support.
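For contrast with the MLLP delimiter approach, length-encoded framing prefixes each message with its byte count instead of using start/end bytes. The exact header layout Cloverleaf uses for length encoding is configurable, so the 4-byte big-endian prefix below is only an illustrative assumption:

```python
# Sketch of length-prefixed framing: a fixed-size length header, then the
# payload. The receiver reads the header first, then exactly that many bytes.
import struct

def length_wrap(message: bytes) -> bytes:
    """Prefix the message with a 4-byte big-endian length header."""
    return struct.pack(">I", len(message)) + message

def length_unwrap(frame: bytes) -> bytes:
    """Parse a length-prefixed frame; raises ValueError if truncated."""
    (size,) = struct.unpack(">I", frame[:4])
    payload = frame[4:4 + size]
    if len(payload) != size:
        raise ValueError("short read: frame truncated")
    return payload

msg = b"MSH|^~\\&|..."
assert length_unwrap(length_wrap(msg)) == msg
```

The efficiency Russ mentions comes from the receiver knowing the message size up front; the trade-off is that a desynchronized or half-open connection produces garbage lengths, and the retry noise is what pounds the log file.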
Russ Ross
RussRoss318@gmail.com
Fellow Clovertechers
As it turns out, this is a bug in 6.0.2 which manifests itself when you have a connection that is either up or opening and several hundred messages queued for that thread. The process panics and produces a core dump. This is what support had to say:
“R&D told me this was a bug that was reported and the fix will be in 6.0.3 which should be coming out soon. They suggest to go back to 6.0 until the patch is available. “
Jim
Jim
Greetings,
We are running the 6.1.0 Integrator on our Test server right now and are a bit puzzled that they will patch the base 6.0 release.
No mention of 6.1 currently available?
With its planned patch of .01?
Thanks.
Bob, as far as I’m aware this PDL issue is resolved in 6.1.
We’ve just released 6.0.3 which resolves this in 6.0.
We provide official support for both the 6.1 and 6.0 releases – which means we provide patches for both.
Rob Abbott
Cloverleaf Emeritus
Rob,
I do get concerned about fixes in earlier releases that may not be passed along in a later release.
Thanks for the confirmation here.
No problem 🙂 Actually the fix for 6.0.3 was backported from 6.1. So you should be good to go testing 6.1.
Rob Abbott
Cloverleaf Emeritus
Are you sure this issue is fixed in 6.1? I have multiple core dumps in my 6.1 instance.
This does not appear to be resolved in 6.1. I am getting core dumps as well, related to PDL signaled exceptions.
Any advice on how to resolve this in 6.1 would be appreciated.
[pdl :PDL :ERR /0:to_sgsound_adt:09/21/2015 15:18:09] PDL signaled exception: code 1, msg write failure
[pti :sign:WARN/0:to_sgsound_adt:09/21/2015 15:20:14] Thread 3 ( to_sgsound_adt ) received signal 11
[pti :sign:WARN/0:to_sgsound_adt:09/21/2015 15:20:14] PC = 0x10025714
Serious one this morning… impacting lab… didn’t catch it right away, and messages disappeared as well?
Running 6.1.1……
Incident : 9015406 Summary: process crash 09/30/2015
Bob
I did not recompile PDLs going from 5.8.5 to 6.1… (did I see that as a recommendation?)
/etc/security/limits
hci:
fsize = 2097151
core = 2097151
cpu = -1
nofiles = 10000
rss = -1
stack = -1
data = -1
BUT… I did not necessarily restart (all the) processes after the above limits change… so there is hope… but it does beg the question: why the need to raise them for 6.1.1?
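Worth noting: changes in /etc/security/limits only apply to processes started after the change, which is why the restart matters. A small POSIX sketch (illustrative, not Cloverleaf-specific) for checking which limits a running process actually inherited:

```python
# Sketch: inspect the resource limits the *current* process inherited, to
# confirm whether it was started before or after a limits change.
import resource

# Core file size limit governs whether/how large a core dump can be written.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("core (soft):", "unlimited" if soft == resource.RLIM_INFINITY else soft)

# File size limit corresponds to the "fsize" entry in /etc/security/limits.
soft_f, hard_f = resource.getrlimit(resource.RLIMIT_FSIZE)
print("fsize (soft):", "unlimited" if soft_f == resource.RLIM_INFINITY else soft_f)
```

Running something like this from a Tcl or shell hook inside each process (or just checking the hci user’s `ulimit -a` in the process environment) would confirm whether the raised limits actually took effect before blaming 6.1.1.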
If I recall, I believe the ‘hcirootcopy’ process recompiles the pdl’s for you when you promote the sites.
Jim Cobane
Henry Ford Health
I think you’re correct, Jim.
We are running Cloverleaf 6.1.1 and we have an issue with a process going down in a signal 11 when we bounce an UP thread with a few hundred messages in it. Should this be happening in 6.1.1? Is there a patch we are not aware of?
If this is only happening for the one process, you may want to take a look at any of the procs/Tcl that may be employed on those associated threads, particularly something that may be running in ‘start’ mode context within a proc, since it appears to be happening when bouncing a thread.
Jim Cobane
Henry Ford Health
Guinea pig: Infor has delivered to me a possible fix for 6.1.1.
We have run without incident in the test environments. Plan on moving to prod end of next week.
When GA?
Setting to solution proposed for applying 6.1.2.1 when available.
In the interim:
Change your protocol from the PDL to straight TCP/IP with encapsulation… the PDL is the issue.
Bob