PDL errors

  • #51444
    Chris Williams
    Participant

      Does anyone have any insight into what this error message is really reporting? We wound up with over 4 million of these lines in one of the process logs on Sunday, just before a core dump:

      [pdl :PDL :ERR /0:   OB_LABTEST:12/20/2009 10:42:59] read returned error 34 (Numerical result out of range)

      Thanks.

      • #70327
        Mike Ellert
        Participant

          Hi Chris.  Did you ever get any help on this?  We struggle with the same issue and it ONLY occurs on threads over VPN tunnels.  I’ve never found a solution to the problem.

        • #70328
          Jim Kosloskey
          Participant

            As far as I know, this is a system-dependent error caused by the receiving system. In other words, it is something of a 'catch-all' code, and only the receiving system knows why it was generated.

            It is the receiving system that would need to tell you under what conditions this TCP/IP return code gets set.

            I believe PDL is receiving this error status while communicating with the receiving system.

            I do not recall seeing this error in our environment, although I have to say we do not communicate via a VPN.

            email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

          • #70329
            Chris Williams
            Participant

              One bit I left out was that this is a VPN connection that gets hung in a FIN_WAIT2 state. I've got the tcp-keepalive set at 15 minutes, so we shouldn't be having a timeout problem. Any idea why we would get such a huge flood of error messages in the log? The affected process seems to grab all the available CPU time, and all the other processes grind to a halt.
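
              A quick way to spot the half-closed sockets from the engine host, in case it helps anyone comparing notes (just a sketch that assumes a Unix host with netstat on the PATH; the 5555 port is only a placeholder for the thread's port):

              Code:

              # Count half-closed sockets for the outbound port (5555 is a placeholder).
              # The "|| true" keeps exec from throwing when grep finds no match.
              set halfClosed [exec sh -c {netstat -an | grep -E "FIN_WAIT_?2" | grep 5555 || true}]
              puts "FIN_WAIT_2 sockets:\n$halfClosed"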

            • #70330
              Jennifer Hardesty
              Participant

                We have the same problem with one of our receiving systems that has 6 outbound queues/threads.  It happens about once every 7-10 days.  Though the connection goes through a VPN, the excuse that the VPN is dropping due to a lull cannot be used: the drop always occurs during the daytime, and the data being routed through the connections is a massive amount of ADT, labs, medical records, and transcriptions from multiple facilities and applications.

                The problem usually presents itself two or three hours after it has actually begun; no alerts fire on Cloverleaf.  It begins with messages like the following appearing in the log:

                Code:

                to_hin_miles_txt/RESEND_OB_MSG_38/reply_gen:     Resend 8 of saved msg with MID '0.0.103374503' at Thu Mar 25 13:59:28 EDT 2010

                Once the count hits 8 for an interface, the following will appear in the log:

                Code:


                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 13:59:31] read failed: Connection timed out
                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 13:59:31] read returned error 78 (Connection timed out)
                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 13:59:31] PDL signaled exception: code 1, msg device error (remote side probably shut down)

                I have no idea where the “8” comes from.  It doesn’t seem to be hard coded anywhere.  Shortly afterward, the log will begin filling with:

                Code:


                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 15:29:16] read returned error 78 (Connection timed out)
                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 15:29:16] read returned error 78 (Connection timed out)
                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 15:29:16] read returned error 78 (Connection timed out)
                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 15:29:16] read returned error 78 (Connection timed out)

                The thread may or may not appear yellow or red to indicate it is backed up.  Most of the time it remains green, often for hours.  There are breaks in this repetitive message writing, so the output details and related responses for the other threads going through the same VPN to the same application (but to other ports) can still be written to the log.  Eventually, however, the other threads in the process become affected at random: some of the other outbound threads begin to back up, the queues behind them back up as well, and finally our other interface engine is affected, which is when an alert gets triggered.  As I said, that is a few hours down the line.

                By then, the log and err files are 100+ GB.  The only solution is to stop the whole process, save and archive off the log and err files, and restart the process clean.  If we don't, it's possible we'll end up with a dbvista error and a down site later in the day, not just a down interface.

                I don't understand why this particular interface behaves differently from any of the others.  We have other interfaces going through VPN tunnels, and the basic setup on the inbound and outbound tabs matches just about every other interface configured on the Production server.  Of course, the vendor never sees a problem when it occurs either.

              • #70331
                Robert Gordon
                Participant

                  The problem with the FIN_WAIT condition can be solved with an OS patch, which I know still exists for Unix, AIX, Windoze.

                  The second problem, regarding the lab field, is really simple: if you're not using the field in Cloverleaf for a calculation (and 99.999% of the time you're not), change the field in your variant definition and increase the size by at least 1 or 2 characters, since your variant definition is probably truncating the lab result because the value is longer than the field allows.  Or use a Tcl proc on the field in question to shorten the value; i.e. 1.0000000 can still be represented as 1.00.  But check with your hospital lab people before doing any form of rounding or truncation, as every numeric place in a lab result has meaning.
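
                  A rough sketch of that kind of Tcl shortening, assuming the field holds a plain numeric string and that two decimal places is an acceptable target (the proc name and the precision are only illustrative, not a stock Cloverleaf proc):

                  Code:

                  # Round a numeric lab value to a fixed number of decimal places.
                  # Returns the value unchanged if it is not a plain number.
                  # As noted above, confirm with the lab before rounding or truncating anything.
                  proc shorten_lab_value { value {places 2} } {
                     if { ![string is double -strict $value] } {
                         return $value
                     }
                     return [format "%.${places}f" $value]
                  }

                  # Example: shorten_lab_value 1.0000000 returns 1.00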

                • #70332
                  Jennifer Hardesty
                  Participant

                    Robert Gordon wrote:

                    The problem with the FIN_WAIT condition can be solved with an OS patch, which I know still exists for Unix, AIX, Windoze.

                    We just upgraded our AIX for 5.7.  Which OS versions are you referring to?

                  • #70333
                    Jim Kosloskey
                    Participant

                      Jennifer,

                      I think the 8 is coming from a proc you are using called RESEND_OB_MSG_38, which appears to be part of your acknowledgment handling (it looks like its job is to resend the outbound message). See this piece of the log you posted:

                      to_hin_miles_txt/RESEND_OB_MSG_38/reply_gen:     Resend 8 of saved msg with MID '0.0.103374503' at Thu Mar 25 13:59:28 EDT 2010

                      So I would look at that proc and see where the above display is being generated and then what action it takes.

                      My guess is you are not receiving an acknowledgement from the receiving system and are resending until some threshold is reached. It is when that threshold is reached that your situation seems to get worse.

                      I am guessing the proc in question is the one sensitive to the resend threshold and you probably need to find out what it is trying to do when the threshold is reached.

                      Also, since you are on 5.7, there are new recovery procs that handle the new state a message can be in. If you are not using the new recovery procs, do you know whether your recovery procs (the one in question looks like part of that set) have been modified to work with the new recovery in 5.7?

                      email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

                    • #70334
                      Jennifer Hardesty
                      Participant

                        We are not on 5.7 yet; we are in the process of upgrading.  We have upgraded our OS to the minimum level that 5.7 requires and that 5.4 can still run on.  We have one process running in 5.7 in both our Test and Production environments.  Everything else is still running on 5.4.

                        As for the resend_ob_msg_38 Tcl proc, I don't think it's much different from anyone else's resend proc from the pre-5.7 recovery Tcl functions.  One thing to note: the "Outbound Retries" settings are all set to 5 for this process.  Clearly that's not what is happening.

                        Code:


                        ######################################################################
                        # Name:         resend_ob_msg_38
                        # Purpose:      If we didn't receive a reply, then resend the msg
                        #               and save it (if save_ob_msg is being used)
                        #               To be used in the IB REPLY Gen TPS ONLY !
                        # Args: tps keyedlist containing:
                        #       MODE    run mode ("start" or "run")
                        #       MSGID   message handle
                        #       ARGS    keyed list of user arguments containing:
                        #               DEBUG      0 (default). 1 provides more output
                        #               RESEND     0 = do not resend, go on with next message
                        #                          1 (default) = keep resending message until replied to
                        #               RETRIES    the maximum number of message resends
                        #                          (defaults to 3). If the number of retries is exceeded
                        #                          the 'ob_save' message will be forwarded to the error
                        #                          database. Value "-1" will keep resending
                        #
                        # Returns: tps keyed list containing
                        #       CONTINUE        Of all msgs if Resend 1
                        #       KILL            Of all msgs if Resend 0
                        #
                        # usage:    'Inbound>Configure...Reply Generation...'
                        #
                        proc resend_ob_msg_38 { args } {
                           global HciConnName ob_save _resendRetry38 _validateRetry38

                           keylget args MODE mode
                           keylget args CONTEXT ctx

                           #set module "[string toupper [lindex [info level 0] 0]]/$HciConnName/$ctx"
                           set module "$HciConnName/[string toupper [lindex [info level 0] 0]]/$ctx"

                           keylget args ARGS uargs
                           set debug         0       ; keylget uargs DEBUG   debug
                           set resend        1       ; keylget uargs RESEND  resend
                           set _maxRetries38 -1      ; keylget uargs RETRIES _maxRetries38

                           set dispList {}
                           switch -exact -- $mode {
                               start {
                                   # Initialize the ob_save global
                                   set ob_save ""
                                   set _resendRetry38 1
                                   echo "$module: Using settings - RESEND = $resend - RETRIES = $_maxRetries38"
                                   return ""
                               }

                               run {
                                   keylget args MSGID mh

                                   if { ![cequal $ctx "reply_gen"] } {
                                       echo "$module: Called with invalid context! Should be REPLY GENERATION only."
                                       echo "$module: Continuing msg."
                                       lappend dispList "CONTINUE $mh"
                                   }

                                   if { [cequal $ob_save ""] } {
                                       # The save_ob_msg proc is not used, or state 14 after thread restart
                                       set mid [msgmetaget $mh MID]                ; # {DOMAIN 0} {HUB 0} {NUM 0}
                                       catch {set mid "[keylget mid DOMAIN].[keylget mid HUB].[keylget mid NUM]"}

                                       set msgState [msgmetaget $mh STATE]
                                       if { $msgState == 14 } {
                                           if { $debug } {
                                               echo "$module:     Resending state 14 msg with MID '$mid' at [fmtclock [getclock]]"
                                           }

                                           lappend dispList "PROTO $mh"
                                       } else {
                                           if { $debug } {
                                               echo "$module:     Variable ob_save is empty, and no response within timeout."
                                               echo "$module:     Resending timed out msg with MID '$mid' at [fmtclock [getclock]]"
                                           }

                                           lappend dispList "CONTINUE $mh"
                                       }
                                   } else {
                                       # The save_ob_msg proc is being used
                                       # -- destroy the new msg, possibly resend the saved msg

                                       set currmid [msgmetaget $mh MID]            ; # {DOMAIN 0} {HUB 0} {NUM 0}
                                       catch {set currmid "[keylget currmid DOMAIN].[keylget currmid HUB].[keylget currmid NUM]"}

                                       if { $debug } {
                                           echo "$module:     No response within timeout. Killing timed out msg with MID '$currmid'"
                                       }
                                       lappend dispList "KILL $mh"

                                       # -- and resend the saved msg.

                                       set orgmid  [msgmetaget $ob_save MID]       ; # {DOMAIN 0} {HUB 0} {NUM 0}
                                       catch {set orgmid "[keylget orgmid DOMAIN].[keylget orgmid HUB].[keylget orgmid NUM]"}

                                       if { $resend == 0 } {
                                           echo "$module:     Resend is off. Killing msg with MID '$orgmid' at [fmtclock [getclock]]"

                                           lappend dispList "KILL $ob_save"
                                           set _resendRetry38 0
                                       } elseif { [cequal $_maxRetries38 "-1"] } {
                                           echo "$module:     Resend $_resendRetry38 of saved msg with MID '$orgmid' at [fmtclock [getclock]]"

                                           lappend dispList "PROTO $ob_save"
                                           incr _resendRetry38
                                       } elseif { $_resendRetry38 <= $_maxRetries38 } {
                                           if { $debug } {
                                               echo "$module:     Resend Retry: $_resendRetry38  --  Maximum = $_maxRetries38"
                                           }
                                           echo "$module:     Resending saved msg with MID '$orgmid' at [fmtclock [getclock]]"

                                           lappend dispList "PROTO $ob_save"
                                           incr _resendRetry38
                                       } else {
                                           # Max retries reached. Error msg out
                                           set errmsg "$module:     Maximum of $_maxRetries38 retries reached, but no reply..."
                                           echo "$module: $errmsg"

                                           set usrdata [msgmetaget $ob_save USERDATA]
                                           catch {keylset usrdata ERROR $errmsg}
                                           msgmetaset $ob_save USERDATA $usrdata

                                           lappend dispList "ERROR $ob_save"
                                           set _resendRetry38   0
                                           set _validateRetry38 0
                                       }

                                       # Clear the global -- the msg will be re-saved after send.
                                       set ob_save ""
                                   }
                               }

                               shutdown {
                               }

                               default {
                                   echo "Unknown mode '$mode' in $module"
                               }
                           }
                           return $dispList
                        }
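
                        For reference, the header comment says the user arguments arrive as a keyed list in the Args field of the TPS stack.  If we were passing a limit explicitly, I assume it would look something like the line below (illustrative only; per the code, a value of -1 means resend forever):

                        Code:

                        {RETRIES 5} {RESEND 1} {DEBUG 0}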

                      • #70335
                        Jim Kosloskey
                        Participant

                          Jennifer,

                          The resend proc is issuing the message you are seeing:

                          Code:

                          to_hin_miles_txt/RESEND_OB_MSG_38/reply_gen:     Resend 8 of saved msg with MID '0.0.103374503' at Thu Mar 25 13:59:28 EDT 2010

                          here is the code from the proc:

                            if { $resend == 0 } {
                                echo "$module:     Resend is off. Killing msg with MID '$orgmid' at [fmtclock [getclock]]"

                                lappend dispList "KILL $ob_save"
                                set _resendRetry38 0
                            } elseif { [cequal $_maxRetries38 "-1"] } {
                                echo "$module:     Resend $_resendRetry38 of saved msg with MID '$orgmid' at [fmtclock [getclock]]"

                                lappend dispList "PROTO $ob_save"
                                incr _resendRetry38

                          Check the RETRIES argument to the resend_ob_msg_38 proc; I am guessing it has -1 as its value.

                          If that is the case, then you will be resending forever, and I suspect that eventually the receiving system gets to the point where it disconnects.
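
                          Just to illustrate the defaulting, here is a minimal standalone sketch (it assumes TclX for keylget/cequal, which the Cloverleaf Tcl shell provides; run it outside the engine):

                          Code:

                          package require Tclx

                          # Simulate the argument handling in resend_ob_msg_38 when the TPS Args
                          # field is empty, i.e. no RETRIES key is supplied.
                          set uargs ""

                          set _maxRetries38 -1                  ;# same default the proc sets
                          keylget uargs RETRIES _maxRetries38   ;# no RETRIES key, so -1 stands

                          if { [cequal $_maxRetries38 "-1"] } {
                              puts "RETRIES defaulted to -1: the proc takes the resend-forever branch"
                          }

                          With -1 in play, _resendRetry38 just keeps incrementing, which is where the "Resend 8" count in your log comes from.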

                          email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

                        • #70336
                          Jennifer Hardesty
                          Participant

                            No, we don't pass arguments to the recovery procs.  As you can see in the screenshot I posted above, we expected the maximum number of retries to be 5; however, that is not the case.  I'm not sure what the point of that field is if not to define the number of retries for the resend proc.
