PDL errors

  • #51444
    Chris Williams
    Participant

      Does anyone have any insight into what this error message is really reporting? We wound up with over 4 million of these lines in one of the process logs on Sunday, just before a core dump:

      [pdl :PDL :ERR /0:   OB_LABTEST:12/20/2009 10:42:59] read returned error 34 (Numerical result out of range)

      Thanks.

      • #70327
        Mike Ellert
        Participant

          Hi Chris.  Did you ever get any help on this?  We struggle with the same issue and it ONLY occurs on threads over VPN tunnels.  I’ve never found a solution to the problem.

        • #70328
          Jim Kosloskey
          Participant

            As far as I know, this is a system-dependent error caused by the receiving system. In other words, it is something of a 'catch-all' code, and only the receiving system knows why it was generated.

            It is the receiving system that would need to tell you under what conditions this TCP/IP return code gets set.

            I believe PDL is receiving this error status while communicating with the receiving system.

            I do not recall seeing this error in our environment, although I have to say we do not communicate via a VPN.

            email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

          • #70329
            Chris Williams
            Participant

              One bit I left out was that this is a VPN connection that gets hung in a FIN_WAIT2 state. I've got the tcp-keepalive set at 15 minutes, so we shouldn't be having a timeout problem. Any idea why we would get such a huge flood of error messages in the log? The affected process seems to grab all the available CPU time, and all the other processes grind to a halt.
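
              A quick way to spot the half-closed sockets from the engine host, in case it helps anyone comparing notes (just a sketch that assumes a Unix host with netstat on the PATH; the 5555 port is only a placeholder for the thread's port):

              Code:

              # Count half-closed sockets for the outbound port (5555 is a placeholder).
              # The "|| true" keeps exec from throwing when grep finds no match.
              set halfClosed [exec sh -c {netstat -an | grep -E "FIN_WAIT_?2" | grep 5555 || true}]
              puts "FIN_WAIT_2 sockets:\n$halfClosed"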

            • #70330
              Jennifer Hardesty
              Participant

                We have the same problem with one of our receiving systems that has 6 outbound queues/threads.  It happens about once every 7-10 days.  Though the connection goes through a VPN, the excuse that the VPN is dropping due to a lull cannot be used: the drop always occurs during the daytime, and the data being routed through the connections is a massive amount of ADT, labs, medical records, and transcriptions from multiple facilities and applications.

                The problem usually presents itself two or three hours after it has actually begun; no alerts fire on Cloverleaf.  It begins with messages like the following appearing in the log:

                Code:

                to_hin_miles_txt/RESEND_OB_MSG_38/reply_gen:     Resend 8 of saved msg with MID '0.0.103374503' at Thu Mar 25 13:59:28 EDT 2010

                Once the count hits 8 for an interface, the following will appear in the log:

                Code:


                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 13:59:31] read failed: Connection timed out
                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 13:59:31] read returned error 78 (Connection timed out)
                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 13:59:31] PDL signaled exception: code 1, msg device error (remote side probably shut down)

                I have no idea where the “8” comes from.  It doesn’t seem to be hard coded anywhere.  Shortly afterward, the log will begin filling with:

                Code:


                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 15:29:16] read returned error 78 (Connection timed out)
                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 15:29:16] read returned error 78 (Connection timed out)
                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 15:29:16] read returned error 78 (Connection timed out)
                [pdl :PDL :ERR /0:to_hin_miles_txt:03/25/2010 15:29:16] read returned error 78 (Connection timed out)

                The thread may or may not appear yellow or red to indicate it is backed up.  Most of the time it remains green, often for hours.  There are breaks in this repetitive message writing, so the output details and related responses for the other threads going through the same VPN to the same application (but to other ports) can still be written to the log.  Eventually, however, the other threads in the process become affected at random: some of the other outbound threads begin to back up, the queues behind them back up as well, and finally our other interface engine is affected, which is when an alert gets triggered.  As I said, that is a few hours down the line.

                By then, the log and err files are 100+ GB.  The only solution is to stop the whole process, save and archive off the log and err files, and restart the process clean.  If we don't, it's possible we'll end up with a dbvista error and a down site later in the day, not just a down interface.

                I don't understand why this particular interface behaves differently from any of the others.  We have other interfaces going through VPN tunnels, and the basic setup on the inbound and outbound tabs matches just about every other interface configured on the Production server.  Of course, the vendor never sees a problem when it occurs either.

              • #70331
                Robert Gordon
                Participant

                  The problem with the FIN_WAIT condition can be solved with an OS patch, which I know still exists for Unix, AIX, Windoze.

                  The second problem, regarding the lab field, is really simple: if you're not using the field in Cloverleaf for a calculation (and 99.999% of the time you're not), change the field in your variant definition and increase the size by at least 1 or 2 characters, since your variant definition is probably truncating the lab result because the value is longer than the field allows.  Or use a Tcl proc on the field in question to shorten the value; i.e. 1.0000000 can still be represented as 1.00.  But check with your hospital lab people before doing any form of rounding or truncation, as every numeric place in a lab result has meaning.
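
                  A rough sketch of that kind of Tcl shortening, assuming the field holds a plain numeric string and that two decimal places is an acceptable target (the proc name and the precision are only illustrative, not a stock Cloverleaf proc):

                  Code:

                  # Round a numeric lab value to a fixed number of decimal places.
                  # Returns the value unchanged if it is not a plain number.
                  # As noted above, confirm with the lab before rounding or truncating anything.
                  proc shorten_lab_value { value {places 2} } {
                     if { ![string is double -strict $value] } {
                         return $value
                     }
                     return [format "%.${places}f" $value]
                  }

                  # Example: shorten_lab_value 1.0000000 returns 1.00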

                • #70332
                  Jennifer Hardesty
                  Participant

                    Robert Gordon wrote:

                    The problem with the FIN_WAIT condition can be solved with an OS patch, which I know still exists for Unix, AIX, Windoze.

                    We just upgraded our AIX for 5.7.  Which OS versions are you referring to?

                  • #70333
                    Jim Kosloskey
                    Participant

                      Jennifer,

                      I think the 8 is coming from a proc you are using called RESEND_OB_MSG_38, which appears to be part of your acknowledgment handling (it looks like its job is to resend the outbound message). See this piece of the log you posted:

                      to_hin_miles_txt/RESEND_OB_MSG_38/reply_gen:     Resend 8 of saved msg with MID '0.0.103374503' at Thu Mar 25 13:59:28 EDT 2010

                      So I would look at that proc and see where the above display is being generated and then what action it takes.

                      My guess is you are not receiving an acknowledgement from the receiving system and are resending until some threshold is reached. It is when that threshold is reached that your situation seems to get worse.

                      I am guessing the proc in question is the one sensitive to the resend threshold and you probably need to find out what it is trying to do when the threshold is reached.

                      Also, since you are on 5.7, there are new recovery procs that handle the new state a message can be in. If you are not using the new recovery procs, do you know whether your recovery procs (the one in question looks like part of that set) have been modified to work with the new recovery in 5.7?

                      email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

                    • #70334
                      Jennifer Hardesty
                      Participant

                        We are not on 5.7 yet; we are in the process of upgrading.  We have upgraded our OS to the minimum level that 5.7 requires and that 5.4 can still run on.  We have one process running in 5.7 in both our Test and Production environments.  Everything else is still running on 5.4.

                        As for the resend_ob_msg_38 Tcl proc, I don't think it's much different from anyone else's resend proc from the pre-5.7 recovery Tcl functions.  One thing to note: the "Outbound Retries" settings are all set to 5 for this process.  Clearly that's not what is happening.

                        Code:


                        ######################################################################
                        # Name:         resend_ob_msg_38
                        # Purpose:      If we didn't receive a reply, then resend the msg
                        #               and save it (if save_ob_msg is being used)
                        #               To be used in the IB REPLY Gen TPS ONLY !
                        # Args: tps keyedlist containing:
                        #       MODE    run mode ("start" or "run")
                        #       MSGID   message handle
                        #       ARGS    keyed list of user arguments containing:
                        #               DEBUG      0 (default). 1 provides more output
                        #               RESEND     0 = do not resend, go on with next message
                        #                          1 (default) = keep resending message until replied to
                        #               RETRIES    the maximum number of message resends
                        #                          (defaults to 3). If the number of retries is exceeded
                        #                          the 'ob_save' message will be forwarded to the error
                        #                          database. Value "-1" will keep resending
                        #
                        # Returns: tps keyed list containing
                        #       CONTINUE        Of all msgs if Resend 1
                        #       KILL            Of all msgs if Resend 0
                        #
                        # usage:    'Inbound>Configure...Reply Generation...'
                        #
                        proc resend_ob_msg_38 { args } {
                           global HciConnName ob_save _resendRetry38 _validateRetry38

                           keylget args MODE mode
                           keylget args CONTEXT ctx

                           #set module "[string toupper [lindex [info level 0] 0]]/$HciConnName/$ctx"
                           set module "$HciConnName/[string toupper [lindex [info level 0] 0]]/$ctx"

                           keylget args ARGS uargs
                           set debug         0       ; keylget uargs DEBUG   debug
                           set resend        1       ; keylget uargs RESEND  resend
                           set _maxRetries38 -1      ; keylget uargs RETRIES _maxRetries38

                           set dispList {}
                           switch -exact -- $mode {
                               start {
                                   # Initialize the ob_save global
                                   set ob_save ""
                                   set _resendRetry38 1
                                   echo "$module: Using settings - RESEND = $resend - RETRIES = $_maxRetries38"
                                   return ""
                               }

                               run {
                                   keylget args MSGID mh

                                   if { ![cequal $ctx "reply_gen"] } {
                                       echo "$module: Called with invalid context! Should be REPLY GENERATION only."
                                       echo "$module: Continuing msg."
                                       lappend dispList "CONTINUE $mh"
                                   }

                                   if { [cequal $ob_save ""] } {
                                       # The save_ob_msg proc is not used, or state 14 after thread restart
                                       set mid [msgmetaget $mh MID]                ; # {DOMAIN 0} {HUB 0} {NUM 0}
                                       catch {set mid "[keylget mid DOMAIN].[keylget mid HUB].[keylget mid NUM]"}

                                       set msgState [msgmetaget $mh STATE]
                                       if { $msgState == 14 } {
                                           if { $debug } {
                                               echo "$module:     Resending state 14 msg with MID '$mid' at [fmtclock [getclock]]"
                                           }

                                           lappend dispList "PROTO $mh"
                                       } else {
                                           if { $debug } {
                                               echo "$module:     Variable ob_save is empty, and no response within timeout."
                                               echo "$module:     Resending timed out msg with MID '$mid' at [fmtclock [getclock]]"
                                           }

                                           lappend dispList "CONTINUE $mh"
                                       }
                                   } else {
                                       # The save_ob_msg proc is being used
                                       # -- destroy the new msg, possibly resend the saved msg

                                       set currmid [msgmetaget $mh MID]            ; # {DOMAIN 0} {HUB 0} {NUM 0}
                                       catch {set currmid "[keylget currmid DOMAIN].[keylget currmid HUB].[keylget currmid NUM]"}

                                       if { $debug } {
                                           echo "$module:     No response within timeout. Killing timed out msg with MID '$currmid'"
                                       }
                                       lappend dispList "KILL $mh"

                                       # -- and resend the saved msg.

                                       set orgmid  [msgmetaget $ob_save MID]       ; # {DOMAIN 0} {HUB 0} {NUM 0}
                                       catch {set orgmid "[keylget orgmid DOMAIN].[keylget orgmid HUB].[keylget orgmid NUM]"}

                                       if { $resend == 0 } {
                                           echo "$module:     Resend is off. Killing msg with MID '$orgmid' at [fmtclock [getclock]]"

                                           lappend dispList "KILL $ob_save"
                                           set _resendRetry38 0
                                       } elseif { [cequal $_maxRetries38 "-1"] } {
                                           echo "$module:     Resend $_resendRetry38 of saved msg with MID '$orgmid' at [fmtclock [getclock]]"

                                           lappend dispList "PROTO $ob_save"
                                           incr _resendRetry38
                                       } elseif { $_resendRetry38 <= $_maxRetries38 } {
                                           if { $debug } {
                                               echo "$module:     Resend Retry: $_resendRetry38  --  Maximum = $_maxRetries38"
                                           }
                                           echo "$module:     Resending saved msg with MID '$orgmid' at [fmtclock [getclock]]"

                                           lappend dispList "PROTO $ob_save"
                                           incr _resendRetry38
                                       } else {
                                           # Max retries reached. Error msg out
                                           set errmsg "$module:     Maximum of $_maxRetries38 retries reached, but no reply..."
                                           echo "$module: $errmsg"

                                           set usrdata [msgmetaget $ob_save USERDATA]
                                           catch {keylset usrdata ERROR $errmsg}
                                           msgmetaset $ob_save USERDATA $usrdata

                                           lappend dispList "ERROR $ob_save"
                                           set _resendRetry38   0
                                           set _validateRetry38 0
                                       }

                                       # Clear the global -- the msg will be re-saved after send.
                                       set ob_save ""
                                   }
                               }

                               shutdown {
                               }

                               default {
                                   echo "Unknown mode '$mode' in $module"
                               }
                           }
                           return $dispList
                        }
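
                        For reference, the header comment says the user arguments arrive as a keyed list in the Args field of the TPS stack.  If we were passing a limit explicitly, I assume it would look something like the line below (illustrative only; per the code, a value of -1 means resend forever):

                        Code:

                        {RETRIES 5} {RESEND 1} {DEBUG 0}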

                      • #70335
                        Jim Kosloskey
                        Participant

                          Jennifer,

                          The resend proc is issuing the message you are seeing:

                          Code:

                          to_hin_miles_txt/RESEND_OB_MSG_38/reply_gen:     Resend 8 of saved msg with MID '0.0.103374503' at Thu Mar 25 13:59:28 EDT 2010

                          here is the code from the proc:

                            if { $resend == 0 } {
                                echo "$module:     Resend is off. Killing msg with MID '$orgmid' at [fmtclock [getclock]]"

                                lappend dispList "KILL $ob_save"
                                set _resendRetry38 0
                            } elseif { [cequal $_maxRetries38 "-1"] } {
                                echo "$module:     Resend $_resendRetry38 of saved msg with MID '$orgmid' at [fmtclock [getclock]]"

                                lappend dispList "PROTO $ob_save"
                                incr _resendRetry38

                          Check the RETRIES argument to the resend_ob_msg_38 proc; I am guessing it has -1 as its value.

                          If that is the case, then you will be resending forever, and I suspect that eventually the receiving system gets to the point where it disconnects.
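
                          Just to illustrate the defaulting, here is a minimal standalone sketch (it assumes TclX for keylget/cequal, which the Cloverleaf Tcl shell provides; run it outside the engine):

                          Code:

                          package require Tclx

                          # Simulate the argument handling in resend_ob_msg_38 when the TPS Args
                          # field is empty, i.e. no RETRIES key is supplied.
                          set uargs ""

                          set _maxRetries38 -1                  ;# same default the proc sets
                          keylget uargs RETRIES _maxRetries38   ;# no RETRIES key, so -1 stands

                          if { [cequal $_maxRetries38 "-1"] } {
                              puts "RETRIES defaulted to -1: the proc takes the resend-forever branch"
                          }

                          With -1 in play, _resendRetry38 just keeps incrementing, which is where the "Resend 8" count in your log comes from.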

                          email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

                        • #70336
                          Jennifer Hardesty
                          Participant

                            No, we don't pass arguments to the recovery procs.  As you can see in the screenshot I posted above, we expected the maximum number of retries to be 5; however, that is not the case.  I'm not sure what the point of that field is if not to define the number of retries for the resend proc.
