Read returned error 2 in engine log over 3,000 times/second

  • Creator
    Topic
  • #52462
    Jim Rawls
    Participant

      We have a VPN connection throwing this error over 3,000 times per second:  

      [pdl :PDL :ERR /0:to_hub_orumdm:05/06/2011 12:08:48] read returned error 2 (No such file or directory)

      We are on Cloverleaf 5.7 rev 2 on Linux RH 5.  When this happens it goes on for less than a minute but it’s happening more frequently.

      Does anyone know anything about this condition?

    Viewing 7 reply threads
    • Author
      Replies
      • #74301
        Ed Mastascusa
        Participant

          Hi Jim,

          Our environment is CL 5.5 on AIX 5.3.

          We’ve had similar issues on 2 of our VPN connections (never on a non-VPN socket).

          In our case the read error was a slightly different number (our threads are using r/t TCP-IP HL7 MLP). Also, our errors would not stop until our log files hit the maximum Unix file size and the engine process panicked. The errors came fast enough (5K/second) that a panic was guaranteed within 10 minutes.

          The only “solution” we could come up with was a cron job running every 2 minutes that scans the log files for the phrase “read returned error”. When a log file contains more than 100 occurrences of the phrase, we cycle the process via an hcienginestop and hcienginerun.  In the tens of occurrences since we set this up, the cycle has always stopped the runaway error condition for us.  The thread always seems to resume normal operation after the cycling.
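
          A minimal sketch of the kind of log check described above, written in Python purely for illustration. The phrase and the threshold of 100 come from the post; the function names and the `-p <process>` argument form for hcienginestop and hcienginerun are assumptions here and should be checked against your own Cloverleaf installation before use.

```python
import subprocess

PHRASE = "read returned error"   # phrase named in the post
THRESHOLD = 100                  # post cycles the process above 100 occurrences


def count_phrase(log_path, phrase=PHRASE):
    """Return the number of lines in log_path that contain phrase."""
    count = 0
    # errors="replace" keeps the scan going if the log holds odd bytes
    with open(log_path, "r", errors="replace") as fh:
        for line in fh:
            if phrase in line:
                count += 1
    return count


def needs_cycle(log_path, threshold=THRESHOLD):
    """True when the log holds more than `threshold` matching lines."""
    return count_phrase(log_path) > threshold


def cycle_process(process_name, run=subprocess.run):
    """Stop and restart an engine process.

    The command names come from the post; the "-p" argument form is an
    assumption, not something the thread confirms.
    """
    run(["hcienginestop", "-p", process_name], check=True)
    run(["hcienginerun", "-p", process_name], check=True)
```

          A cron entry could call a small wrapper that runs `needs_cycle` on each process log and calls `cycle_process` when it returns True. As noted later in the thread, at several thousand errors per second a 2-minute polling interval still allows a large log to build up between checks.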

        • #74302
          Jim Rawls
          Participant

            Hi Ed,

            Thanks for sharing your experience with this maddening error.  It was indeed a VPN issue; after a few conference calls with everyone even remotely involved, it was traced to a temporary network error.  Since there is no built-in Cloverleaf remedy, we may have to consider doing something like what you’re doing to scan the logs.

            I don’t think Cloverleaf should act this way, writing to the log on every retry, every x milliseconds, and I have asked to have this logged as a bug. Even scanning every 2 minutes for this error can still let it get out of hand very quickly.

          • #74303
            David Barr
            Participant

              I’ve had the same thing happen. I think we were on version 5.5 at the time.

            • #74304
              Keith McLeod
              Participant

                I have had a similar experience on 5.7.

              • #74305
                Leon Tieleman
                Participant

                  This is one of the most annoying problems we have at our customers at the moment.

                  We reported it to support several times. R&D fixed something in 5.7 rev2 and 5.8.0.0, but it looks like it was a fix for one specific PDL error and not for all the errors. It is still occurring in 5.7 Rev2 and higher. I still hope there is a chance this will be fixed for all the different types of errors soon.

                  Release notes 5.7 Rev2

                  Quote:

                  9.1.3         PDL error fills up the disk space in the VPN environment (6248)

                  Errors occur when using TCP-MLP through VPN. The same error echo in the process log until the process panic because the logs filled up the disk.

                  For example:

                  [pdl :PDL :ERR /0: bno31bb_out:01/27/2009 17:04:44] read returned error 0 (Success)

                  When this error occurs, the thread stays in an UP status because there was not a graceful shutdown from the VPN.

                  This error no longer occurs. A sleep interval has been added for retrying the connection, and the engine will now detect if there is an error and put the thread in error state.

                  Release notes 5.8.0.0

                  Quote:

                  6.4         PDL error fills up the disk space in VPN environment (5742)

                  An issue has been reported with using TCP-MLP thru VPN and getting errors. The same error echoes in the process log until the process panics because the logs fill up the disk.

                  When this error occurs, the thread stays in an UP status because there is not a graceful shutdown from the VPN.

                  This error no longer occurs. Now, a sleep interval retries the connection. The engine will now detect there is an error and put the thread in an error state.

                  Some examples of errors:

                  Code:

                  [pdl :PDL :ERR /0:alert_acc_ADT_out:05/11/2011 11:13:32] write of 636 bytes failed: Unknown error
                  [pdl :PDL :ERR /0:alert_acc_ADT_out:05/11/2011 11:13:32] PDL signaled exception: code 1, msg write failure
                  …..
                  [pdl :PDL :ERR /0: star:02/09/2009 11:18:02] read returned error 110 (Connection timed out)
                  …..
                  [pdl :PDL :ERR /0:  PatTerm_ADT:09/03/2009 05:33:16] read failed: Connection timed out
                  [pdl :PDL :ERR /0:  PatTerm_ADT:09/03/2009 05:33:16] read returned error 34 (Numerical result out of

                • #74306
                  Chris Williams
                  Participant

                    We have also experienced this issue with VPN connections and resolved it. The thousands of lines of errors in the log are just a symptom of the problem. We decided to fix the problem itself.

                    There are multiple pieces of equipment between Cloverleaf and the site at the other end of the VPN. Any one of them can time out and shut the connection down without the two endpoints knowing. Most systems default their time-out value to 2 hours or greater. If one of these pieces of equipment has a shorter time-out, then you get this flood of errors, because the connection was not shut down “gracefully”.

                    Our solution was to set the “keep-alive” value on the Cloverleaf box to be shorter than the shortest time-out value for all the equipment used in the VPN connection. (For us, it was one of the routers causing the problem.) We switched the Cloverleaf box from the default of 2 hours down to 15 minutes. That way, the connection has a burst of traffic every 15 minutes, and the problem child is never allowed to time out.
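
                    The post describes a system-wide setting on the Cloverleaf box. As a rough illustration of the same idea at the level of a single socket, the Python sketch below turns on TCP keep-alive and, where the platform exposes the Linux-style options, sets the idle time to 15 minutes. This is not Cloverleaf configuration; the function name and the probe interval and count are assumptions chosen for the example.

```python
import socket


def enable_keepalive(sock, idle_seconds=900, interval_seconds=60, probes=5):
    """Enable TCP keep-alive on sock with a 15-minute idle time by default.

    idle_seconds should be shorter than the shortest idle time-out of any
    device on the path, as the post recommends. interval_seconds and probes
    are illustrative values, not taken from the thread.
    """
    # Turn keep-alive probing on for this socket (portable option).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The per-socket tuning options below exist on Linux; guard each one
    # so the function still works where they are not defined.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

                    On Linux the system-wide default idle time is the `net.ipv4.tcp_keepalive_time` kernel parameter, which defaults to 7200 seconds (the 2 hours mentioned above). It only affects sockets that have keep-alive enabled.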

                  • #74307
                    Jonathan Davis
                    Participant

                      We’re running 5.6 on Red Hat and encountered this same scenario with one major difference: we had it happen on a non-VPN connection. I don’t know if this sheds any light on the subject, but the receiving system (HPF) appeared to have encountered a problem and failed to shut down properly. I believe I remember seeing the thread status as “up” (I wouldn’t swear to it), but the log/err files were logging at a rate of several hundred messages a second. I came to Clovertech to see if anyone else had this problem.

                      If there is a case logged with Quovadx it might be of interest to know that there has been at least one instance when this happened to a connection that wasn’t going through a tunnel.

                    • #74308
                      Jim Rawls
                      Participant

                        Chris, thanks for the TCP keep-alive info.  Our network admin team informed us that the default TCP timeout in the VPN concentrator was 60 minutes.  They created a policy so that connections between Cloverleaf and the destination subnets never time out.  Time will tell if it solves our issue, but so far we’ve had no recurrence.
