Read returned error 2 in engine log over 3,000 times/second

This topic has 8 replies, 7 voices, and was last updated 15 years ago by Jim Rawls.

Creator

Topic
May 6, 2011 at 7:13 pm #52462
Jim Rawls
Participant
We have a VPN connection throwing this error over 3,000 times per second:

[pdl :PDL :ERR /0:to_hub_orumdm:05/06/2011 12:08:48] read returned error 2 (No such file or directory)

We are on Cloverleaf 5.7 rev 2 on Linux RH 5. When this happens it goes on for less than a minute but it’s happening more frequently.

Does anyone know anything about this condition?

Viewing 7 reply threads

Author

Replies
- May 9, 2011 at 3:55 pm #74301
  Ed Mastascusa
  Participant
  Hi Jim,
  
  Our environment is CL 5.5 on AIX 5.3.
  
  We’ve had similar issues on 2 of our VPN connections. (never on a non-VPN socket)
  
  In our case the read error was a slightly different number (our threads are using r/t TCP-IP HL7 MLP). Also, our errors would not stop until our log files hit the maximum Unix file size and the engine process panicked. The errors came fast enough (about 5K per second) that a panic was guaranteed within 10 minutes.
  
  The only “solution” we could come up with was a cron job running every 2 minutes that scans the log files for the phrase “read returned error”. When more than 100 occurrences of the phrase are in a log file, we cycle the process via an hcienginestop and hcienginerun. In the tens of occurrences since we started doing this, the cycle has always stopped the runaway error condition for us. The thread always seems to resume normal operation after the cycling.
- May 13, 2011 at 1:48 pm #74302
  Jim Rawls
  Participant
  Hi Ed,
  
  Thanks for sharing your experience with this maddening error. It was indeed a VPN error which, after a few conference calls with everyone even remotely involved, turned out to be caused by a temporary network error. Since there is no built-in Cloverleaf remedy, we may have to consider doing something like what you’re doing to scan the logs.
  
  I don’t think Cloverleaf should act this way, writing to the log every x milliseconds that it retries, and have asked to have this logged as a bug. Even scanning every 2 minutes for this error can still let it get out of hand very quickly.
- May 16, 2011 at 5:47 pm #74303
  David Barr
  Participant
  I’ve had the same thing happen. I think we were on version 5.5 at the time.
- May 16, 2011 at 11:25 pm #74304
  Keith McLeod
  Participant
  I have had a similar experience on 5.7.
- May 19, 2011 at 7:42 am #74305
  Leon Tieleman
  Participant
  This is one of the most annoying problems we have at our customers at the moment.
  
  We reported it to support several times. R&D fixed something in 5.7 rev2 and 5.8.0.0, but it looks like the fix covered only one specific PDL error and not all of them. It is still occurring in 5.7 Rev2 and higher. I still hope there is a chance this will be fixed for all the different types of errors soon.
  
  Release notes 5.7 Rev2
  
  Quote:
  
  9.1.3 PDL error fills up the disk space in the VPN environment (6248)
  
  Errors occur when using TCP-MLP through VPN. The same error echo in the process log until the process panic because the logs filled up the disk.
  
  For example:
  
  [pdl :PDL :ERR /0: bno31bb_out:01/27/2009 17:04:44] read returned error 0 (Success)
  
  When this error occurs, the thread stays in an UP status because there was not a graceful shutdown from the VPN.
  
  This error no longer occurs. A sleep interval has been added for retrying the connection, and the engine will now detect if there is an error and put the thread in error state.
  
  Release notes 5.8.0.0
  
  Quote:
  
  6.4 PDL error fills up the disk space in VPN environment (5742)
  
  An issue has been reported with using TCP-MLP thru VPN and getting errors. The same error echoes in the process log until the process panics because the logs fill up the disk.
  
  When this error occurs, the thread stays in an UP status because there is not a graceful shutdown from the VPN.
  
  This error no longer occurs. Now, a sleep interval retries the connection. The engine will now detect there is an error and put the thread in an error state.
  
  Some examples of errors:
  
  Code:
  
  [pdl :PDL :ERR /0:alert_acc_ADT_out:05/11/2011 11:13:32] write of 636 bytes failed: Unknown error
  
  [pdl :PDL :ERR /0:alert_acc_ADT_out:05/11/2011 11:13:32] PDL signaled exception: code 1, msg write failure
  
  …..
  
  [pdl :PDL :ERR /0: star:02/09/2009 11:18:02] read returned error 110 (Connection timed out)
  
  …..
  
  [pdl :PDL :ERR /0: PatTerm_ADT:09/03/2009 05:33:16] read failed: Connection timed out
  
  [pdl :PDL :ERR /0: PatTerm_ADT:09/03/2009 05:33:16] read returned error 34 (Numerical result out of
- May 20, 2011 at 10:11 pm #74306
  Chris Williams
  Participant
  We have also experienced this issue with VPN connections and resolved it. The thousands of lines of errors in the log are just a symptom of the problem. We decided to fix the problem itself.
  
  There are multiple pieces of equipment between Cloverleaf and the site at the other end of the VPN. Any one of them can time out and shut the connection down without the two endpoints knowing. Most systems default their time-out value to 2 hours or greater. If one of these pieces of equipment has a shorter time-out, then you get this flood of errors, because the connection was not shut down “gracefully”.
  
  Our solution was to set the “keep-alive” value on the Cloverleaf box to be shorter than the shortest time-out value of any equipment used in the VPN connection. (For us, it was one of the routers causing the problem.) We switched the Cloverleaf box from the default of 2 hours down to 15 minutes. That way, the connection has a burst of traffic every 15 minutes, and the problem child is never allowed to time out.
- May 23, 2011 at 7:29 pm #74307
  Jonathan Davis
  Participant
  We’re running 5.6 on Red Hat and encountered this same scenario with one major difference: it happened on a non-VPN connection. I don’t know whether this sheds any light on the subject, but the receiving system (HPF) appeared to have encountered a problem and failed to shut down properly. I believe I remember seeing the thread status as “up” (I wouldn’t swear to it), but the log/err files were logging at a rate of several hundred messages a second. I came to Clovertech to see if anyone else had this problem.
  
  If there is a case logged with Quovadx it might be of interest to know that there has been at least one instance when this happened to a connection that wasn’t going through a tunnel.
- May 25, 2011 at 5:27 pm #74308
  Jim Rawls
  Participant
  Chris, thanks for the TCP keep-alive info. Our network admin team informed us that the default TCP timeout in the VPN concentrator was 60 minutes. They created a policy that causes it not to time out at all between Cloverleaf and the destination subnets. Time will tell if it solves our issue, but we’ve had no recurrence.
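
A minimal sketch of the keep-alive idea from Chris's reply, shown for a single socket. Chris changed the setting on the Cloverleaf box itself; the thread does not say how, and that depends on the OS. The Python below only illustrates the mechanism on Linux: turn on TCP keep-alive and pick an idle time below the shortest timeout on the path. The function names, the 0.25 fraction, and the probe interval and count are assumptions for illustration, not Cloverleaf settings.

```python
import socket


def idle_below(timeouts_s, fraction=0.25):
    """Pick a keep-alive idle time well under the shortest timeout on the path.

    `fraction` is an illustrative safety margin, not a recommended value.
    """
    return int(min(timeouts_s) * fraction)


def enable_keepalive(sock, idle_s=900, interval_s=60, probes=5):
    """Enable TCP keep-alive probes on one socket (Linux option names)."""
    # Ask the kernel to send keep-alive probes on this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds the connection may sit idle before the first probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    # Seconds between probes once probing has started.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    # Unanswered probes allowed before the kernel drops the connection.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    # Hypothetical path: a 60-minute concentrator and a 2-hour default elsewhere.
    idle = idle_below([3600, 7200])
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM), idle_s=idle)
    print("keep-alive idle seconds:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))
    s.close()
```

With the 60-minute concentrator timeout Jim mentions, `idle_below([3600, 7200])` gives 900 seconds, which happens to match the 15 minutes Chris used. A probe that goes unanswered also lets the sending side notice a dead connection instead of leaving the thread in an UP state.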

The forum ‘Cloverleaf’ is closed to new topics and replies.
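
For anyone who ends up needing the log-scanning workaround Ed describes earlier in the thread, here is a rough Python sketch of that kind of check. It is not Ed's script. The phrase and the threshold of 100 come from his post; the function names and the `cycle` callback are hypothetical. How the process is actually cycled (Ed mentions hcienginestop and hcienginerun) is left to the caller, since the thread does not show the exact commands.

```python
PHRASE = "read returned error"  # phrase Ed's cron job searched for
THRESHOLD = 100                 # Ed cycled the process above this many hits


def count_phrase(log_path, phrase=PHRASE):
    """Count the lines in one log file that contain the phrase."""
    count = 0
    # errors="replace" keeps stray bytes in the log from aborting the scan.
    with open(log_path, errors="replace") as log:
        for line in log:
            if phrase in line:
                count += 1
    return count


def needs_cycle(log_path, threshold=THRESHOLD):
    """True when the log holds more than `threshold` matching lines."""
    return count_phrase(log_path) > threshold


def check_logs(log_paths, cycle):
    """Call `cycle(path)` for each log over the threshold; return those paths.

    `cycle` is a caller-supplied function (hypothetical here) that stops and
    restarts the engine process owning that log.
    """
    cycled = []
    for path in log_paths:
        if needs_cycle(path):
            cycle(path)
            cycled.append(path)
    return cycled
```

Keep Jim's caveat in mind: at 3,000 lines per second, a check that runs every 2 minutes can let roughly 360,000 error lines accumulate before it fires, so the interval matters as much as the threshold.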