Thread doesn’t recognize lost connection

This topic has 17 replies, 7 voices, and was last updated 13 years, 7 months ago by Carol Peterson.

Creator

Topic
April 4, 2007 at 1:23 pm #49184
Glenn Orlando Pringle
Participant
We are on an AIX box running QDX 5.5 and we have an ORU interface to IMED (Medical Consent). This is a typical mlp_tcp server interface. Since the upgrade from QDX 5.3 to QDX 5.5, the interface starts working (shows green/up), but after approximately 3 minutes (on average) the connection on the IMED side drops while Quovadx still shows it as Up. A stop and restart on both sides (QDX and IMED) brings everything back to operating condition.

I have checked the logs (don’t see anything out of the ordinary), changed ports and rebuilt the thread and I’m still experiencing the same problem.

Thanks in advance for your help.

Glenn
Viewing 16 reply threads

Author

Replies
- April 4, 2007 at 5:53 pm #61008
  Russ Ross
  Participant
  I have experienced what you are describing throughout my entire Cloverleaf career, on every version of Cloverleaf I’ve ever used.
  
  Fortunately not every 3 minutes, which would be too often for my workarounds to be of much value.
  
  I just see it happen once in a while and have been attributing it to what I call network hiccups that confuse the interface(s).
  
  An example of what I call a network hiccup: I notice a hung interface and can’t telnet to or ping the server for 5 minutes, then everything is suddenly working again and the network group says they see no problem because it has already gone away.
  
  It is no surprise that I see this condition even more with interfaces going through a VPN.
  
  What I do is set up an alert based on outbound queue depth that will automatically recycle the interface if the queue builds up to N messages for some period of time.
  
  For inbound threads I set up an alert on last received that will automatically cycle the interface if it has been inactive for too long.
  
  If you can get everyone to agree to send a dummy message once a minute through the integration of interest, then the last received alerts can become very proactive.
  
  Russ Ross
  RussRoss318@gmail.com
- April 4, 2007 at 7:37 pm #61009
  Dennis Pfeifer
  Participant
  Try reviewing this thread…
  
  https://usspvlclovertch2.infor.com/viewtopic.php?t=734
  
  Decrease the OS’s keep-alive time.
  
  Dennis
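
A rough sketch of what lowering the keep-alive time could look like on AIX, where TCP tunables are set with the `no` command and `tcp_keepidle` is expressed in half-seconds. The linked thread is not reproduced here, so the 5-minute value and the helper name `keepalive_cmds` are assumptions for illustration only. The helper prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: print (do not run) the AIX "no" command that would lower
# the TCP keep-alive idle time. AIX expresses tcp_keepidle in half-seconds.
keepalive_cmds() {
    idle_seconds=$1
    half_seconds=$((idle_seconds * 2))
    echo "no -o tcp_keepidle=$half_seconds"
}

# Example: start probing idle connections after 5 minutes (illustrative value).
keepalive_cmds 300
```

Printing rather than executing keeps the sketch safe to try on any box; the real change needs root and affects every TCP connection on the server.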
- July 10, 2008 at 6:59 pm #61010
  Gary Atkinson
  Participant
  Quote:
  
  What I do is setup an alert based on outbound queue depth and that will automatically recycle the interface if it builds up to N messages for some period of time.
  
  Russ-
  
  Would you be willing to share your solution/code of how you did this? I have a similar situation where I need to implement.
  
  thanks,
  
  Gary
- July 10, 2008 at 8:02 pm #61011
  John Hamilton
  Participant
  I wish there was a good answer for this but there is not one.
  
  The keep alive has fixed some of them. That is where I would start.
  
  But it will not fix all of them.
  
  The problem is that the OS-to-OS communication that tells the server the client is disconnecting is getting lost. This is typical of connections going over a VPN, where the VPN times out the connection after a period of no activity.
  
  I have seen so many varieties that these days I just pick a workaround, such as running a script that cycles the thread when it shows as up but has had no activity.
  
  Or just cycle the thread every morning at 6:30.
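
A minimal sketch of the "cycle it every morning" workaround, assuming the same `hcicmd ... pstop` / `pstart` calls that the recycle script in this thread uses. The function name, the `DRY_RUN` switch, and the crontab path are hypothetical. A job started from cron would also need the Cloverleaf site environment loaded first.

```shell
#!/bin/sh
# Sketch: bounce one thread with hcicmd pstop followed by pstart.
# With DRY_RUN=1 the commands are only printed, not executed.
cycle_thread() {
    process_name=$1
    thread_name=$2
    for action in pstop pstart; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "hcicmd -p $process_name -c \"$thread_name $action\""
        else
            hcicmd -p "$process_name" -c "$thread_name $action"
            sleep 5
        fi
    done
}

# Hypothetical crontab entry for a daily 6:30 am cycle (path is made up):
# 30 6 * * * /path/to/cycle_thread.sh my_process my_thread
```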
- July 10, 2008 at 8:52 pm #61012
  Russ Ross
  Participant
  Gary:
  
  Here are some sample entries in my default.alrt file to illustrate how I check and alert for nothing received in a long time.
  
  Code:

      {VALUE lastr} {SOURCE ib_ap_results_8066} {MODE actual} {WITH 1} {COMP {>= 900}} {FOR {nmin 10}} {WINDOW {* * 8-18 * * 1-5}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weekday__ib_ap_results_8066 'recycling ib_ap_results_8066, nothing received for a long time' ib_ap_results_8066 ib_ap}}}}
      {VALUE lastr} {SOURCE ib_pathnet_8015} {MODE actual} {WITH 1} {COMP {>= 900}} {FOR {nmin 10}} {WINDOW {* * 7-18 * * 1-5}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weekday__ib_pathnet_8015 'LIS Admins - recycle Pathnet queue manager HF1 because it has not sent cloverleaf ib_pathnet_8015 anything for at least ten minutes' ib_pathnet_8015 ib_pathnet_8015}}}}
  
  Here are some sample entries in my default.alrt file to illustrate how I check and alert for queue depth too large.
  
  Code:

      {VALUE opque} {SOURCE ob_di_dictation_2561} {MODE actual} {WITH 1} {COMP {> 200}} {FOR {nmin 10}} {WINDOW {* * 7-18 * * 1-5}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weekday__ob_di_dictation_2561 'recycling ob_di_dictation_2561 - queue depth too large' ob_di_dictation_2561 ib_pathnet_8015}}}}
      {VALUE opque} {SOURCE ob_di_dictation_2561} {MODE actual} {WITH 1} {COMP {> 200}} {FOR {nmin 10}} {WINDOW {* * 0-6,19-23 * * 1-5}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weeknight__ob_di_dictation_2561 'recycling ob_di_dictation_2561 - queue depth too large' ob_di_dictation_2561 ib_pathnet_8015}}}}
      {VALUE opque} {SOURCE ob_di_dictation_2561} {MODE actual} {WITH 1} {COMP {> 200}} {FOR {nmin 10}} {WINDOW {* * * * * 6,0}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weekend__ob_di_dictation_2561 'recycling ob_di_dictation_2561 - queue depth too large' ob_di_dictation_2561 ib_pathnet_8015}}}}
      {VALUE opque} {SOURCE ob_di_dictation_2561} {MODE actual} {WITH 1} {COMP {> 2000}} {FOR {nmin 10}} {WINDOW {* * 7-18 * * 1-5}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weekday__hub_team,weekday__ob_di_dictation_2561 'recycling ob_di_dictation_2561 - queue depth at 2000 msgs' ob_di_dictation_2561 ib_pathnet_8015}}}}
      {VALUE opque} {SOURCE ob_di_dictation_2561} {MODE actual} {WITH 1} {COMP {> 2000}} {FOR {nmin 10}} {WINDOW {* * 0-6,19-23 * * 1-5}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weeknight__hub_team,weeknight__ob_di_dictation_2561 'recycling ob_di_dictation_2561 - queue depth at 2000 msgs' ob_di_dictation_2561 ib_pathnet_8015}}}}
      {VALUE opque} {SOURCE ob_di_dictation_2561} {MODE actual} {WITH 1} {COMP {> 2000}} {FOR {nmin 10}} {WINDOW {* * * * * 6,0}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weekend__hub_team,weekend__ob_di_dictation_2561 'recycling ob_di_dictation_2561 - queue depth at 2000 msgs' ob_di_dictation_2561 ib_pathnet_8015}}}}
  
  Here is the recycle_thread_alert.ksh script that cycles the thread and sends the emails and pages:
  
  Code:

      #!/usr/bin/ksh

      # Begin Module Header ==============================================================================
      #
      #------
      # Name:
      #------
      #
      # recycle_thread_alert.ksh
      #
      #---------
      # Purpose:
      #---------
      #
      # - create a time stamped log entry in an alerts log file called
      #   $HCISITEDIR/Alerts/$thread_name.log
      # - send e-mail notification to $email_addresses (which could be somebody's pager)
      #   with the specified $email_subject
      # - recycle the specified $thread_name
      #
      #--------
      # Inputs:
      #--------
      #
      # $1 = email_addresses
      # $2 = email_subject
      # $3 = thread_name
      # $4 = process_name
      #
      #-------
      # Notes:
      #-------
      #
      # Use the alerts configurator to configure alerts to call this script.
      #
      # Look at /etc/aliases to see all the sendmail email aliases.
      #
      # Look in directory $HCISITEDIR/Alerts to view the alert messages in the *.log files.
      #
      # Example of normal usage:
      #
      #    recycle_thread_alert.ksh
      #        weekday__p_cbord_adt
      #        'Recycling p_maxsysii because interface is not up'
      #        p_cbord_adt
      #        adtansils
      #
      #
      # hcimsiutil "Proto Status":
      #
      #    0 = thread is dead
      #    1 = thread is opening
      #    2 = thread is up
      #    3 = thread is down
      #
      #---------
      # History:
      #---------
      #
      # 2000.04.03 Russ Ross
      #          - wrote initial version.
      #
      # 2000.11.24 Russ Ross
      #          - modified check to see if the alert is turned off to use the ls command so that
      #            symbolic links would cause an alert to be turned off, for example:
      #                ln -s /dev/null ib_dms_8053.off
      #            this can be used as a visual aid to tell which alerts have been turned off temporarily
      #
      # 2002.04.02 Russ Ross
      #          - modified to start the process and then the thread if the pid file does not exist
      #
      # 2002.09.12 Russ Ross
      #          - modified to use /usr/bin/ksh instead of /bin/ksh
      #
      # 2003.01.03 Russ Ross
      #          - added logic to send out alerts for ib threads
      #              * if nothing has been received since recycling 120 seconds ago
      #          - added logic to reduce unnecessary alerts by doing the following for non-ib threads
      #              * wait 120 seconds after recycling the thread to see if it goes UP
      #              * or see if que depth is greater than 200
      #          - added logic to explode out the email_addresses and record them in the alert log file
      #
      # 2007.04.27 Russ Ross
      #          - modified to send out alert in the body of the email instead of the subject of the email
      #            because the new 2-way pagers truncate the subject line to about 35 characters
      #
      # End of Module Header =============================================================================

      #-----------------------
      # define input variables
      #-----------------------

      email_addresses=$1
      email_subject=$2
      thread_name=$3
      process_name=$4

      #--------------------------------------
      # define functions local to this script
      #--------------------------------------

      function recycle_thread {
          # start the process first if its pid file does not exist
          if [ ! -f $HCISITEDIR/exec/processes/$process_name/pid ]; then
              hcienginerun -p $process_name
              sleep 5
          fi
          hcicmd -p $process_name -c "$thread_name pstop"
          sleep 5
          hcicmd -p $process_name -c "$thread_name pstart"
      }

      function log_alert {
          echo "" >>$HCISITEDIR/Alerts/$thread_name.log
          echo ================================================================================ >>$HCISITEDIR/Alerts/$thread_name.log
          echo "" >>$HCISITEDIR/Alerts/$thread_name.log
          date +"%a %b %d %Y %r ($email_subject)" >>$HCISITEDIR/Alerts/$thread_name.log
          echo "" >>$HCISITEDIR/Alerts/$thread_name.log
          echo "Below is a list of who notification of this alert was sent to:" >>$HCISITEDIR/Alerts/$thread_name.log
          echo "If there are no addresses, then only the thread was cycled but no notification was sent!" >>$HCISITEDIR/Alerts/$thread_name.log
          echo "" >>$HCISITEDIR/Alerts/$thread_name.log
      }

      function send_alert {
          log_alert
          # expand the aliases and record who was notified
          sendmail -bv $email_addresses | awk '{print $1}' >>$HCISITEDIR/Alerts/$thread_name.log
          # empty subject line, alert text in the body (see 2007.04.27 history entry)
          echo "Subject: \n$email_subject\n." | sendmail $email_addresses
      }

      #---------------------------------------------------------------------
      # do not do anything if the alert has been toggled off for this thread
      #---------------------------------------------------------------------

      # if [ ! -f $HCISITEDIR/Alerts/$thread_name.off ]; then
      if [ ! "`ls $HCISITEDIR/Alerts/$thread_name.off 2>/dev/null`" ]; then

          #-----------------------------------------
          # log the fact that an alert got triggered
          #-----------------------------------------

          log_alert

          #-----------------------------------------------
          # always try to recycle the thread at least once
          #-----------------------------------------------

          recycle_thread

          #-----------------------------------------------------------
          # sleep 120 seconds, then evaluate if need to send out alert
          #-----------------------------------------------------------

          sleep 120
          proto_status=`hcimsiutil -dd $thread_name | grep "^Proto Status" | awk -F: '{print $2}' | tr -d ' '`
          ob_que_depth=`hcimsiutil -dd $thread_name | grep "^OB Data QD" | awk -F: '{print $2}' | tr -d ' '`
          ib_last_received=`hcimsiutil -dd $thread_name | grep "^Proto Last Rd" | awk -F: '{print $2}' | tr -d ' '`

          #-------------------------------------------------------------
          # send out alert for ib threads
          # if nothing has been received since recycling 120 seconds ago
          #-------------------------------------------------------------

          if [[ "`echo $thread_name | colrm 3`" = "ib" ]] && [[ "$ib_last_received" = "never" ]]; then
              send_alert
              exit
          fi

          #-----------------------------------------------------------------------
          # send out alert for non-ib threads
          # if thread is still not UP or if outbound que depth is greater than 200
          #-----------------------------------------------------------------------

          # -gt compares numerically; ">" inside [[ ]] would compare as strings
          if [[ "$proto_status" != "2" ]] || [[ $ob_que_depth -gt 200 ]]; then
              send_alert
          fi
      fi
  
  There is a problem when several alerts go off at the same instant and step on each other writing to the log file.
  
  Fixing that is one of the many things on my wish list, but the script is good enough as it is, mostly because we have many smaller sites instead of fewer larger sites.
  
  Russ Ross
  RussRoss318@gmail.com
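
One possible way to keep simultaneous alerts from interleaving their log entries is to take a lock before writing. The sketch below is not part of the posted script; `locked_append`, the `.lock` suffix, and the 30-second limit are assumptions. It relies on `mkdir` either creating the directory or failing, so only one writer gets through at a time.

```shell
#!/bin/sh
# Sketch: serialize appends to a shared alert log with a lock directory.
# mkdir either creates the directory or fails, so only one writer proceeds.
locked_append() {
    logfile=$1
    message=$2
    lockdir="$logfile.lock"
    tries=0
    until mkdir "$lockdir" 2>/dev/null; do
        tries=$((tries + 1))
        # Give up after about 30 seconds rather than hang the alert forever.
        [ "$tries" -ge 30 ] && return 1
        sleep 1
    done
    echo "$message" >>"$logfile"
    rmdir "$lockdir"
}
```

In the real script the whole group of `echo` lines for one alert would go between taking and releasing the lock, so that each entry lands in the log as one piece.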
- July 10, 2008 at 9:18 pm #61013
  Gary Atkinson
  Participant
  Do you have to set any environment variables when running the shell script from the alert tool?
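
The thread does not answer this question directly. What can be said from the posted script is that it reads `$HCISITEDIR`, so that variable has to be set wherever the script runs. A small guard like the one below would make a missing variable obvious; `require_env` is a hypothetical helper, not something from Cloverleaf.

```shell
#!/bin/sh
# Sketch: fail fast when a variable the alert script depends on is missing.
require_env() {
    for name in "$@"; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            return 1
        fi
    done
    return 0
}

# Possible use at the top of recycle_thread_alert.ksh:
# require_env HCISITEDIR || exit 1
```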
- July 10, 2008 at 9:21 pm #61014
  Russ Ross
  Participant
  Gary:
  
  An even more proactive type of alert, developed by our newest team member (Gordon Templeton), will recycle an outbound thread based on the number of resends.
  
  This is superior to queue depth alerts because it can be triggered even if the queue depth is only one message.
  
  Use this proc ( tps_reset_resend_count.tcl ) as the first TPS inbound reply proc:
  
  Code:

      # Begin Module Header ==========================================================
      #
      # -----
      # Name:
      # -----
      #
      # tps_reset_resend_count.tcl
      #
      # --------
      # Purpose:
      # --------
      #
      # Reset counter of resent msgs.
      # Implement only in "TPS Inbound Reply"
      #
      # -----------
      # Input Args:
      # -----------
      #
      # Args: tps keyedlist containing:
      #
      #     MODE    run mode ("start" or "run")
      #     MSGID   message handle
      #     ARGS
      #
      # ------------
      # Output Args:
      # ------------
      #
      # Returns: tps keyed list containing dispositions
      #
      # ------
      # Notes:
      # ------
      #
      # UPoC type = TPS
      #
      #
      # --------
      # History:
      # --------
      #
      # 2008.03.03 Gordon Templeton
      #          - wrote initial version
      #
      # 2008.06.09 Russ Ross
      #          - corrected the name of the counter file from
      #
      #                .tps_email_resends.$HciConnName
      #
      #            to be
      #
      #                .tps_check_resend_count.$HciConnName
      #
      #            so it matches what is being used by TCL proc
      #            tps_check_resend_count.tcl
      #
      # End Module Header ============================================================

      proc tps_reset_resend_count { args } {
          global env HciConnName
          global resend_val

          # $module is used in the messages below but was never set in the posted code
          set module "(tps_reset_resend_count/$HciConnName): "

          keylget args MODE mode      ;# What mode are we in

          switch -exact -- $mode {
              start { }

              run {
                  keylget args CONTEXT ctx
                  keylget args MSGID mh
                  set returnList {}

                  if {$ctx != "sms_ib_reply"} {
                      echo "$module called with invalid context"
                      echo "$module should be SMS INBOUND REPLY"
                      echo "$module continuing msg"
                      return "{CONTINUE $mh}"
                  }

                  #------------------------------------
                  # Reply received; reset resend_ctr.
                  #------------------------------------
                  set resend_val [CtrResetValue .tps_check_resend_count.$HciConnName]

                  #------------------------------------------------------------------------
                  # Pass message to next proc in stack:
                  #     kill_ob_save which will null $ob_save
                  #     hcitpsmsgkill which will kill the reply message.
                  #------------------------------------------------------------------------
                  lappend returnList "CONTINUE $mh"
                  return $returnList
              }

              shutdown {
                  # Doing some clean-up work
              }

              default {
                  echo "Unknown mode in tps_reset_resend_count: '$mode'"
                  return ""           ;# Dont know what to do
              }
          }
      }
  
  Use this proc ( tps_check_resend_count.tcl ) as the first TPS Reply generation proc:
  
  Code:

      # Begin Module Header ==========================================================
      #
      # -----
      # Name:
      # -----
      #
      # tps_check_resend_count.tcl
      #
      # --------
      # Purpose:
      # --------
      #
      # Count resent msgs; email alerts when msg count(s) = values defined RESEND_COUNTS
      # Implement only in "Reply generation"
      #
      # -----------
      # Input Args:
      # -----------
      #
      # Args: tps keyedlist containing:
      #
      #     MODE    run mode ("start" or "run")
      #     MSGID   message handle
      #     ARGS    keyed list of user arguments containing:
      #
      #         RESEND_COUNTS  : list of resend thresholds that trigger an alert notification and recommend avoid using 1
      #                          (default 5 10 20)
      #
      #         EMAIL          : email_addresses to send alert notification to
      #                          (default page_hub_on_call,email_hub_team)
      #
      #         EMAIL_ADDENDUM : email addendum to concatenate to email body automatically generated by this script
      #                          (default "")
      #         DEBUG          : debug flag Y=on
      #                          (default N)
      #
      # Example of usage of User ARGS:
      #
      #     {RESEND_COUNTS {5 10 20}}
      #     {EMAIL {page_hub_on_call,email_hub_team}}
      #     {EMAIL_ADDENDUM {might need to cycle Iguana NT service}}
      #     {DEBUG N}
      #
      #
      # ------------
      # Output Args:
      # ------------
      #
      # Returns: tps keyed list containing dispositions
      #
      # ------
      # Notes:
      # ------
      #
      # UPoC type = TPS
      #
      # This is a MDACC proc that is independent of the recover 33 procs
      # and runs in conjunction with another MDACC proc called tps_reset_resend_count
      # that runs in the NetConfig TPS Inbound Reply stack.
      #
      # Reduce false alerts by not using a resend_count threshold of 1,
      # because when cycling an outbound thread or process it is common that the first message sent will be a resend,
      # especially if the thread was recycled before receiving an ACK for the last message sent.
      #
      # --------
      # History:
      # --------
      #
      # 2008.03.03 Gordon Templeton
      #          - wrote initial version
      #
      # 2008.05.14 Russ Ross
      #          - added USER ARGS
      #
      # End Module Header ============================================================

      proc tps_check_resend_count { args } {
          global env HciConnName
          global env HciSite
          global env HciSiteDir
          global resend_val

          set module "(tps_check_resend_count/$HciConnName): "

          keylget args MODE mode      ;# What mode are we in

          switch -exact -- $mode {
              start {
                  #--------------------------------------------------------
                  # Always initialize counter file at every thread startup.
                  #--------------------------------------------------------
                  CtrInitCounter .tps_check_resend_count.$HciConnName 1 99999 1
                  set resend_value [CtrCurrentValue .tps_check_resend_count.$HciConnName]
                  return ""
              }

              run {
                  keylget args CONTEXT ctx
                  keylget args MSGID mh
                  set returnList {}

                  if {$ctx != "reply_gen"} {
                      echo ""
                      echo "==========================================================================="
                      echo ""
                      echo "$module is being called from wrong place within Netconfig with a context of ($ctx)"
                      echo "$module and should be called from within REPLY GENERATION / TIMEOUT HANDLING context of (reply_gen)"
                      echo "$module so the current message will be continued without taking any other action"
                      echo ""
                      echo "==========================================================================="
                      echo ""
                      return "{CONTINUE $mh}"
                  }

                  #------------------------------------------
                  # Increment counter
                  #------------------------------------------
                  set resend_val [CtrNextValue .tps_check_resend_count.$HciConnName]

                  #----------------------------------------------
                  # get user args, if none provided, set defaults
                  #----------------------------------------------
                  if {![keylget args ARGS.RESEND_COUNTS resend_counts]} {
                      set resend_counts [list 5 10 20]
                  }
                  if {![keylget args ARGS.EMAIL email_addresses]} {
                      set email_addresses page_hub_on_call,email_hub_team
                  }
                  if {![keylget args ARGS.EMAIL_ADDENDUM email_addendum]} {
                      set email_addendum ""
                  }
                  if {![keylget args ARGS.DEBUG debug]} {
                      set debug "N"
                  }
                  set debug [string toupper $debug]

                  #----------------------
                  # echo some debug stuff
                  #----------------------
                  if { "$debug" == "Y"} {
                      echo ""
                      echo "==========================================================================="
                      echo ""
                      echo "$module DEBUG INFO"
                      echo ""
                      echo "$module resend_val ($resend_val)"
                      echo "$module resend_counts ($resend_counts)"
                      echo "$module email_addresses ($email_addresses)"
                      echo "$module debug ($debug)"
                      echo ""
                      echo "==========================================================================="
                      echo ""
                  }

                  #------------------------------------------------------------------------
                  # Pass message to next proc in stack, resend_ob_msg
                  #------------------------------------------------------------------------
                  lappend returnList "CONTINUE $mh"

                  #-------------------------------------------------------------------------------
                  # If number of resends equal one of the values in resend_counts list, send email
                  #-------------------------------------------------------------------------------
                  if { [lsearch -exact $resend_counts $resend_val] != -1 } {

                      #----------------------
                      # set email subject
                      #----------------------
                      set email_subject "Same message resent $resend_val times"
                      set email_body "thread ($HciConnName) in site ($HciSite)"
                      set email_body "$email_body\n$email_addendum"

                      #-------------------------------------------------------------------
                      # Append a time stamped entry to the alerts log file for this thread
                      #-------------------------------------------------------------------
                      set logfile "$HciSiteDir/Alerts/$HciConnName.log"
                      set logfh [open $logfile a+]
                      set ts [fmtclock [getclock] "%a %b %d %Y %r "]
                      puts $logfh ""
                      puts $logfh "================================================================================"
                      puts $logfh ""
                      puts $logfh "$ts($email_subject) ($email_addresses)"
                      puts $logfh "Below is a list of who notification of this alert was sent to:"
                      puts $logfh ""
                      close $logfh
                      # \$1 is escaped so that Tcl passes it through to awk unchanged
                      system "sendmail -bv $email_addresses | awk '{print \$1}' >>$HciSiteDir/Alerts/$HciConnName.log"

                      #--------------------------------------------------------
                      # Send email notification of msg resend
                      #--------------------------------------------------------
                      system "echo \"Subject: $email_subject\n$email_body\" | sendmail $email_addresses"

                      echo ""
                      echo "$module WARNING - resend_count threshold of $resend_val was reached for this msg:"
                      echo ""
                      echo "[msgget $mh]"
                      echo ""
                  }

                  return $returnList
              }

              shutdown {
                  # Doing some clean-up work
              }

              default {
                  echo "Unknown mode in tps_check_resend_count: '$mode'"
                  return ""           ;# Dont know what to do
              }
          } ; #end switch
      } ; #end proc
  
  Russ Ross
  RussRoss318@gmail.com
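
The alerting rule in tps_check_resend_count.tcl is an exact match: an email goes out only when the resend counter equals one of the configured thresholds (5, 10, 20 by default), not on every resend past the first threshold. A shell sketch of the same test, with the hypothetical name `hits_threshold`, is shown here only to make that behaviour easy to try outside the engine.

```shell
#!/bin/sh
# Sketch: succeed only when the resend counter equals one of the thresholds,
# the shell analogue of the proc's "lsearch -exact" test.
hits_threshold() {
    count=$1
    shift
    for threshold in "$@"; do
        [ "$count" -eq "$threshold" ] && return 0
    done
    return 1
}

# With thresholds 5 10 20, a stuck message produces three notifications
# (at the 5th, 10th and 20th resend) instead of one per resend.
for n in 4 5 6 10 20 21; do
    if hits_threshold "$n" 5 10 20; then
        echo "resend $n: alert"
    fi
done
```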
- July 23, 2008 at 12:57 pm #61015
  Gary Atkinson
  Participant
  Ross-
  
  Thanks for sharing those scripts. I tried them out and they work great! 8)
  
  Gary
- September 5, 2008 at 12:12 pm #61016
  Gary Atkinson
  Participant
  Ross-
  
  In your script “tps_check_resend_count” the line of code for the sendmail, which reads:
  
  Code:

      system "sendmail -bv $email_addresses | awk '{print \$1}' >>$HciSiteDir/Alerts/$HciConnName.log"
  
  How does this work? I am not familiar with sendmail, as I have only used mailx.
  
  Thanks again for sharing your code!
  
  Gary
- September 5, 2008 at 7:21 pm #61017
  Russ Ross
  Participant
  sendmail is the command-line mail utility that comes with our OS, which is a flavor of Unix called AIX. It probably comes with several other flavors of Unix, too.
  
  If you are on an AIX server that has the man pages you can type
  
  man sendmail
  
  You could also google it or use whatever other means you have at your disposal.
  
  The particular argument (-bv) in the line of code you are interested in expands $email_addresses, which are email aliases, into the individual addresses. Those are then logged so we know exactly who got notified.
  
  For example, if I run
  
  sendmail -bv email_hub_team | awk '{print $1}'
  
  then I get the following
  
  hci@localhost...
  interhelp@mdanderson.org...
  gtemplet@mdanderson.org...
  jkoslosk@mdanderson.org...
  flsoliman@mdanderson.org...
  pconnoll@mdanderson.org...
  decole@mdanderson.org...
  rross@mdanderson.org...
  cmiller@mdanderson.org...
  
  Russ Ross
  RussRoss318@gmail.com
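
To see what the `awk '{print $1}'` stage contributes without touching a mail system, the sketch below feeds it canned lines in the general shape of `sendmail -bv` output. The addresses are placeholders and the text after the first field is illustrative, not captured from a real server; `first_field` is a hypothetical wrapper name.

```shell
#!/bin/sh
# Sketch: keep only the first whitespace-separated field of each line,
# which is what awk '{print $1}' does to the "sendmail -bv" output.
first_field() {
    awk '{print $1}'
}

# Canned sample; placeholder addresses, illustrative trailing text.
printf '%s\n' \
    'alice@example.org... deliverable: mailer esmtp, host example.org., user alice@example.org' \
    'bob@example.org... deliverable: mailer esmtp, host example.org., user bob@example.org' |
    first_field
```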
- September 5, 2008 at 7:25 pm #61018
  Russ Ross
  Participant
  I fired off mailx and it seems very similar to sendmail. I would not be surprised if mailx is a wrapper around the sendmail program.
  
  Russ Ross
  RussRoss318@gmail.com
- September 5, 2008 at 7:27 pm #61019
  Russ Ross
  Participant
  I believe I read somewhere that mail is a wrapper around the sendmail program, if memory serves me correctly.
  
  Russ Ross
  RussRoss318@gmail.com
- September 5, 2008 at 7:32 pm #61020
  Gary Atkinson
  Participant
  One more question 8)
  
  The file “email_hub_team”, where does sendmail pick this file up from?
  
  Do you put this in the tcl_procs directory?
  
  I tried to do something similar with mailx, but I could not get the addresses to output on the same line.
- September 5, 2008 at 7:52 pm #61021
  Russ Ross
  Participant
  email_hub_team isn’t a file; instead it is an email alias that represents a group of individual email addresses.
  
  I use the file /etc/aliases to define my email aliases that get used by sendmail.
  
  I have numerous group email addresses defined, like email_hub_team, page_hub_team, etc.
  
  Typically each thread that has an alert has its own email alias that I control using what I define in the /etc/aliases file.
  
  Let’s say I have a thread called ob_pathnet_22305, then I would have some email aliases defined in /etc/aliases that might be called
  
  weekday__ob_pathnet_22305
  
  weeknight__ob_pathnet_22305
  
  weekend__ob_pathnet_22305
  
  the weekday__ob_pathnet_22305 email alias might include these already defined email aliases
  
  email_hub_team
  
  page_hub_on_call
  
  page_pathnet_on_call
  
  email_pathnet_support
  
  etc.
  
  It would be simple for me to post our /etc/aliases file but I’m a bit uncomfortable showing more email alias contact information than I’ve already done.
  
  We have a ton of email addresses and pagers in that file, so I think you can understand me not posting it as a tangible example.
  
  Russ Ross
  RussRoss318@gmail.com
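
Since the real file was not posted, here is a made-up fragment showing the layering described above: group aliases first, then a per-thread alias built from them. The alias names come from the post; every address is a placeholder. The sketch just prints the fragment.

```shell
#!/bin/sh
# Sketch: print a sample /etc/aliases fragment. Every address is a placeholder.
sample_aliases() {
    cat <<'EOF'
# group aliases
email_hub_team: alice@example.org, bob@example.org
page_hub_on_call: pager1@example.org
# per-thread alias built from the group aliases above
weekday__ob_pathnet_22305: email_hub_team, page_hub_on_call
EOF
}

sample_aliases
# After editing the real /etc/aliases, sendmail needs its alias database
# rebuilt, typically with: newaliases
```

Running `sendmail -bv weekday__ob_pathnet_22305` against a file like this would be the way to check which individual addresses the alias expands to.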
- September 6, 2008 at 1:06 am #61022
  Gary Atkinson
  Participant
  That’s enough to get me started. Thanks again.
- January 19, 2009 at 6:01 pm #61023
  Bill Tipton
  Participant
  Would there be much change to this code on a non-AIX box? (Windows)
- January 19, 2012 at 9:08 pm #61024
  Carol Peterson
  Participant
  Does anyone have screenshots of this in the GUI? I too have a VPN tunnel that goes down once a day, and not at the same time each day, and I would like to cycle it when 100+ messages get queued up.
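
No screenshots were posted, but a queue-depth entry for this case can be modeled on the default.alrt samples earlier in the thread. The sketch below only prints such an entry; the function name, the thread, process and alias names in the example call, and the message wording are all hypothetical, and the field layout is copied from the posted samples rather than from Cloverleaf documentation.

```shell
#!/bin/sh
# Sketch: print a queue-depth alert entry in the default.alrt layout quoted
# earlier in this thread. All argument values used below are hypothetical.
queue_depth_alert() {
    thread=$1
    process=$2
    depth=$3
    window=$4
    alias_name=$5
    printf "{VALUE opque} {SOURCE %s} {MODE actual} {WITH 1} {COMP {> %s}} {FOR {nmin 10}} {WINDOW {%s}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh %s 'recycling %s, queue depth over %s' %s %s}}}}\n" \
        "$thread" "$depth" "$window" "$alias_name" "$thread" "$depth" "$thread" "$process"
}

# Example: recycle ob_vpn_thread when more than 100 messages are queued
# during the weekday window used in the posted samples.
queue_depth_alert ob_vpn_thread vpn_proc 100 '* * 7-18 * * 1-5' weekday__ob_vpn_thread
```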
The forum ‘Cloverleaf’ is closed to new topics and replies.