Thread doesn't recognize lost connection
I have checked the logs (don’t see anything out of the ordinary), changed ports and rebuilt the thread and I’m still experiencing the same problem.
Thanks in advance for your help.
Glenn
Fortunately not every 3 minutes, which would be too often for my workarounds to be of much value.
I just see it happen once in a while and have been attributing it to what I call network hiccups that confuse the interface(s).
An example of what I call a network hiccup: I notice a hung interface and can't telnet to or ping the server for 5 minutes, then everything is suddenly working, and the network group says they see no problem because it has already gone away.
It is no surprise that I see this condition even more with interfaces going through a VPN.
What I do is set up an alert based on outbound queue depth that will automatically recycle the interface if the queue builds up to N messages for some period of time.
For inbound threads I set up an alert on last received that will automatically cycle the interface if it is inactive for too long.
If you can get everyone to agree to send a dummy message once a minute through the integration of interest, then the last-received alerts can become very proactive.
Russ Ross
RussRoss318@gmail.com
Decrease the OS's keepalive time.
Dennis
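For anyone following up on the keepalive suggestion, here is a minimal sketch of how the current TCP keepalive idle time might be inspected before changing it. It assumes the standard tunables: `no -o tcp_keepidle` on AIX (reported in half-seconds) and /proc/sys/net/ipv4/tcp_keepalive_time on Linux (reported in seconds). The function names are hypothetical, and keepalive probes only help if the socket has SO_KEEPALIVE enabled, which has not been verified here for Cloverleaf's protocol drivers.

```shell
#!/bin/sh
# Sketch only: report the TCP keepalive idle time in seconds.
# Function names are hypothetical; the tunables are the standard AIX/Linux ones.

# Convert a raw tunable value to seconds.
#   $1 = OS name as printed by `uname -s`
#   $2 = raw value (AIX tcp_keepidle is in half-seconds, Linux is in seconds)
keepalive_to_seconds() {
    case "$1" in
        AIX) echo $(( $2 / 2 )) ;;
        *)   echo "$2" ;;
    esac
}

# Read the current setting for this host and print it in seconds.
show_keepalive() {
    os=$(uname -s)
    case "$os" in
        AIX)   raw=$(no -o tcp_keepidle | awk '{print $NF}') ;;
        Linux) raw=$(cat /proc/sys/net/ipv4/tcp_keepalive_time) ;;
        *)     echo "unsupported OS: $os" >&2; return 1 ;;
    esac
    echo "keepalive idle time: $(keepalive_to_seconds "$os" "$raw") seconds"
}
```

Lowering the value is an OS-wide change (for example, `no -o tcp_keepidle=1200` on AIX would be ten minutes), so it is worth coordinating with the system administrators first.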
What I do is set up an alert based on outbound queue depth that will automatically recycle the interface if the queue builds up to N messages for some period of time.
Russ-
Would you be willing to share the solution/code for how you did this? I have a similar situation where I need to implement it.
Thanks,
Gary
Adjusting the keepalive has fixed some of them, so that is where I would start.
But it will not fix all of them.
The problem is that the OS-to-OS communication that tells the server the client is disconnecting is getting lost. This is typical of connections going over a VPN, where the VPN times out the connection on no activity.
I have seen all varieties, to the point that I now just pick a workaround, such as running a script that cycles the thread when there has been no activity but the thread status is still up.
Or just cycle the thread every morning at 6:30.
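The scheduled-cycle workaround could be sketched as a small wrapper called from cron. The `hcicmd ... pstop` and `pstart` commands are the same ones used in the recycle_thread_alert.ksh script posted further down this thread. The function name, the `HCICMD` and `CYCLE_PAUSE` overrides, and the paths in the crontab comment are assumptions for illustration, and cron would need the Cloverleaf environment set up before the script runs.

```shell
#!/bin/sh
# Sketch only: stop and restart one thread, intended to be run from cron.
# HCICMD can be overridden (for example HCICMD=echo) to do a dry run.

cycle_thread() {
    # $1 = process name, $2 = thread name
    "${HCICMD:-hcicmd}" -p "$1" -c "$2 pstop"
    sleep "${CYCLE_PAUSE:-5}"
    "${HCICMD:-hcicmd}" -p "$1" -c "$2 pstart"
}

# Hypothetical crontab entry for a 6:30 cycle every morning:
# 30 6 * * * . /path/to/cloverleaf_env.sh; /path/to/cycle_thread.sh my_process my_thread
if [ $# -eq 2 ]; then
    cycle_thread "$1" "$2"
fi
```

Running it once by hand with `HCICMD=echo` prints the two commands without touching the engine, which is a cheap way to check the process and thread names before scheduling it.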
Here are some sample entries in my default.alrt file to illustrate how I check and alert for nothing received in a long time.
{VALUE lastr} {SOURCE ib_ap_results_8066} {MODE actual} {WITH 1} {COMP {>= 900}} {FOR {nmin 10}} {WINDOW {* * 8-18 * * 1-5}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weekday__ib_ap_results_8066 'recycling ib_ap_results_8066, nothing received for a long time' ib_ap_results_8066 ib_ap}}}}
{VALUE lastr} {SOURCE ib_pathnet_8015} {MODE actual} {WITH 1} {COMP {>= 900}} {FOR {nmin 10}} {WINDOW {* * 7-18 * * 1-5}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weekday__ib_pathnet_8015 'LIS Admins - recycle Pathnet queue manager HF1 because it has not sent cloverleaf ib_pathnet_8015 anything for at least ten minutes' ib_pathnet_8015 ib_pathnet_8015}}}}
Here are some sample entries in my default.alrt file to illustrate how I check and alert for queue depth too large.
{VALUE opque} {SOURCE ob_di_dictation_2561} {MODE actual} {WITH 1} {COMP {> 200}} {FOR {nmin 10}} {WINDOW {* * 7-18 * * 1-5}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weekday__ob_di_dictation_2561 'recycling ob_di_dictation_2561 - queue depth too large' ob_di_dictation_2561 ib_pathnet_8015}}}}
{VALUE opque} {SOURCE ob_di_dictation_2561} {MODE actual} {WITH 1} {COMP {> 200}} {FOR {nmin 10}} {WINDOW {* * 0-6,19-23 * * 1-5}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weeknight__ob_di_dictation_2561 'recycling ob_di_dictation_2561 - queue depth too large' ob_di_dictation_2561 ib_pathnet_8015}}}}
{VALUE opque} {SOURCE ob_di_dictation_2561} {MODE actual} {WITH 1} {COMP {> 200}} {FOR {nmin 10}} {WINDOW {* * * * * 6,0}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weekend__ob_di_dictation_2561 'recycling ob_di_dictation_2561 - queue depth too large' ob_di_dictation_2561 ib_pathnet_8015}}}}
{VALUE opque} {SOURCE ob_di_dictation_2561} {MODE actual} {WITH 1} {COMP {> 2000}} {FOR {nmin 10}} {WINDOW {* * 7-18 * * 1-5}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weekday__hub_team,weekday__ob_di_dictation_2561 'recycling ob_di_dictation_2561 - queue depth at 2000 msgs' ob_di_dictation_2561 ib_pathnet_8015}}}}
{VALUE opque} {SOURCE ob_di_dictation_2561} {MODE actual} {WITH 1} {COMP {> 2000}} {FOR {nmin 10}} {WINDOW {* * 0-6,19-23 * * 1-5}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weeknight__hub_team,weeknight__ob_di_dictation_2561 'recycling ob_di_dictation_2561 - queue depth at 2000 msgs' ob_di_dictation_2561 ib_pathnet_8015}}}}
{VALUE opque} {SOURCE ob_di_dictation_2561} {MODE actual} {WITH 1} {COMP {> 2000}} {FOR {nmin 10}} {WINDOW {* * * * * 6,0}} {HOST {}} {ACTION {{exec {recycle_thread_alert.ksh weekend__hub_team,weekend__ob_di_dictation_2561 'recycling ob_di_dictation_2561 - queue depth at 2000 msgs' ob_di_dictation_2561 ib_pathnet_8015}}}}
Here is the recycle_thread_alert.ksh script that cycles the thread and sends the emails and pages:
#!/usr/bin/ksh
# Begin Module Header ==============================================================================
#
#------
# Name:
#------
#
#     recycle_thread_alert.ksh
#
#---------
# Purpose:
#---------
#
#     - create a time stamped log entry in an alerts log file called
#       $HCISITEDIR/Alerts/$thread_name.log
#     - send e-mail notification to $email_addresses (which could be somebody's pager)
#       with the specified $email_subject
#     - recycle the specified $thread_name
#
#--------
# Inputs:
#--------
#
#     $1 = email_addresses
#     $2 = email_subject
#     $3 = thread_name
#     $4 = process_name
#
#-------
# Notes:
#-------
#
#     Use the alerts configurator to configure alerts to call this script.
#
#     Look at /etc/aliases to see all the sendmail email aliases.
#
#     Look in directory $HCISITEDIR/Alerts to view the alert messages in the *.log files.
#
#     Example of normal usage:
#
#         recycle_thread_alert.ksh
#             weekday__p_cbord_adt
#             'Recycling p_maxsysii because interface is not up'
#             p_cbord_adt
#             adtansils
#
#
#     hcimsiutil "Proto Status":
#
#         0 = thread is dead
#         1 = thread is opening
#         2 = thread is up
#         3 = thread is down
#
#---------
# History:
#---------
#
# 2000.04.03 Russ Ross
#     - wrote initial version.
#
# 2000.11.24 Russ Ross
#     - modified check to see if the alert is turned off to use the ls command so that
#       symbolic links would cause an alert to be turned off, for example:
#           ln -s /dev/null ib_dms_8053.off
#       this can be used as a visual aid to tell which alerts have been turned off temporarily
#
# 2002.04.02 Russ Ross
#     - modified to start the process and then the thread if the pid file does not exist
#
# 2002.09.12 Russ Ross
#     - modified to use /usr/bin/ksh instead of /bin/ksh
#
# 2003.01.03 Russ Ross
#     - added logic to send out alerts for ib threads
#         * if nothing has been received since recycling 120 seconds ago
#     - added logic to reduce unnecessary alerts by doing the following for non-ib threads
#         * wait 120 seconds after recycling the thread to see if it goes UP
#         * or see if que depth is greater than 200
#     - added logic to explode out the email_addresses and record them in the alert log file
#
# 2007.04.27 Russ Ross
#     - modified to send out alert in the body of the email instead of the subject of the email
#       because the new 2-way pagers truncate the subject line to about 35 characters
#
# End of Module Header =============================================================================
#-----------------------
# define input variables
#-----------------------
email_addresses=$1
email_subject=$2
thread_name=$3
process_name=$4
#--------------------------------------
# define functions local to this script
#--------------------------------------
function recycle_thread {
    # start the process first if it is not running (no pid file)
    if [ ! -f $HCISITEDIR/exec/processes/$process_name/pid ]; then
        hcienginerun -p $process_name
        sleep 5
    fi
    hcicmd -p $process_name -c "$thread_name pstop"
    sleep 5
    hcicmd -p $process_name -c "$thread_name pstart"
}
function log_alert {
    echo "" >>$HCISITEDIR/Alerts/$thread_name.log
    echo ================================================================================ >>$HCISITEDIR/Alerts/$thread_name.log
    echo "" >>$HCISITEDIR/Alerts/$thread_name.log
    date +"%a %b %d %Y %r ($email_subject)" >>$HCISITEDIR/Alerts/$thread_name.log
    echo "" >>$HCISITEDIR/Alerts/$thread_name.log
    echo "Below is a list of who notification of this alert was sent to:" >>$HCISITEDIR/Alerts/$thread_name.log
    echo "If there are no addresses, then only the thread was cycled but no notification was sent!" >>$HCISITEDIR/Alerts/$thread_name.log
    echo "" >>$HCISITEDIR/Alerts/$thread_name.log
}
function send_alert {
    log_alert
    # sendmail -bv expands the aliases so the log shows exactly who was notified
    sendmail -bv $email_addresses | awk '{print $1}' >>$HCISITEDIR/Alerts/$thread_name.log
    # empty subject; the alert text goes in the body (see 2007.04.27 history note)
    echo "Subject: \n$email_subject\n." | sendmail $email_addresses
}
#---------------------------------------------------------------------
# do not do anything if the alert has been toggled off for this thread
#---------------------------------------------------------------------
# if [ ! -f $HCISITEDIR/Alerts/$thread_name.off ]; then
if [ ! "`ls $HCISITEDIR/Alerts/$thread_name.off 2>/dev/null`" ]; then
    #-----------------------------------------
    # log the fact that an alert got triggered
    #-----------------------------------------
    log_alert
    #-----------------------------------------------
    # always try to recycle the thread at least once
    #-----------------------------------------------
    recycle_thread
    #-----------------------------------------------------------
    # sleep 120 seconds, then evaluate if need to send out alert
    #-----------------------------------------------------------
    sleep 120
    proto_status=`hcimsiutil -dd $thread_name | grep "^Proto Status" | awk -F: '{print $2}' | tr -d ' '`
    ob_que_depth=`hcimsiutil -dd $thread_name | grep "^OB Data QD" | awk -F: '{print $2}' | tr -d ' '`
    ib_last_received=`hcimsiutil -dd $thread_name | grep "^Proto Last Rd" | awk -F: '{print $2}' | tr -d ' '`
    #-------------------------------------------------------------
    # send out alert for ib threads
    # if nothing has been received since recycling 120 seconds ago
    #-------------------------------------------------------------
    if [[ "`echo $thread_name | colrm 3`" = "ib" ]] && [[ "$ib_last_received" = "never" ]]; then
        send_alert
        exit
    fi
    #-----------------------------------------------------------------------
    # send out alert for non-ib threads
    # if thread is still not UP or if outbound que depth is greater than 200
    # (-gt compares numerically; > inside [[ ]] would compare as strings)
    #-----------------------------------------------------------------------
    if [[ "$proto_status" != "2" ]] || [[ "$ob_que_depth" -gt 200 ]]; then
        send_alert
    fi
fi
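The `.off` check near the top of the script means an alert can be silenced for one thread without editing default.alrt. Here is a minimal sketch of helper functions for that, assuming the same `$HCISITEDIR/Alerts` directory the script uses. The function names `alert_is_off`, `alert_off`, and `alert_on` are hypothetical and not part of the original script.

```shell
#!/bin/sh
# Sketch only: toggle the <thread>.off marker that recycle_thread_alert.ksh checks.

alert_is_off() {
    # Same test the script uses: ls lists regular files and symlinks alike.
    [ -n "$(ls "$HCISITEDIR/Alerts/$1.off" 2>/dev/null)" ]
}

alert_off() {
    # A symlink to /dev/null stands out in a directory listing
    # (see the 2000.11.24 history note in the script header).
    alert_is_off "$1" || ln -s /dev/null "$HCISITEDIR/Alerts/$1.off"
}

alert_on() {
    rm -f "$HCISITEDIR/Alerts/$1.off"
}
```

For example, `alert_off ib_dms_8053` before planned downtime and `alert_on ib_dms_8053` afterwards.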
There is a problem when several alerts go off at the same instant and step on each other writing to the log file.
Fixing that is one of the many things on my wish list, but the script is good enough as is, mostly because we have many smaller sites instead of a few larger ones.
Russ Ross
RussRoss318@gmail.com
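One way the log collisions mentioned above might be reduced is to serialize the writers with a lock directory, since `mkdir` either creates the directory or fails atomically. This is a sketch of that general technique, not something taken from the posted script. The names `with_log_lock` and `append_entry`, the `.lock` suffix, and the 30-try limit are all hypothetical.

```shell
#!/bin/sh
# Sketch only: run a command while holding a per-log-file lock directory.

with_log_lock() {
    # $1 = log file; remaining args = command to run while holding the lock
    lock_dir="$1.lock"
    shift
    tries=0
    # mkdir is atomic: only one caller can create the directory at a time.
    until mkdir "$lock_dir" 2>/dev/null; do
        tries=$((tries + 1))
        if [ "$tries" -ge 30 ]; then
            echo "could not lock $lock_dir" >&2
            return 1
        fi
        sleep 1
    done
    "$@"
    rc=$?
    rmdir "$lock_dir"
    return $rc
}

append_entry() {
    # $1 = log file, $2 = text to append
    printf '%s\n' "$2" >>"$1"
}
```

Usage would look like `with_log_lock "$log" append_entry "$log" "some text"`. A script killed while holding the lock leaves the directory behind, so a real version would need some stale-lock cleanup.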
An even more proactive type of alert, developed by our newest team member (Gordon Templeton), will recycle an outbound thread based on the number of resends.
This is superior to queue depth alerts because it can be triggered even if the queue depth is only one message.
Use this proc ( tps_reset_resend_count.tcl ) as the first TPS inbound reply proc:
# Begin Module Header ==========================================================
#
# -----
# Name:
# -----
#
#     tps_reset_resend_count.tcl
#
# --------
# Purpose:
# --------
#
#     Reset counter of resent msgs.
#     Implement only in "TPS Inbound Reply"
#
# -----------
# Input Args:
# -----------
#
#     Args: tps keyedlist containing:
#
#         MODE    run mode ("start" or "run")
#         MSGID   message handle
#         ARGS
#
# ------------
# Output Args:
# ------------
#
#     Returns: tps keyed list containing dispositions
#
# ------
# Notes:
# ------
#
#     UPoC type = TPS
#
#
# --------
# History:
# --------
#
# 2008.03.03 Gordon Templeton
#     - wrote initial version
#
# 2008.06.09 Russ Ross
#     - corrected the name of the counter file from
#
#           .tps_email_resends.$HciConnName
#
#       to be
#
#           .tps_check_resend_count.$HciConnName
#
#       so it matches what is being used by TCL proc
#       tps_check_resend_count.tcl
#
# End Module Header ============================================================
proc tps_reset_resend_count { args } {
    global env HciConnName
    global resend_val
    # prefix for log messages ($module was used below without being set)
    set module "(tps_reset_resend_count/$HciConnName): "
    keylget args MODE mode              ;# What mode are we in
    switch -exact -- $mode {
        start {
        }
        run {
            keylget args CONTEXT ctx
            keylget args MSGID mh
            set returnList {}
            if {$ctx != "sms_ib_reply"} {
                echo "$module called with invalid context"
                echo "$module should be SMS INBOUND REPLY"
                echo "$module continuing msg"
                return "{CONTINUE $mh}"
            }
            #------------------------------------
            # Reply received; reset resend_ctr.
            #------------------------------------
            set resend_val [CtrResetValue .tps_check_resend_count.$HciConnName]
            #------------------------------------------------------------------------
            # Pass message to next proc in stack:
            # kill_ob_save which will null $ob_save
            # hcitpsmsgkill which will kill the reply message.
            #------------------------------------------------------------------------
            lappend returnList "CONTINUE $mh"
            return $returnList
        }
        shutdown {
            # Doing some clean-up work
        }
        default {
            echo "Unknown mode in tps_reset_resend_count: '$mode'"
            return ""                   ;# Dont know what to do
        }
    }
}
Use this proc ( tps_check_resend_count.tcl ) as the first TPS Reply generation proc:
# Begin Module Header ==========================================================
#
# -----
# Name:
# -----
#
#     tps_check_resend_count.tcl
#
# --------
# Purpose:
# --------
#
#     Count resent msgs; email alerts when msg count(s) = values defined in RESEND_COUNTS
#     Implement only in "Reply generation"
#
# -----------
# Input Args:
# -----------
#
#     Args: tps keyedlist containing:
#
#         MODE    run mode ("start" or "run")
#         MSGID   message handle
#         ARGS    keyed list of user arguments containing:
#
#             RESEND_COUNTS  : list of resend thresholds that trigger an alert notification
#                              (avoid using 1; see Notes)
#                              (default 5 10 20)
#
#             EMAIL          : email_addresses to send alert notifications to
#                              (default page_hub_on_call,email_hub_team)
#
#             EMAIL_ADDENDUM : email addendum to concatenate to email body automatically generated by this script
#                              (default "")
#             DEBUG          : debug flag Y=on
#                              (default N)
#
#     Example of usage of User ARGS:
#
#         {RESEND_COUNTS {5 10 20}}
#         {EMAIL {page_hub_on_call,email_hub_team}}
#         {EMAIL_ADDENDUM {might need to cycle Iguana NT service}}
#         {DEBUG N}
#
#
# ------------
# Output Args:
# ------------
#
#     Returns: tps keyed list containing dispositions
#
# ------
# Notes:
# ------
#
#     UPoC type = TPS
#
#     This is an MDACC proc that is independent of the recover 33 procs
#     and runs in conjunction with another MDACC proc called tps_reset_resend_count
#     that runs in the NetConfig TPS Inbound Reply stack.
#
#     Reduce false alerts by not using a resend_count threshold of 1,
#     because when cycling an outbound thread or process it is common that the first message sent will be a resend,
#     especially if the thread was recycled before receiving an ACK for the last message sent.
#
# --------
# History:
# --------
#
# 2008.03.03 Gordon Templeton
#     - wrote initial version
#
# 2008.05.14 Russ Ross
#     - added USER ARGS
#
# End Module Header ============================================================
proc tps_check_resend_count { args } {
    global env HciConnName
    global env HciSite
    global env HciSiteDir
    global resend_val
    set module "(tps_check_resend_count/$HciConnName): "
    keylget args MODE mode              ;# What mode are we in
    switch -exact -- $mode {
        start {
            #--------------------------------------------------------
            # Always initialize counter file at every thread startup.
            #--------------------------------------------------------
            CtrInitCounter .tps_check_resend_count.$HciConnName 1 99999 1
            set resend_value [CtrCurrentValue .tps_check_resend_count.$HciConnName]
            return ""
        }
        run {
            keylget args CONTEXT ctx
            keylget args MSGID mh
            set returnList {}
            if {$ctx != "reply_gen"} {
                echo ""
                echo "==========================================================================="
                echo ""
                echo "$module is being called from wrong place within Netconfig with a context of ($ctx)"
                echo "$module and should be called from within REPLY GENERATION / TIMEOUT HANDLING context of (reply_gen)"
                echo "$module so the current message will be continued without taking any other action"
                echo ""
                echo "==========================================================================="
                echo ""
                return "{CONTINUE $mh}"
            }
            #------------------
            # Increment counter
            #------------------
            set resend_val [CtrNextValue .tps_check_resend_count.$HciConnName]
            #----------------------------------------------
            # get user args, if none provided, set defaults
            #----------------------------------------------
            if {![keylget args ARGS.RESEND_COUNTS resend_counts]} {
                set resend_counts [list 5 10 20]
            }
            if {![keylget args ARGS.EMAIL email_addresses]} {
                set email_addresses page_hub_on_call,email_hub_team
            }
            if {![keylget args ARGS.EMAIL_ADDENDUM email_addendum]} {
                set email_addendum ""
            }
            if {![keylget args ARGS.DEBUG debug]} {
                set debug "N"
            }
            set debug [string toupper $debug]
            #----------------------
            # echo some debug stuff
            #----------------------
            if { "$debug" == "Y" } {
                echo ""
                echo "==========================================================================="
                echo ""
                echo "$module DEBUG INFO"
                echo ""
                echo "$module resend_val ($resend_val)"
                echo "$module resend_counts ($resend_counts)"
                echo "$module email_addresses ($email_addresses)"
                echo "$module debug ($debug)"
                echo ""
                echo "==========================================================================="
                echo ""
            }
            #------------------------------------------------------------------------
            # Pass message to next proc in stack, resend_ob_msg
            #------------------------------------------------------------------------
            lappend returnList "CONTINUE $mh"
            #-------------------------------------------------------------------------------
            # If number of resends equal one of the values in resend_counts list, send email
            #-------------------------------------------------------------------------------
            if { [lsearch -exact $resend_counts $resend_val] != -1 } {
                #----------------------------
                # set email subject and body
                #----------------------------
                set email_subject "Same message resent $resend_val times"
                set email_body "thread ($HciConnName) in site ($HciSite)"
                set email_body "$email_body\n$email_addendum"
                #-------------------------------------------------------------------
                # Append a time stamped entry to the alerts log file for this thread
                #-------------------------------------------------------------------
                set logfile "$HciSiteDir/Alerts/$HciConnName.log"
                set logfh [open $logfile a+]
                set ts [fmtclock [getclock] "%a %b %d %Y %r "]
                puts $logfh ""
                puts $logfh "================================================================================"
                puts $logfh ""
                puts $logfh "$ts($email_subject) ($email_addresses)"
                puts $logfh "Below is a list of who notification of this alert was sent to:"
                puts $logfh ""
                close $logfh
                # \$1 is escaped so Tcl passes it through to awk instead of substituting it
                system "sendmail -bv $email_addresses | awk '{print \$1}' >>$HciSiteDir/Alerts/$HciConnName.log"
                #--------------------------------------------------------
                # Send email notification of msg resend
                #--------------------------------------------------------
                system "echo \"Subject: $email_subject\n$email_body\" | sendmail $email_addresses"
                echo ""
                echo "$module WARNING - resend_count threshold of $resend_val was reached for this msg:"
                echo ""
                echo "[msgget $mh]"
                echo ""
            } ;
            return $returnList
        }
        shutdown {
            # Doing some clean-up work
        }
        default {
            echo "Unknown mode in tps_check_resend_count: '$mode'"
            return ""                   ;# Dont know what to do
        }
    } ; #end switch
} ; #end proc
Russ Ross
RussRoss318@gmail.com
Thanks for sharing those scripts. I tried them out and they work great!
Gary
In your script "tps_check_resend_count" there is a line of code for sendmail, which reads:
system "sendmail -bv $email_addresses | awk '{print \$1}' >>$HciSiteDir/Alerts/$HciConnName.log"
How does this work? I am not familiar with sendmail, as I have only used mailx.
Thanks again for sharing your code!
Gary
If you are on an AIX server that has the man pages you can type
man sendmail
You could also Google it or use whatever other means you have at your disposal.
The particular argument (-bv) in the line of code you are interested in expands $email_addresses, which are email aliases, into the individual addresses, which are then logged so we know exactly who got notified.
For example, if I run
sendmail -bv email_hub_team | awk '{print $1}'
then I get the following:
hci@localhost…
Russ Ross
RussRoss318@gmail.com
The file "email_hub_team", where does sendmail pick this file up from?
Do you put this in the tcl_procs directory?
I tried to do something similar with mailx, but I could not get the addresses to output on the same line.
I use the file /etc/aliases to define my email aliases that get used by sendmail.
I have numerous group email addresses defined like email_hub_team, page_hub_team, etc.
Typically each thread that has an alert has its own email alias that I control using what I define in the /etc/aliases file.
Let’s say I have a thread called ob_pathnet_22305, then I would have some email aliases defined in /etc/aliases that might be called
weekday__ob_pathnet_22305
weeknight__ob_pathnet_22305
weekend__ob_pathnet_22305
The weekday__ob_pathnet_22305 email alias might include these already defined email aliases:
email_hub_team
page_hub_on_call
page_pathnet_on_call
email_pathnet_support
etc.
It would be simple for me to post our /etc/aliases file but I’m a bit uncomfortable showing more email alias contact information than I’ve already done.
We have a ton of email addresses and pagers in that file, so I think you can understand me not posting it as a tangible example.
Russ Ross
RussRoss318@gmail.com
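Since the real /etc/aliases file was not posted, here is a sketch of how nested aliases of that kind resolve, using a made-up aliases file in the usual `name: member, member` format. Every address below is hypothetical, and `expand_alias` is only a rough imitation for illustration. It is not how sendmail itself resolves aliases, and on a real system `sendmail -bv` remains the way to check.

```shell
#!/bin/sh
# Sketch only: expand a (possibly nested) alias from an aliases-format file.
#   $1 = aliases file, $2 = alias name to expand
expand_alias() {
    awk -F: -v start="$2" '
        # Remember each "name: a, b, c" line, skipping comment lines.
        !/^[ \t]*#/ && NF >= 2 {
            name = $1
            gsub(/[ \t]/, "", name)
            members[name] = $2
        }
        function expand(n,    parts, count, i, m) {
            if (!(n in members)) { print n; return }  # not an alias: a final address
            if (n in seen) return                     # already expanded; also stops loops
            seen[n] = 1
            count = split(members[n], parts, ",")
            for (i = 1; i <= count; i++) {
                m = parts[i]
                gsub(/[ \t]/, "", m)
                if (m != "") expand(m)
            }
        }
        END { expand(start) }
    ' "$1"
}

# Hypothetical aliases file contents:
#   email_hub_team: alice@example.com, bob@example.com
#   page_hub_on_call: oncall@pager.example.com
#   weekday__ob_pathnet_22305: email_hub_team, page_hub_on_call
```

With that file, `expand_alias /path/to/aliases weekday__ob_pathnet_22305` prints the three individual addresses, one per line, which is the kind of list the alert scripts record in the thread's log.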
Would there be much change to this code on a non-AIX box? (Windows)
Does anyone have screenshots of this in the GUI? I too have a VPN tunnel that goes down once a day, though not always at the same time, and I would like to cycle it when 100+ messages get queued up.