to SMAT or not to SMAT…


  • Creator
    Topic
  • #51431
    Troy Morton
    Participant

      AIX 5.3 – 8 processors, 12 GB RAM, SAN storage, Cloverleaf 5.4, ~1300 threads across 20 sites.

      My team is considering removing the SMATs from our outbound threads, since we can resend those messages from the inbound threads’ SMATs.

      We also have Tcl code in our validate_HL7_ack proc that writes the original message, the sent message, and the ACK reply to a file we call the Trace file.  So we can see, in one place, the source message, the sent message, and whether it was ACKed.
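
      Roughly, the trace write looks something like the sketch below (the proc name, file path, and formatting here are illustrative, not our exact code):

          # Illustrative sketch of the trace write: append the original
          # message, the message as sent, and the ACK to one viewable file.
          # Proc name and path are hypothetical.
          proc write_trace {original sent ack} {
              # one open/write/close per message -- simple, but costly under load
              set fd [open "/hci/trace/ob_thread.trace" a]
              puts $fd "==== [clock format [clock seconds]] ===="
              puts $fd "ORIGINAL: $original"
              puts $fd "SENT:     $sent"
              puts $fd "ACK:      $ack"
              close $fd
          }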

      Pulling messages to resend from these text trace files is not fast or easy because they are formatted to be viewable in Windows.  A script could be developed to pull messages from this file, but I already have lots of great programs that I use to pull messages from the SMAT format.

      We want to remove the SMATs because our server is having some I/O-related bottleneck issues, and we’re trying to cut down the amount of disk I/O until our new hardware is installed in the next few weeks or possibly early 2010.

      Does anyone have data or experience showing that disabling SMATs has any significant impact on server I/O performance?

      Are there any gotchas that we are not thinking of that could bite us if we don’t have the SMAT files for those outbound threads?

      Thanks!

    • Author
      Replies
      • #70255
        Jim Kosloskey
        Participant

          Interesting: you have an I/O issue and you think it is SMAT, rather than your home-grown tracing activity, that is causing it?

          It would seem to me you could get your trace file (or the information it represents) after the fact by processing the SMAT files in a batch environment rather than taking the hit with each interaction.

          I wonder if it is really necessary to have the trace file be real time.
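
          For illustration, a batch pass over SMAT might look something like this (a sketch only: it assumes the classic SMAT pair of a .msg data file plus a Tcl-list .idx file whose entries carry OFFSET and DATALEN keys, and it needs hcitcl so keylget is available; check one of your real .idx files first):

              # Read the SMAT index, then pull each message out of the
              # companion .msg file by offset and length.
              set fd [open "ob_thread.idx" r]
              set entries [read $fd]
              close $fd

              set msgfd [open "ob_thread.msg" r]
              fconfigure $msgfd -translation binary
              foreach entry $entries {
                  keylget entry OFFSET offset
                  keylget entry DATALEN len
                  seek $msgfd $offset
                  puts "==== message at offset $offset ===="
                  puts [read $msgfd $len]
              }
              close $msgfd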

          email: jim.kosloskey@jim-kosloskey.com 29+ years Cloverleaf, 59 years IT - old fart.

        • #70256
          Troy Morton
          Participant

            Yes, you have a good point.

          • #70257
            Ed Mastascusa
            Participant

              Hi Troy,

              Possible gotchas to not keeping the outbound SMATs:

              If any of your XLTs or TPS procs depend upon tables that change, you may have some latency issues.  Replaying an old set of inbound data through a newer route/Tcl/table/Xlate configuration might not exactly replicate what actually happened.

              There’s also the hassle of replaying messages to only one outbound thread if you have a one-to-many route from the inbound.

            • #70258
              Russ Ross
              Participant

                The good news is that I believe your hardware is adequate to handle your load, because you are practically describing what we are running on, with a similar number of threads and a message flow of about 13 million a day.

                If the SAN, which is probably out of your control, is the cause of the disk I/O bottleneck, then a new server using the same SAN group might not be a significant improvement.

                This was a serious concern of mine, and one I ran into, when we were first forced to move from local serial disk to SAN.

                I spent the next 3 years suffering until the SAN group finally got it running fast and reliably enough.

                We have a very similar setup in hardware, OS, Cloverleaf version, and thread count.

                We do have more site granularity, but we have SMAT on just about everything and keep up nicely with a message flow of about 13 million per day, with a comfortable amount of horsepower to spare.

                My previous platform did force me to get very creative to stay afloat, and I quickly discovered disk I/O to be the single biggest bottleneck, which tends to be true of almost any platform or application.

                Since I’ve witnessed first hand that hardware of this capacity is likely adequate, I, like Jim, suspect your home-grown tracing, and I would take a look at it to see if it is opening and closing the file with every message.
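
                If it is, the usual fix is to open the file once when the thread starts and reuse the handle.  A rough sketch of the pattern (the proc name and path are made up, and the mode handling is trimmed to the essentials):

                    # Sketch: hold the trace file open across messages instead
                    # of doing an open/write/close per message.
                    proc trace_tps { args } {
                        global traceFd
                        keylget args MODE mode
                        switch -exact -- $mode {
                            start {
                                set traceFd [open "/hci/trace/ob_thread.trace" a]
                                return ""
                            }
                            run {
                                keylget args MSGID mh
                                puts $traceFd [msgget $mh]
                                return "{CONTINUE $mh}"
                            }
                            shutdown {
                                catch {close $traceFd}
                                return ""
                            }
                            default { return "" }
                        }
                    }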

                You are welcome to call me if you want to leverage my experience since I’ve already made the journey you are talking about embarking on.

                One quick thought that pays big returns toward reducing wasted resources: filter unnecessary messages as far upstream as possible.

                In fact, we don’t allow the use of static raw routes and require a route for each message type to help in this effort.
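
                Where a route per message type isn’t practical, a simple tps filter proc can do the same job.  A sketch (the kept-type list is just an example):

                    # Sketch: KILL message types we never route so they don't
                    # consume downstream I/O.  MSH-9 (message type) is at
                    # index 8 after splitting MSH on the field separator.
                    proc filter_msg_types { args } {
                        keylget args MODE mode
                        switch -exact -- $mode {
                            run {
                                keylget args MSGID mh
                                set msg [msgget $mh]
                                set sep [string index $msg 3]
                                set msh [lindex [split $msg \r] 0]
                                set msgType [lindex [split $msh $sep] 8]
                                if {[lsearch -exact {ADT^A01 ADT^A03 ORU^R01} $msgType] >= 0} {
                                    return "{CONTINUE $mh}"
                                }
                                return "{KILL $mh}"
                            }
                            default { return "" }
                        }
                    }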

                Also, make sure EO config logging is not set to “all” except during troubleshooting.

                Okay, enough, or I will get going and find it hard to stop; there are so many things that all add up.

                I’m curious: with such similar hardware, what is the largest queue depth of messages you’ve seen without the process panicking?

                Our old server tanked at about an 8,000 queue depth, but on our current server we hit over 150,000 and kept working fine, which is a much nicer cushion, though it leaves me wondering what my upper limit might be.

                Russ Ross
                RussRoss318@gmail.com

              • #70259
                Russ Ross
                Participant

                  It dawned on me that even though I’m running AIX with 8 processors and a healthy amount of RAM like you, that doesn’t mean you are running on P570s like I am.

                  In my previous post, for some reason I had drawn the conclusion that you too are running on P570s, but that might not be true.

                  So when I say your machine might be big enough to handle your load, I was talking about P570s or bigger.

                  Russ Ross
                  RussRoss318@gmail.com

                • #70260
                  Shibu Mathew
                  Participant

                    Hi Troy,

                    At my previous job, working with AIX and eGate, I found that I/O issues were usually related to server hardware; on many occasions it was the controller card.  Does the error report (errpt) on AIX show any hardware-related errors?  You could also gather statistics from “iostat” to look for SAN performance issues.
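
                    For a quick pass, something like the sketch below (hcitcl or plain tclsh) can flag busy disks.  It assumes AIX iostat’s default disk report layout (Disks, % tm_act, Kbps, tps, Kb_read, Kb_wrtn); adjust the column index on other platforms.

                        # Sketch: run iostat (2 reports, 5-second interval) and
                        # print disks whose %tm_act exceeds a threshold.
                        set report [exec iostat -d 5 2]
                        foreach line [split $report \n] {
                            if {[string match "hdisk*" [string trim $line]]} {
                                set tmAct [lindex $line 1]
                                if {$tmAct > 80.0} {
                                    puts "busy disk: $line"
                                }
                            }
                        }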

                    Thanks.

                  • #70261
                    Troy Morton
                    Participant

                      Russ, we are running 1.4 GHz processors on our current server.  The new server will be an IBM 550.

                      I’m not sure about the brand or quality of our SAN.  I do know that our new server will have 4 SAN cards and our current one only has a single card with a hot failover card.

                      I’m going to check our home-grown tracing program to see about the file open/close.  I never thought to check that before.  Thanks!!

                      Troy

                  • The forum ‘Cloverleaf’ is closed to new topics and replies.