Does TCL regexp "non-capturing" syntax work?

Tagged: regexp non capturing

This topic has 9 replies, 3 voices, and was last updated 5 years, 2 months ago by Charlie Bursell.

April 10, 2020 at 3:46 pm #116327
Peter Heggie
Participant
Given this string:

set s1 "<item>Provider/Practice : UHCC (UHCC Adult Care)</item>"

execute the regexp:

regexp {(?:.*Practice.*)\(.*\)(?:\</item\>)} "$s1" match

result is 1

examine the match contents:

echo $match

result:

<item>Provider/Practice : UHCC (UHCC Adult Care)</item>

The non-capturing tokens should remove all but “(UHCC Adult Care)”, but I’m getting everything. How can I use non-capturing syntax in regexp?
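
For reference, a minimal sketch of what a non-capturing group does in Tcl, assuming a plain tclsh (hence `puts` rather than `echo`); the variable names `whole` and `inner` are illustrative only. `(?:...)` merely keeps a group from being assigned to a sub-match variable; the text it matches still counts toward the overall match:

```tcl
set s1 "<item>Provider/Practice : UHCC (UHCC Adult Care)</item>"

# Two non-capturing groups around one capturing group.
# whole receives the full match; inner receives the only capturing group.
regexp {(?:.*Practice.*)(\(.*\))(?:</item>)} $s1 whole inner

puts $whole   ;# <item>Provider/Practice : UHCC (UHCC Adult Care)</item>
puts $inner   ;# (UHCC Adult Care)
```

So the way to keep only part of the text is to capture that part and read the sub-match variable, not the full match.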

Peter Heggie
PeterHeggie@crouse.org

Replies
- April 11, 2020 at 1:42 am #116328
  Charlie Bursell
  Participant
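  You are using non-capturing groups, (?:...), and whatever they match still ends up in the overall match. Look-ahead consumes nothing, so try negative look-ahead, (?!...). Note that in (?!:...) the colon is a literal character.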
  
  The following returns: (UHCC Adult Care)
  
  regexp {(?!:.*Practice.*)\(.*\)(?!:\</item\>)} "$s1" match
  
  You could remove the parentheses.
  
  Look-ahead in regular expressions is tricky. If you want just the "UHCC Adult Care" from this string, I would write it like this (it assumes you only want what is inside the parentheses):
  
  regexp -- {\((.*?)\)} $s1 {} match
  
  Note that the {} at the end throws away the full match, while match will contain UHCC Adult Care.
  
  I use look-ahead only when forced to :=)
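  
  The `{}` idiom above can be sketched as a runnable snippet, assuming a plain tclsh (`puts` instead of `echo`); `found` and `inner` are illustrative names:

  ```tcl
  set s1 "<item>Provider/Practice : UHCC (UHCC Adult Care)</item>"

  # -- ends the switches, so a pattern that starts with "-" is never
  # mistaken for an option.
  # {} is a throwaway variable for the full match; inner gets group 1.
  set found [regexp -- {\((.*?)\)} $s1 {} inner]

  puts $found   ;# 1
  puts $inner   ;# UHCC Adult Care
  ```

  The escaped `\(` and `\)` match the literal parentheses, and the unescaped pair around `.*?` is the capturing group.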
- April 11, 2020 at 10:18 am #116329
  Peter Heggie
  Participant
  Thank you! you are right – I do not know how to use look-ahead – I will use your syntax.. 🙂
  
  I usually just muddle through regular expressions.. but often I hit land mines.
  
  Peter
  
  Peter Heggie
  PeterHeggie@crouse.org
- April 11, 2020 at 5:14 pm #116330
  Jim Kosloskey
  Participant
  Seems to me this could be done with string functions.
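  
  A minimal sketch of the string-function approach, under the assumption that the wanted text is always the last parenthesized group in the line; the names `open`, `close` and `practice` are illustrative:

  ```tcl
  set s2 "<item>Provider/Practice : SGI, Syracuse Gastroenterology (Syracuse Gastroenterology)</item>"

  # Locate the last "(" and the last ")" in the string.
  set open  [string last "(" $s2]
  set close [string last ")" $s2]

  if {$open >= 0 && $close > $open} {
      # Take everything strictly between the two parentheses.
      set practice [string range $s2 [expr {$open + 1}] [expr {$close - 1}]]
  } else {
      set practice ""
  }

  puts $practice   ;# Syracuse Gastroenterology
  ```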
  
  email: jim.kosloskey@jim-kosloskey.com 30+ years Cloverleaf, 60 years IT – old fart.
- April 14, 2020 at 10:24 am #116369
  Peter Heggie
  Participant
  Jim, yes this could have been done with string functions; I’m saving three or four lines of code by using regular expressions. It is a trade-off when considering everyone’s skill set vs. learning and writing more with regular expressions. Right now I need to get better at regex because we have another product that parses and returns data from PDFs based on regular expressions, but I also just want to know regex better.
  
  Charlie, I got the regex to work using your negative look-ahead example. I also used your syntax to remove the parentheses by matching on the opening and closing parens, but then throwing out all but the (second) matching string (I needed the additional beginning and ending text strings because parentheses can also be found in other areas of the discharge instructions):
  
  given string s2 of:
  
  <item>Provider/Practice : SGI, Syracuse Gastroenterology (Syracuse Gastroenterology)</item>
  
  regex of:
  
  regexp {(?!:<item>Provider/Practice.*)\((.*?)\)(?!:.*item>)} "$s2" {} match
  
  returns 1 and $match = Syracuse Gastroenterology
  
  Thank you
  
  Peter Heggie
  PeterHeggie@crouse.org
  - April 15, 2020 at 1:58 pm #116397
    Jim Kosloskey
    Participant
    Peter,
    
    I was not implying that string functions should be preferred over regexp, but rather making sure anyone following the post is aware that this can be accomplished without regexp, should they not feel comfortable using it.
    
    email: jim.kosloskey@jim-kosloskey.com 30+ years Cloverleaf, 60 years IT – old fart.
- April 15, 2020 at 3:32 am #116384
  Charlie Bursell
  Participant
  If you really want to use look-ahead/look behind remember they create a *lot* of overhead and are arcane to the casual user that may have to maintain this later.
  
  I prefer to keep it simple like:
  regexp -- {<item>Provider/Practice.*?\((.*?)\)} $s2 {} match
  
  If doing something like you did, at least provide comments as to exactly what you are doing, or use the -expanded switch with the regex to explain it.
  
  Example:
  
  set x http://www.infor.com
  
  regexp -expanded -- {
  ^ # beginning of string
  [^:]+ # all characters to the first colon
  (?= # begin positive lookahead
  .*\.com$ # for a trailing .com
  ) # end positive lookahead
  } $x match
  => 1
  echo $match
  => http
  
  Is much better understood than:
  
  regexp {^[^:]+(?=.*\.com$)} $x match
  => 1
  echo $match
  => http
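  
  As a side-by-side sketch of the three constructs that came up in this thread, assuming a plain tclsh: in Tcl's advanced regular expressions `(?:...)` is a non-capturing group, `(?=...)` is positive look-ahead and `(?!...)` is negative look-ahead. The variable names are illustrative:

  ```tcl
  set x http://www.infor.com

  # (?:...) groups without capturing; its text is still part of the match.
  regexp {^(?:[^:]+)://(.*)$} $x whole host
  puts $whole    ;# http://www.infor.com
  puts $host     ;# www.infor.com

  # (?=...) positive look-ahead: "://" must follow, but is not consumed.
  regexp {^[^:]+(?=://)} $x scheme
  puts $scheme   ;# http

  # (?!...) negative look-ahead: succeeds only if the pattern does NOT follow.
  set notHttps [regexp {^http(?!s)} $x]   ;# 1, "http" is followed by ":"
  set notColon [regexp {^http(?!:)} $x]   ;# 0, "http" IS followed by ":"
  puts "$notHttps $notColon"
  ```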
- April 15, 2020 at 8:47 am #116387
  Peter Heggie
  Participant
  ok I can see what you are doing with removing the non-capturing components – as long as the only ‘group’, in parenthesis, is the text I want, then by using the null {} return-variable specifier, then only the group(s) are returned, each going to their own specified (sub) variable. and the first subvar ( ‘match’) points to the first group captured (the only group).
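  
  A small sketch of that variable assignment with two groups instead of one, assuming a plain tclsh; the pattern and the names `provider` and `practice` are illustrative and not taken from the posts above:

  ```tcl
  set s2 "<item>Provider/Practice : SGI, Syracuse Gastroenterology (Syracuse Gastroenterology)</item>"

  # {} discards the full match; each capturing group then fills the
  # next variable in order: group 1 -> provider, group 2 -> practice.
  regexp -- {Practice : (.*?) \((.*?)\)</item>} $s2 {} provider practice

  puts $provider   ;# SGI, Syracuse Gastroenterology
  puts $practice   ;# Syracuse Gastroenterology
  ```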
  
  Why is the question mark there? I thought it specified that the preceding character or group was optional? Apparently I need it in both places, otherwise the group fails to match.
  
  Peter Heggie
  PeterHeggie@crouse.org
- April 15, 2020 at 2:26 pm #116398
  Peter Heggie
  Participant
  Jim – sure, understood. Actually I use a lot of string functions all over the place – I can pretty much build anything I want with them. I wrote a crude XML parser with string functions (talk about memory leaks!) and still use some for CCDA processing. In this situation, I thought I could use less code. It also happens to include a ‘start position’ option and an ‘indices’ option to tell me where in the text block my string is, which saves a few more lines also as I iterate through the text block looking for the next string match.
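  
  The iteration described here might look roughly like the following sketch, assuming a plain tclsh and a made-up two-item text block; `text`, `start` and `found` are illustrative names. With `-indices` the match variables hold first/last character positions rather than text, and `-start` sets the offset at which the search begins:

  ```tcl
  set text "<item>A (one)</item><item>B (two)</item>"
  set start 0
  set found {}

  while {[regexp -indices -start $start -- {\((.*?)\)} $text whole inner]} {
      # inner holds the first and last index of group 1 within text.
      lappend found [string range $text [lindex $inner 0] [lindex $inner 1]]
      # Resume the search just past the end of this match.
      set start [expr {[lindex $whole 1] + 1}]
  }

  puts $found   ;# one two
  ```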
  
  Peter Heggie
  PeterHeggie@crouse.org
- April 16, 2020 at 12:55 am #116409
  Charlie Bursell
  Participant
  The question mark after .* makes the match non-greedy: it stops at the first (shortest) possible match. By default, the match is greedy, which means it will make the longest possible match. In your case, since there is only one set of parentheses, it was not really needed.
  
  Warning: do not mix greedy and non-greedy quantifiers; it can return weird results.
  
  In the examples below the first RE is greedy, and the second is non-greedy:
  
  set x {He sits, but she stands.}
  
  regexp -- {.*\s} $x match; echo $match
  => He sits, but she
  
  regexp -- {.*?\s} $x match; echo $match
  => He
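  
  The same contrast as a self-contained sketch, assuming a plain tclsh and using a visible terminating character so the difference shows in the output; `greedy` and `lazy` are illustrative names:

  ```tcl
  set x {He sits, but she stands.}

  # Greedy: .* runs as far as it can while still leaving an "s" to match.
  regexp -- {.*s} $x greedy
  puts $greedy   ;# He sits, but she stands

  # Non-greedy: .*? stops at the first "s" it can reach.
  regexp -- {.*?s} $x lazy
  puts $lazy     ;# He s
  ```

  According to the Tcl re_syntax manual, a branch takes its greedy or non-greedy preference from the first quantified atom in it that has one, which is one reason mixing the two styles can give surprising results.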
  
  FYI: there are many regular expression tutorials available in Youtube.