
Regular expressions library

Started by gabor, November 28, 2006, 08:38:27 AM


gabor

Hello friends!

Some time ago PBrennick asked me in a PM, referring to an earlier PM, whether I still feel like creating a library for regular expression based matching. Well, I am still interested, but at first sight this is not an easy task, so I'd like to have some help. I am thinking of comments, discussions and ideas in the first place.
As an opening I'd like to share the small specification I put together while looking up Unix-like regular expressions on the net.



Unix-like regular expression processing

1. Basics

  • Regular expressions are between '/' characters.

2. Modifiers

  • i - case-insensitive match
  • g - global match
  • x - ignore whitespace
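
To make the three modifiers concrete, here is a small sketch using Python's built-in re module purely as a stand-in (it is not the library discussed here). Python spells the modifiers as flags instead of trailing letters and has no flag for g, so findall plays that role below. The sketch also assumes that x means what it does in Perl: whitespace inside the pattern is ignored.

```python
import re

text = "cat Cat cat"

# i: case-insensitive match, so 'Cat' is found as well
insensitive = re.findall(r"cat", text, re.IGNORECASE)

# without g: only the first match is reported
first_only = re.search(r"cat", text).group(0)

# g: every match is reported (findall stands in for the g modifier)
all_matches = re.findall(r"cat", text)

# x: whitespace written inside the pattern is ignored
spaced_pattern_matches = re.fullmatch(r"c a t", "cat", re.VERBOSE) is not None
```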

3. Metacharacters must be escaped with \ to be matched literally; otherwise they have a special interpretation. (/\\/ matches a '\' character)

  • \ | () [ { ^ $ * + ? .
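
A minimal sketch of what escaping means in practice, again with Python's re module as a stand-in. The helper name escape_literal is hypothetical, made up for this illustration; it simply prefixes each metacharacter from the list above with a backslash.

```python
import re

# Metacharacters exactly as listed in the specification above.
METACHARACTERS = set("\\|()[{^$*+?.")


def escape_literal(text: str) -> str:
    """Prefix every metacharacter with a backslash (hypothetical helper)."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)


wanted = "1+1=2?"

# Unescaped, '+' and '?' act as quantifiers, so the text does not match itself.
unescaped_fails = re.fullmatch(wanted, wanted) is None

# Escaped, every character is taken literally.
escaped_pattern = escape_literal(wanted)
escaped_matches = re.fullmatch(escaped_pattern, wanted) is not None

# The /\\/ example from the spec: an escaped backslash matches one '\'.
backslash_matches = re.fullmatch(r"\\", "\\") is not None
```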

4. Special characters

  • \d - a digit
  • \D - a non-digit
  • \w - a word character (alphanumeric)
  • \W - a non-word character
  • \t - tabulator
  • \n - line feed
  • \r - carriage return
  • \s - a whitespace character (\t,\n,\r,' ')
  • \S - a non-whitespace character
  • \b - word boundary
  • \B - non-word boundary
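
A short illustration of the special characters, using Python's re module as a stand-in. Note two assumptions: Python's \w also accepts the underscore (and non-ASCII letters unless re.ASCII is given), and its \s accepts form feed and vertical tab in addition to the four characters listed above, so it is slightly broader than this spec.

```python
import re

sample = "Order 66:\tship 3 crates"

# \d+ picks out runs of digits
digits = re.findall(r"\d+", sample)

# \w+ picks out runs of word characters (ASCII only here)
words = re.findall(r"\w+", sample, re.ASCII)

# \s+ splits on runs of whitespace (the tab counts too)
fields = re.split(r"\s+", sample)

# \b requires a word boundary, \B forbids one
whole_word = re.findall(r"\bship\b", "ship shipping ship")
word_prefix = re.findall(r"ship\B", "ship shipping ship")
```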

5. Matching

  • ^ - matches beginning of a string
  • $ - matches end of a string
  • . - matches any character
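
A small sketch of the anchors, with Python's re module as a stand-in. By default Python anchors ^ and $ to the whole string, as in the spec above; the per-line behaviour shown at the end is an optional mode (re.MULTILINE) and is only included as a design option. Also note that Python's '.' skips the newline character unless asked otherwise, which is narrower than "any character".

```python
import re

text = "first line\nsecond line"

# ^ anchors to the beginning of the string
starts_with_first = re.search(r"^first", text) is not None
starts_with_second = re.search(r"^second", text) is not None

# $ anchors to the end of the string: only the last 'line' qualifies
line_at_end = re.findall(r"line$", text)

# Optional per-line mode: ^ and $ also match at every line break
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"line$", text, re.MULTILINE)

# . matches any single character (except newline in Python's default mode)
dot_matches = re.fullmatch(r"a.c", "abc") is not None
dot_skips_newline = re.fullmatch(r"a.c", "a\nc") is None
```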

6. Quantifying

  • * - 0 or more times (Ex. /a*/ matches '','a','aa'...)
  • + - 1 or more times (Ex. /a+/ matches 'a','aa'...)
  • ? - 0 or 1 time (Ex. /a?/ matches '' or 'a' only)
  • {n} - exactly n times (Ex. /a{5}/: 'aaaaa')
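
The quantifier examples above can be checked with a few lines of Python (re module as a stand-in; the helper name full is made up for this sketch). Each pattern is tested against the whole input, so '' means the empty string.

```python
import re


def full(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


inputs = ("", "a", "aa")

star = [full(r"a*", s) for s in inputs]      # 0 or more times
plus = [full(r"a+", s) for s in inputs]      # 1 or more times
optional = [full(r"a?", s) for s in inputs]  # 0 or 1 time
exact = [full(r"a{5}", s) for s in ("aaaa", "aaaaa", "aaaaaa")]  # exactly 5
```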

7. Misc

  • [...] - characters in the list (Ex. /[abc0-9]/ matches 'a','b','c' and digits)
  • [^...] - characters that are not in the list (Ex. /[^a-zA-Z]/ does not match letters)
  • | - alternation/or (Ex. /a|b/ matches 'a' or 'b')
  • () - grouping, creates an atom that can also be referenced later
    (Ex. /(ab){2}/ matches 'abab', but /ab{2}/ matches 'abb'; /(ab)cd\1/ matches 'abcdab')
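
The examples in this section, tried out with Python's re module as a stand-in. The difference between /(ab){2}/ and /ab{2}/ is the part that is easiest to get wrong in an implementation, so it is spelled out here.

```python
import re

# [...] matches any one character from the list
in_class = re.findall(r"[abc0-9]", "a-z, 42")

# [^...] matches any one character NOT in the list
not_letters = re.findall(r"[^a-zA-Z]", "ab1 C")

# | chooses between alternatives
alternation = [re.fullmatch(r"a|b", s) is not None for s in ("a", "b", "c")]

# {2} applies to the group (ab) in one case, only to 'b' in the other
group_repeat = re.fullmatch(r"(ab){2}", "abab") is not None
char_repeat = re.fullmatch(r"ab{2}", "abb") is not None
char_repeat_on_abab = re.fullmatch(r"ab{2}", "abab") is not None

# \1 refers back to whatever the first group matched
backref_ok = re.fullmatch(r"(ab)cd\1", "abcdab") is not None
backref_bad = re.fullmatch(r"(ab)cd\1", "abcdxy") is not None
```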



I hope this specification covers as much functionality as possible.
My problem is that this is where my knowledge nearly ends :toothy. I do have an idea, but I am afraid it is not that good...
How would you implement regular expressions? Do you know of a resource on this topic on the net? Of course, fast matching is a requirement here...
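
As one possible starting point for the discussion, here is a minimal sketch of a backtracking matcher, written in Python only for readability. It is an illustration under stated assumptions, not a proposed design: it handles just literals, '.', '^', '$', '*', '+' and '?', with no escaping, classes, groups or alternation, and all function names are made up for this sketch. Backtracking is simple but can become very slow on some patterns; the usual alternative when speed matters is to compile the pattern into a finite automaton.

```python
def match(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    if pattern.startswith("^"):
        return match_here(pattern[1:], text)
    # Try every starting position, including the empty tail.
    return any(match_here(pattern, text[i:]) for i in range(len(text) + 1))


def match_here(pattern: str, text: str) -> bool:
    """Return True if the pattern matches at the start of the text."""
    if not pattern:
        return True
    if pattern == "$":
        return not text
    if len(pattern) > 1 and pattern[1] in "*+?":
        return match_repeat(pattern[0], pattern[1], pattern[2:], text)
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False


def match_repeat(char: str, op: str, rest: str, text: str) -> bool:
    """Match char repeated according to op, followed by the rest."""
    low = 1 if op == "+" else 0            # minimum number of repetitions
    high = 1 if op == "?" else len(text)   # maximum number of repetitions
    # Greedily count how many leading characters could be consumed.
    count = 0
    while count < high and count < len(text) and char in (".", text[count]):
        count += 1
    # Backtrack from the longest candidate down to the minimum.
    for used in range(count, low - 1, -1):
        if match_here(rest, text[used:]):
            return True
    return False
```

For example, match("^a+$", "aaa") succeeds while match("^a+$", "") fails, and match("b.d", "abcd") succeeds because unanchored patterns may start anywhere in the text.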

Greets, Gábor

stanhebben

Hmm, I'll take a look at it tomorrow. At the uni, we learned quite a bit about regular expressions last year, so maybe that could help.

PBrennick

Gábor,
That list looks pretty comprehensive. Of special interest to me are the metacharacters, especially ^ and $, which I use almost every day while manipulating text files. They represent the beginning of a line and the end of a line.

Paul
The GeneSys Project is available from:
The Repository or My crappy website