StringSearch:RegexpDFA

This module implements a DFA based regular expression pattern matcher. In case of a successful match the longest left-most match is returned. (That is, none of the current test cases fails to report the longest match...). With a pattern length of M and a string length of N, a regular expression match can require up to O(M*N) operations, and a regular expression search up to O(M*N^2) operations.

Limitations: Due to its DFA nature, the matcher cannot support extraction of matched groups. The number of parenthesis groups in a pattern is limited to 127. The implementation is not optimized for speed, except for using Boyer-Moore for constant pattern prefixes.

Note: This description is a subset of the Python 2.1 documentation chapter `4.2.1 Regular Expression Syntax' at http://python.org/doc/2.1/lib/re-syntax.html.

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if `A' and `B' are both regular expressions, then `AB' is also an regular expression. If a string p matches `A' and another string q matches `B', the string pq will match `AB'. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here.

A brief explanation of the format of regular expressions follows. For further information and a gentler presentation, consult the Regular Expression HOWTO, accessible from http://www.python.org/doc/howto/.

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like `A', `a', or `0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so `last' matches the string "last".

Some characters, like "|" or "(", are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.

The special characters are:

`.'

(Dot.) Matches any character except a newline.

`^'

(Caret.) Matches the start of the string.

`$'

Matches the end of the string. `foo' matches both "foo" and " foobar", while the regular expression `foo$' matches only "foo".

`*'

Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. `ab*' will match "a", "ab", or "a" followed by any number of "b"s.

`+'

Causes the resulting RE to match 1 or more repetitions of the preceding RE. `ab+' will match "a" followed by any non-zero number of "b"s; it will not match just "a".

`?'

Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. `ab?' will match either "a" or "ab".

`\'

Either escapes special characters (permitting you to match characters like " *", "?", and so forth), or signals a special sequence; special sequences are discussed below.

`[...]'

Used to indicate a set of characters. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a `-'. Special characters are not active inside sets. For example, `[akm$]' will match any of the characters "a", " k", "m", or "$"; `[a-z]' will match any lowercase letter, and `[a-zA-Z0-9]' matches any letter or digit. Character classes such as `\w' or `\S' (defined below) are also acceptable inside a range. If you want to include a "]" or a "-" inside a set, precede it with a backslash, or place it as the first character. The pattern `[]]' will match "]", for example.

You can match the characters not within a range by complementing the set. This is indicated by including a `^' as the first character of the set; `^' elsewhere will simply match the "^" character. For example, `[^5]' will match any character except "5".

`|'

`A|B', where `A' and `B' can be arbitrary REs, creates a regular expression that will match either "A" or "B". An arbitrary number of REs can be separated by the `|' in this way. This can be used inside groups (see below) as well. REs separated by `|' are tried from left to right, and the first one that allows the complete pattern to match is considered the accepted branch. This means that if A matches, B will never be tested, even if it would produce a longer overall match. In other words, the `|' operator is never greedy. To match a literal " |", use `\|', or enclose it inside a character class, as in `[|]'.

`(...)'

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group. To match the literals "(" or ")", use `\(' or `\)', or enclose them inside a character class: `[(]' `[)]'.

`\d'

Matches any decimal digit; this is equivalent to the set `[0-9]'.

`\D'

Matches any non-digit character; this is equivalent to the set `[^0-9]'.

`\s'

Matches any whitespace character; this is equivalent to the set `[\t\n\r\f\v ]'.

`\S'

Matches any non-whitespace character; this is equivalent to the set `[^\t\n\r\f\v ]'.

`\w'

Matches any alphanumeric character; this is equivalent to the set `[a-zA-Z0-9_]'.

`\W'

Matches any non-alphanumeric character; this is equivalent to the set `[^a-zA-Z0-9_]'.

`\\'

Matches a literal backslash.

Import List

    Object
    Object
    RT0
    StringSearch
 
Class List
FactoryThe matcher factory.
MatchObject
Matcher
Node
Class Summary: Factory [Detail]
  +---RT0.Object
       |
       +---Object.Object
            |
            +---StringSearch.Factory
                 |
                 +--StringSearch:RegexpDFA.Factory

The matcher factory.

Constructor Summary
InitFactory(Factory)

          
Method Summary
Compile(String8, Flags): Matcher

          Compile a expression, for example a regular expression pattern, into a Matcher expression object.
Inherited Methods

From RT0.Object:

          Finalize

From Object.Object:

          Equals, HashCode, ToString

From StringSearch.Factory:

          Compile, Destroy

 
Class Summary: MatchObject [Detail]
  +---RT0.Object
       |
       +---Object.Object
            |
            +---StringSearch.MatchObject
                 |
                 +--StringSearch:RegexpDFA.MatchObject
Inherited Fields

From StringSearch.MatchObject:

          endpos, matcher, pos, string

Method Summary
End(LONGINT): LONGINT

          Returns the index of the end of the substring matched by group.
Start(LONGINT): LONGINT

          Returns the index of the start of the substring matched by group.
Inherited Methods

From RT0.Object:

          Finalize

From Object.Object:

          Equals, HashCode, ToString

From StringSearch.MatchObject:

          Destroy, End, Start

 
Class Summary: Matcher [Detail]
  +---RT0.Object
       |
       +---Object.Object
            |
            +---StringSearch.Matcher
                 |
                 +--StringSearch:RegexpDFA.Matcher
Inherited Fields

From StringSearch.Matcher:

          flags, groups, pattern

Method Summary
Match(String8, LONGINT, LONGINT): MatchObject

          Like Matcher.MatchChars, but works on an instance of Object.String8.
MatchChars(ARRAY OF CHAR, LONGINT, LONGINT): MatchObject

          Returns a corresponding MatchObject instance, if zero or more characters at the beginning of string match this Matcher.
Search(String8, LONGINT, LONGINT): MatchObject

          Like Matcher.SearchChars, but works on an instance of Object.String8.
SearchChars(ARRAY OF CHAR, LONGINT, LONGINT): MatchObject

          Scans through string looking for a location where this Matcher produces a match, and return a corresponding MatchObject instance.
Inherited Methods

From RT0.Object:

          Finalize

From Object.Object:

          Equals, HashCode, ToString

From StringSearch.Matcher:

          Destroy, Match, MatchChars, Search, SearchChars

 
Class Summary: Node [Detail]
  +--StringSearch:RegexpDFA.Node
 
Type Summary
[GroupId] = SHORTINT

          
Variable Summary
factory-: Factory

          
Constant Summary
copyString

          See StringSearch.copyString.
ignoreCase

          See StringSearch.ignoreCase.

Class Detail: Factory
Constructor Detail

InitFactory

PROCEDURE InitFactory(f: Factory)
Method Detail

Compile

PROCEDURE (f: Factory) Compile(pattern: String8; 
                  flags: Flags): Matcher

Compile a expression, for example a regular expression pattern, into a Matcher expression object. The matcher object can be used for matching using its Matcher.MatchChars and Matcher.SearchChars methods.

The pattern's behaviour can be modified by specifying a flags value. The set can include of the following variables: ignoreCase, copyString.

Result is NIL if the given pattern is invalid.

Pre-condition: The value of pattern does not contain the character 0X.

[Description inherited from Compile]

Redefines: Compile

 
Class Detail: MatchObject
Method Detail

End

PROCEDURE (m: MatchObject) End(group: LONGINT): LONGINT

Returns the index of the end of the substring matched by group. See MatchObject.Start.

[Description inherited from End]

Redefines: End


Start

PROCEDURE (m: MatchObject) Start(group: LONGINT): LONGINT

Returns the index of the start of the substring matched by group. A group of `0' refers to the whole matched substring. Returns `-1' if the group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g is `string[m.Start(g), m.End(g)['. Note that `m.Start(g)' will equal `m.End(g)' if g matched a null string.

Note: Not all matcher implementations implement groups other than the whole match.

Pre-condition: `0 <= group <= m.matcher.groups'

[Description inherited from Start]

Redefines: Start

 
Class Detail: Matcher
Method Detail

Match

PROCEDURE (matcher: Matcher) Match(string: String8; 
                pos: LONGINT; 
                endpos: LONGINT): MatchObject

Like Matcher.MatchChars, but works on an instance of Object.String8.

[Description inherited from Match]

Redefines: Match


MatchChars

PROCEDURE (matcher: Matcher) MatchChars(string: ARRAY OF CHAR; 
                     pos: LONGINT; 
                     endpos: LONGINT): MatchObject

Returns a corresponding MatchObject instance, if zero or more characters at the beginning of string match this Matcher. Returns NIL if the string does not match the pattern. Note that this is different from a zero-length match.

Note: If you want to locate a match anywhere in string, use Matcher.Search instead.

The second parameter pos gives an index in the string where the search is to start, for example 0 to start at the beginning of the string.

The parameter endpos limits how far the string will be searched. It will be as if the string is endpos characters long, so only the characters in `[pos, endpos[' will be searched for a match. A value of `-1' is equivalent to an endpos of `Length(string)'.

Pre-condition: The start position is within the string `0 <= pos <= Length(string)', and the given end position is either `-1', or between the start position and the end of the string `pos <= endpos <= Length(string)'.

[Description inherited from MatchChars]

Redefines: MatchChars


Search

PROCEDURE (matcher: Matcher) Search(string: String8; 
                 pos: LONGINT; 
                 endpos: LONGINT): MatchObject

Like Matcher.SearchChars, but works on an instance of Object.String8.

[Description inherited from Search]

Redefines: Search


SearchChars

PROCEDURE (matcher: Matcher) SearchChars(string: ARRAY OF CHAR; 
                      pos: LONGINT; 
                      endpos: LONGINT): MatchObject

Scans through string looking for a location where this Matcher produces a match, and return a corresponding MatchObject instance. Returns NIL if no position in the string matches the pattern. Note that this is different from finding a zero-length match at some point in the string.

The pos and endpos parameters have the same meaning as for the Matcher.MatchChars method.

[Description inherited from SearchChars]

Redefines: SearchChars

 
Class Detail: Node
 
Type Detail

GroupId

TYPE [GroupId] = SHORTINT
Variable Detail

factory

VAR factory-: Factory
Constant Detail

copyString

CONST copyString 

See StringSearch.copyString.


ignoreCase

CONST ignoreCase 

See StringSearch.ignoreCase.