SUMMARY: MODULE | CLASS | TYPE | PROC | VAR | CONST | DETAIL: TYPE | PROC | VAR | CONST |
This module implements a DFA based regular expression pattern matcher. In case of a successful match the longest left-most match is returned. (That is, none of the current test cases fails to report the longest match...). With a pattern length of M and a string length of N, a regular expression match can require up to O(M*N) operations, and a regular expression search up to O(M*N^2) operations.
Limitations: Due to its DFA nature, the matcher cannot support extraction of matched groups. The number of parenthesis groups in a pattern is limited to 127. The implementation is not optimized for speed, except for using Boyer-Moore for constant pattern prefixes.
Note: This description is a subset of the Python 2.1 documentation chapter `4.2.1 Regular Expression Syntax' at http://python.org/doc/2.1/lib/re-syntax.html.
A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
Regular expressions can be concatenated to form new regular expressions; if `A' and `B' are both regular expressions, then `AB' is also an regular expression. If a string p matches `A' and another string q matches `B', the string pq will match `AB'. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here.
A brief explanation of the format of regular expressions follows. For further information and a gentler presentation, consult the Regular Expression HOWTO, accessible from http://www.python.org/doc/howto/.
Regular expressions can contain both special and ordinary characters. Most ordinary characters, like `A', `a', or `0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so `last' matches the string "last".
Some characters, like "|" or "(", are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.
The special characters are:
`.' | (Dot.) Matches any character except a newline. |
`^' | (Caret.) Matches the start of the string. |
`$' | Matches the end of the string. `foo' matches both "foo" and " foobar", while the regular expression `foo$' matches only "foo". |
`*' | Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. `ab*' will match "a", "ab", or "a" followed by any number of "b"s. |
`+' | Causes the resulting RE to match 1 or more repetitions of the preceding RE. `ab+' will match "a" followed by any non-zero number of "b"s; it will not match just "a". |
`?' | Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. `ab?' will match either "a" or "ab". |
`\' | Either escapes special characters (permitting you to match characters like " *", "?", and so forth), or signals a special sequence; special sequences are discussed below. |
`[...]' | Used to indicate a set of characters. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a `-'. Special characters are not active inside sets. For example, `[akm$]' will match any of the characters "a", " k", "m", or "$"; `[a-z]' will match any lowercase letter, and `[a-zA-Z0-9]' matches any letter or digit. Character classes such as `\w' or `\S' (defined below) are also acceptable inside a range. If you want to include a "]" or a "-" inside a set, precede it with a backslash, or place it as the first character. The pattern `[]]' will match "]", for example. You can match the characters not within a range by complementing the set. This is indicated by including a `^' as the first character of the set; `^' elsewhere will simply match the "^" character. For example, `[^5]' will match any character except "5". |
`|' | `A|B', where `A' and `B' can be arbitrary REs, creates a regular expression that will match either "A" or "B". An arbitrary number of REs can be separated by the `|' in this way. This can be used inside groups (see below) as well. REs separated by `|' are tried from left to right, and the first one that allows the complete pattern to match is considered the accepted branch. This means that if A matches, B will never be tested, even if it would produce a longer overall match. In other words, the `|' operator is never greedy. To match a literal " |", use `\|', or enclose it inside a character class, as in `[|]'. |
`(...)' | Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group. To match the literals "(" or ")", use `\(' or `\)', or enclose them inside a character class: `[(]' `[)]'. |
`\d' | Matches any decimal digit; this is equivalent to the set `[0-9]'. |
`\D' | Matches any non-digit character; this is equivalent to the set `[^0-9]'. |
`\s' | Matches any whitespace character; this is equivalent to the set `[\t\n\r\f\v ]'. |
`\S' | Matches any non-whitespace character; this is equivalent to the set `[^\t\n\r\f\v ]'. |
`\w' | Matches any alphanumeric character; this is equivalent to the set `[a-zA-Z0-9_]'. |
`\W' | Matches any non-alphanumeric character; this is equivalent to the set `[^a-zA-Z0-9_]'. |
`\\' | Matches a literal backslash. |
Object Object RT0 StringSearch
Class List | |
Factory | The matcher factory. |
MatchObject | |
Matcher | |
Node |
Class Summary: Factory [Detail] | |
+---RT0.Object | +---Object.Object | +---StringSearch.Factory | +--StringSearch:RegexpDFA.Factory The matcher factory. | |
Constructor Summary | |
InitFactory(Factory) | |
Method Summary | |
Compile(String8, Flags): Matcher Compile a expression, for example a regular expression pattern, into a Matcher expression object. | |
Inherited Methods | |
From RT0.Object: From Object.Object: From StringSearch.Factory: |
Class Summary: MatchObject [Detail] | |
+---RT0.Object | +---Object.Object | +---StringSearch.MatchObject | +--StringSearch:RegexpDFA.MatchObject | |
Inherited Fields | |
From StringSearch.MatchObject: | |
Method Summary | |
End(LONGINT): LONGINT Returns the index of the end of the substring matched by group. | |
Start(LONGINT): LONGINT Returns the index of the start of the substring matched by group. | |
Inherited Methods | |
From RT0.Object: From Object.Object: From StringSearch.MatchObject: |
Class Summary: Matcher [Detail] | |
+---RT0.Object | +---Object.Object | +---StringSearch.Matcher | +--StringSearch:RegexpDFA.Matcher | |
Inherited Fields | |
From StringSearch.Matcher: | |
Method Summary | |
Match(String8, LONGINT, LONGINT): MatchObject Like Matcher.MatchChars, but works on an instance of Object.String8. | |
MatchChars(ARRAY OF CHAR, LONGINT, LONGINT): MatchObject Returns a corresponding MatchObject instance, if zero or more characters at the beginning of string match this Matcher. | |
Search(String8, LONGINT, LONGINT): MatchObject Like Matcher.SearchChars, but works on an instance of Object.String8. | |
SearchChars(ARRAY OF CHAR, LONGINT, LONGINT): MatchObject Scans through string looking for a location where this Matcher produces a match, and return a corresponding MatchObject instance. | |
Inherited Methods | |
From RT0.Object: From Object.Object: From StringSearch.Matcher: |
Class Summary: Node [Detail] | |
+--StringSearch:RegexpDFA.Node |
Type Summary | |
[GroupId] = SHORTINT |
Variable Summary | |
factory-: Factory |
Constant Summary | |
copyString See StringSearch.copyString. | |
ignoreCase See StringSearch.ignoreCase. |
Class Detail: Factory |
Constructor Detail |
PROCEDURE InitFactory(f: Factory)
Method Detail |
PROCEDURE (f: Factory) Compile(pattern: String8; flags: Flags): Matcher
Compile a expression, for example a regular expression pattern, into a Matcher expression object. The matcher object can be used for matching using its Matcher.MatchChars and Matcher.SearchChars methods.
The pattern's behaviour can be modified by specifying a flags value. The set can include of the following variables: ignoreCase, copyString.
Result is NIL if the given pattern is invalid.
Pre-condition: The value of pattern does not contain the character 0X.
[Description inherited from Compile]
Redefines: Compile
Class Detail: MatchObject |
Method Detail |
PROCEDURE (m: MatchObject) End(group: LONGINT): LONGINT
Returns the index of the end of the substring matched by group. See MatchObject.Start.
[Description inherited from End]
Redefines: End
PROCEDURE (m: MatchObject) Start(group: LONGINT): LONGINT
Returns the index of the start of the substring matched by group. A group of `0' refers to the whole matched substring. Returns `-1' if the group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g is `string[m.Start(g), m.End(g)['. Note that `m.Start(g)' will equal `m.End(g)' if g matched a null string.
Note: Not all matcher implementations implement groups other than the whole match.
Pre-condition: `0 <= group <= m.matcher.groups'
[Description inherited from Start]
Redefines: Start
Class Detail: Matcher |
Method Detail |
PROCEDURE (matcher: Matcher) Match(string: String8; pos: LONGINT; endpos: LONGINT): MatchObject
Like Matcher.MatchChars, but works on an instance of Object.String8.
[Description inherited from Match]
Redefines: Match
PROCEDURE (matcher: Matcher) MatchChars(string: ARRAY OF CHAR; pos: LONGINT; endpos: LONGINT): MatchObject
Returns a corresponding MatchObject instance, if zero or more characters at the beginning of string match this Matcher. Returns NIL if the string does not match the pattern. Note that this is different from a zero-length match.
Note: If you want to locate a match anywhere in string, use Matcher.Search instead.
The second parameter pos gives an index in the string where the search is to start, for example 0 to start at the beginning of the string.
The parameter endpos limits how far the string will be searched. It will be as if the string is endpos characters long, so only the characters in `[pos, endpos[' will be searched for a match. A value of `-1' is equivalent to an endpos of `Length(string)'.
Pre-condition: The start position is within the string `0 <= pos <= Length(string)', and the given end position is either `-1', or between the start position and the end of the string `pos <= endpos <= Length(string)'.
[Description inherited from MatchChars]
Redefines: MatchChars
PROCEDURE (matcher: Matcher) Search(string: String8; pos: LONGINT; endpos: LONGINT): MatchObject
Like Matcher.SearchChars, but works on an instance of Object.String8.
[Description inherited from Search]
Redefines: Search
PROCEDURE (matcher: Matcher) SearchChars(string: ARRAY OF CHAR; pos: LONGINT; endpos: LONGINT): MatchObject
Scans through string looking for a location where this Matcher produces a match, and return a corresponding MatchObject instance. Returns NIL if no position in the string matches the pattern. Note that this is different from finding a zero-length match at some point in the string.
The pos and endpos parameters have the same meaning as for the Matcher.MatchChars method.
[Description inherited from SearchChars]
Redefines: SearchChars
Class Detail: Node |
Type Detail |
TYPE [GroupId] = SHORTINT
Variable Detail |
VAR factory-: Factory
Constant Detail |
CONST copyString
CONST ignoreCase