Regular-expression constructs

The regular expressions (regex for short) used in searches and segmentation rules are those supported by Java. For more specific information, consult http://java.sun.com/j2se/1.5/docs/api/java/util/regex/Pattern.html. Additional references and examples are given below.

Flags

Characters

Quotation

Classes for Unicode blocks and categories

Character classes

Predefined character classes

Boundary matchers

Greedy quantifiers

Reluctant (non-greedy) quantifiers

Logical operators

Regex tools and examples of use

    

 


The construct...

...matches the following:


Flags

(?i)

Enables case-insensitive matching (by default, the pattern is case-sensitive).
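
The effect of the flag can be sketched in plain Java, outside of OmegaT. The class and method names below (CaseFlagDemo, found) are made up for this illustration and are not part of OmegaT.

```java
import java.util.regex.Pattern;

class CaseFlagDemo {
    // Returns true if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Without the flag, matching is case-sensitive.
        System.out.println(found("omegat", "OmegaT"));     // false
        // With (?i) at the start, letter case is ignored.
        System.out.println(found("(?i)omegat", "OmegaT")); // true
    }
}
```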


Characters

x

The character x, except the following...

\uhhhh

The character with hexadecimal value 0xhhhh

\t

The tab character ('\u0009')

\n

The newline (line feed) character ('\u000A')

\r

The carriage-return character ('\u000D')

\f

The form-feed character ('\u000C')

\a

The alert (bell) character ('\u0007')

\e

The escape character ('\u001B')

\cx

The control character corresponding to x

\0n

The character with octal value 0n (0 <= n <= 7)

\0nn

The character with octal value 0nn (0 <= n <= 7)

\0mnn

The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)

\xhh

The character with hexadecimal value 0xhh


Quotation

\

Nothing, but quotes the following character. This is required if you want one of the meta characters !$()*+.<>?[\]^{|} to match as itself.

\\

For example, this matches the backslash character itself

\Q

Nothing, but quotes all characters until \E

\E

Nothing, but ends quoting started by \Q
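
As a rough illustration of quoting, the following sketch uses plain Java with illustrative names (QuotingDemo, matches). Note that Java source code doubles every backslash inside a string literal; in an OmegaT search field or segmentation rule you would type the single-backslash form shown in the comments.

```java
import java.util.regex.Pattern;

class QuotingDemo {
    // Returns true if the regex matches the whole text.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Regex 1.5 : the unquoted dot matches any character.
        System.out.println(matches("1.5", "1x5"));             // true
        // Regex 1\.5 : the quoted dot matches only a literal dot.
        System.out.println(matches("1\\.5", "1x5"));           // false
        System.out.println(matches("1\\.5", "1.5"));           // true
        // Regex \Q(a+b)\E : everything between \Q and \E is literal.
        System.out.println(matches("\\Q(a+b)\\E", "(a+b)"));   // true
        // Without quoting, the parentheses and + act as meta characters.
        System.out.println(matches("(a+b)", "(a+b)"));         // false
    }
}
```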


Classes for Unicode blocks and categories

\p{InGreek}

A character in the Greek block (simple block)

\p{Lu}

An uppercase letter (simple category)

\p{Sc}

A currency symbol

\P{InGreek}

Any character except one in the Greek block (negation)

[\p{L}&&[^\p{Lu}]]

Any letter except an uppercase letter (subtraction)
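
The block and category classes above can be tried out with a small Java sketch. The helper findAll and the class name are hypothetical; non-ASCII sample characters are written as Java Unicode escapes so that the source file stays plain ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeClassDemo {
    // Collects every match of the regex in the text, joined with "|".
    static String findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return String.join("|", hits);
    }

    public static void main(String[] args) {
        // Uppercase letters only.
        System.out.println(findAll("\\p{Lu}", "OmegaT"));                  // O|T
        // Letters that are not uppercase (subtraction).
        System.out.println(findAll("[\\p{L}&&[^\\p{Lu}]]", "OmegaT 3"));   // m|e|g|a
        // Characters from the Greek block (alpha and beta here).
        System.out.println(findAll("\\p{InGreek}", "a \u03B1 b \u03B2"));
        // Currency symbols (dollar and euro here).
        System.out.println(findAll("\\p{Sc}", "5 $ or 5 \u20AC"));
    }
}
```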


Character classes

[abc]

a, b, or c (simple class)

[^abc]

Any character except a, b, or c (negation)

[a-zA-Z]

a through z or A through Z, inclusive (range)


Predefined character classes

.

Any character (except for line terminators)

\d

A digit: [0-9]

\D

A non-digit: [^0-9]

\s

A whitespace character: [ \t\n\x0B\f\r]

\S

A non-whitespace character: [^\s]

\w

A word character: [a-zA-Z_0-9]

\W

A non-word character: [^\w]


Boundary matchers

^

The beginning of a line

$

The end of a line

\b

A word boundary

\B

A non-word boundary
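
To see how the boundary matchers narrow down a search, here is a minimal sketch in plain Java. The names BoundaryDemo and count are illustrative only, and the sample text is a single line, so ^ and $ coincide with the start and end of the whole input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Counts how many times the regex is found in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "cat concat cats";
        // No boundaries: found inside all three words.
        System.out.println(count("cat", text));         // 3
        // Regex \bcat\b : only the stand-alone word.
        System.out.println(count("\\bcat\\b", text));   // 1
        // Regex \Bcat\b : only at the end of a longer word ("concat").
        System.out.println(count("\\Bcat\\b", text));   // 1
        // Anchored to the beginning and to the end of the line.
        System.out.println(count("^cat", text));        // 1
        System.out.println(count("cats$", text));       // 1
    }
}
```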


Greedy quantifiers

These will match as much as they can. For example, a+ will match aaa in aaabbb.

X?

X, once or not at all

X*

X, zero or more times

X+

X, one or more times


Reluctant (non-greedy) quantifiers

These will match as little as they can. For example, a+? will match only the first a in aaabbb.

X??

X, once or not at all

X*?

X, zero or more times

X+?

X, one or more times
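
The difference between greedy and reluctant quantifiers shows up most clearly with tags. The following is a sketch in plain Java under the assumption that only the first hit matters; QuantifierDemo and first are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first match of the regex in the text, or null if none.
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: takes as many characters as possible.
        System.out.println(first("a+", "aaabbb"));           // aaa
        // Reluctant: stops as soon as the pattern is satisfied.
        System.out.println(first("a+?", "aaabbb"));          // a
        // Greedy .+ runs to the last closing bracket.
        System.out.println(first("<.+>", "<b>bold</b>"));    // <b>bold</b>
        // Reluctant .+? stops at the first closing bracket.
        System.out.println(first("<.+?>", "<b>bold</b>"));   // <b>
    }
}
```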


Logical operators

XY

X followed by Y

X|Y

Either X or Y

(XY)

XY as a single group
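
Alternation and grouping can be combined with quantifiers, as the following plain-Java sketch suggests. The names LogicalDemo and matches are illustrative, and the check is against the whole text rather than a search within it.

```java
import java.util.regex.Pattern;

class LogicalDemo {
    // Returns true if the regex matches the whole text.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // X|Y inside a group: either spelling is accepted.
        System.out.println(matches("gr(a|e)y", "gray"));   // true
        System.out.println(matches("gr(a|e)y", "grey"));   // true
        System.out.println(matches("gr(a|e)y", "groy"));   // false
        // (XY)+ repeats the whole group ...
        System.out.println(matches("(ab)+", "ababab"));    // true
        // ... whereas ab+ repeats only the b.
        System.out.println(matches("ab+", "ababab"));      // false
        System.out.println(matches("ab+", "abbb"));        // true
    }
}
```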



Regex tools and examples of use


A number of interactive tools are available for developing and testing regular expressions. They all follow much the same pattern (see below for an example in the form of a plug-in for Firefox): the regular expression (top entry) is applied to the search text (text box in the middle), and the resulting hits are shown in the result text box.

Regular expressions tester plug-in for Firefox

See The Regex Coach for a stand-alone tool, available for Windows, Linux, Mac, and FreeBSD, that works in much the same way as the example above.

A good collection of useful regex cases can be found in OmegaT itself (see Options > Segmentation). The following list includes expressions you may find useful when searching through the translation memory:

Regular expression

Finds the following:

(\b\w+\b)\s\1\b

Double words

[\.,]\s*[\.,]+

Commas and periods mixed up

\. \s$

Extra blanks following the period at the end of the line

\s+a\s+[aeiou]

English: words starting with a vowel should be preceded by "an", not "a"

\s+an\s+[^aeiou]

English: the same check as above, but for consonants ("a", not "an")

\s\s+

More than one space

\.[A-Z]

Space missing between a period and the start of a new sentence
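
Several of the search expressions listed above can also be tried out in plain Java before using them in OmegaT. This is only a sketch: the class SearchExamplesDemo, the method first, and the sample sentences are invented for illustration, and each backslash is doubled because the expressions appear inside Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SearchExamplesDemo {
    // Returns the first match of the regex in the text, or null if none.
    static String first(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String doubled = "(\\b\\w+\\b)\\s\\1\\b";
        // The backreference \1 requires the same word to appear twice.
        System.out.println(first(doubled, "This is is a test"));   // is is
        System.out.println(first(doubled, "This is a test"));      // null
        // Comma and period mixed up.
        System.out.println(first("[\\.,]\\s*[\\.,]+", "wait,. what"));       // ,.
        // "a" before a vowel (the hit includes the surrounding spaces).
        System.out.println(first("\\s+a\\s+[aeiou]", "it is a apple"));
        // Missing space after a period.
        System.out.println(first("\\.[A-Z]", "End.Start"));                  // .S
    }
}
```

The hits include the context matched by \s, so in a real search the highlighted text starts at the preceding space rather than at the word itself.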


