Introduction to the Rex Package
Summary of Rex
Using Rex in Your Program
Download Rex and Put It on Your Classpath
Import Rex Into Your Code
Rex Basics: Literal Patterns and the Match and Occurs Operators
Building Rex Patterns
Rex Literals
Character Classes
Basic Ways to Define Character Classes
Predefined Character Classes
Character Class Extensions
Negated Character Classes
Character Class Intersection and Subtraction
Repetition Operators
Repetition Operator Basics
Repetition Strategies
The Optional Operator
Concatenation Operators
Convenience Concatenation Operators
The Alternation Operator
Lookahead and Lookback Assertions
Negative Lookahead/Lookback Assertions
Restrictions on Lookahead/Lookback Assertions
Back-References
Flags
Processing Text with Rex Patterns
Extracting Sections of Input Using Named Patterns
Avoiding Name Clashes with Hierarchical Names
Processing All Matches in an Input String
Simple String Replacement
The Tokenizer Class

Introduction to the Rex Package

Regular expressions are incredibly powerful tools when dealing with any sort of textual information. Unfortunately, they are also somewhat arcane, and difficult to debug. Rex (the full package name is 'com.digitaldoodles.rex') intends to change that.

Rex has several goals:

Allow regular expressions to be built and tested in small units, and then easily assembled into larger units. This is simply good software engineering practice, but is surprisingly difficult to achieve with standard regex packages.
Instead of requiring users to learn an arcane regular expression language that often resembles line noise, allow them to create regular expressions using standard methods, operators, and other tools of Scala, and do so in such a manner that the resultant regex is much easier to read than one written directly in a "regex language". This also means that standard IDEs can provide a great deal of contextual help.
Allow groups to be named when they are defined in a regex, and allow matches to those groups to be accessed by their names. This is different from Scala's standard regex support, where the onus is on the programmer to ensure that (if provided) a list of group names correctly corresponds to the parentheses in a regular expression.
Provide a more flexible mechanism for iterating through the results of a regex search. In particular, when a regular expression is used to iterate over a target string, it should be possible to obtain not just the parts of the string that matched, but also the parts that didn't, as this is often useful.
Provide useful predefined regular expressions that users can easily incorporate into their own code.

Before becoming more formal, let me give you an example of how Rex makes it easy to build complex regular expressions from simpler ones. Here's how to build a regular expression that recognizes complex numbers:

// *>1 means 1 or more, greedily (as many times as possible.)
val posInt = CharRange('0','9')*>1
// Lit means literal. The "-" is automatically converted to a literal.
val sign = Lit("+")|"-"
// +~ represents concatenation, ? means an element is optional.
val floatPat = sign.? +~ posInt +~ (Lit(".") +~ posInt).?
// "name" creates named groups for extracting information from matches.
val complex = floatPat.name("re") +~ sign.name("op") +~ floatPat.name("im") +~ "i"

When the complex pattern is used to find a complex number, the real and imaginary parts, and operator, can be pulled out of the match result by name: 're', 'im', and 'op'.
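For readers who know the underlying Java regex engine, here is a rough sketch of what extraction by name looks like there. The regex string is written by hand for this illustration and is only my approximation of the complex pattern above; it is not output produced by Rex, and the ComplexSketch object and its parts method are hypothetical names.

```scala
import java.util.regex.Pattern

object ComplexSketch {
  // Hand-written approximation of floatPat: optional sign, digits, optional fraction.
  private val floatRe = """[+-]?[0-9]+(?:\.[0-9]+)?"""

  // Named groups "re", "op" and "im" mirror the names used in the Rex example.
  val complex: Pattern =
    Pattern.compile(s"(?<re>$floatRe)(?<op>[+-])(?<im>$floatRe)i")

  // Returns (re, op, im) when the whole input is a complex number.
  def parts(input: String): Option[(String, String, String)] = {
    val m = complex.matcher(input)
    if (m.matches()) Some((m.group("re"), m.group("op"), m.group("im")))
    else None
  }
}
```

For example, ComplexSketch.parts("3.5+2i") yields the three components "3.5", "+" and "2".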

Notice how regular expressions in Rex can be easily composed, without the necessity of worrying too much about parentheses or precedence. This is in sharp contrast to trying to build textual regular expressions manually, from smaller textual regular expressions.

Summary of Rex

If you are already familiar with regexes, the following summaries may let you start using Rex more quickly. They are also a convenient reference once you have learned Rex.

Rex character class constructors and methods, where C, C1, C2,... are existing Rex character classes.

CharSet(string) Will match a single character if that character is in string.

CharRange(char1, char2) Will match a single character if that character is in the range char1-char2 inclusive.

!C Matches a single character if and only if C would not match that character.

C.charRange(char1, char2).charSet(string) Matches the set of characters matched by C and the characters in the range char1-char2 and the characters in string. charSet and charRange may be invoked multiple times.

Note: If C is a negated character class, then method invocations of this nature will result in a pattern that fails to match the characters in char1-char2 or in string.

C.Digit.Punctuation... Matches the set of characters matched by C and any digit and any punctuation character; a number of methods are defined that define different character classes.

Note: If C is a negated character class, then method invocations of this nature will result in a pattern that fails to match digits or punctuation characters.

C1 /\ C2 Matches a single character if and only if both C1 and C2 would match that character.

C1 - C2 Matches a single character if it is in C1 and not in C2.

Rex general pattern constructors and methods, where A, B,... are existing Rex patterns.

A +~ B Succeeds if it can first match A, then match B starting from where the match to A ended.

A | B Succeeds if it can match either A or B, starting at the current match position.

A *> m Succeeds if it can match at least m instances of A, repeating as many times as possible while still having the match succeed.

A *> (m, n) Succeeds if it can match at least m and no more than n instances of A, repeating as many times as possible in this range while still having the match succeed.

A *< m Succeeds if it can match at least m instances of A, repeating as few times as possible while still having the match succeed.

A *< (m, n) Succeeds if it can match at least m and no more than n instances of A, repeating as few times as possible in this range while still having the match succeed.

A *! m Succeeds if it can match at least m instances of A, repeating as many times as possible regardless of the effect this has on the match.

A *! (m, n) Succeeds if it can match at least m and no more than n instances of A, repeating as many times as possible in this range regardless of the effect this has on the match.

A.>> Succeeds if A matches immediately after the current match position. The match position is not modified.

A.!>> Succeeds if A fails to match immediately after the current match position. The match position is not modified.

A.<< Succeeds if A matches immediately before the current match position. The match position is not modified.

A.!<< Succeeds if A fails to match immediately before the current match position. The match position is not modified.

Rex string testing operators, where M is a Rex pattern.

M ~~= string Returns true if M exactly matches string.

M !~~= string Returns true if M cannot exactly match string.

M ~= string Returns true if M matches some part of string.

M !~= string Returns true if M cannot match any part of string.

Rex string processing methods, where M is a Rex pattern.

M.findFirstIn(string) Find and return the first matching substring of string as an Option[MatchResult].

M.findAllIn(string) Iterate over both matching and non-matching portions of string. This is the most general-purpose method of doing text manipulation in Rex.

M.replaceAllIn(input, replacement) Convenience method for performing simple replacement with a constant string.

Using Rex in Your Program

Download Rex and Put It on Your Classpath

Get Rex from https://github.com/KenMcDonald/rex and put it on your project's classpath. As of the initial Rex 0.7 release, there is no prepackaged Rex artifact; this may change in the future.

Import Rex Into Your Code

You need at least one, and generally three, import statements to use Rex in your own code:

// This imports the basic constructors and predefined patterns.
import com.digitaldoodles.rex._
// This imports a single (as of Rex 0.7) implicit conversion that allows strings to be used
// as literals in Rex expressions.
import com.digitaldoodles.rex.Implicits._
// This imports objects that contain further predefined patterns; see the API documentation for details.
import com.digitaldoodles.rex.patterns._

The second import, allowing strings to be converted to Rex literals, is a definite convenience, and in the code examples in this document, we will normally assume the import of this implicit conversion.

Rex Basics: Literal Patterns and the Match and Occurs Operators

The simplest type of pattern in Rex is a literal pattern, which is formed with the Lit constructor function; for example, Lit("bc") defines a literal pattern that matches the string "bc".

In addition, there are two different basic operators for matching a pattern against an input string. The ~~= operator is called the match operator, and returns true if and only if the pattern exactly matches the entire input string. The ~= operator is called the occurs operator, and returns true if and only if the pattern exactly matches some part of the input string.

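For example, here are three matches that are all true; the comments note which part of the input string is matched by the pattern.

val bc = Lit("bc")
bc ~~= "bc"    // matches the entire input "bc"
bc ~= "bc"     // matches the entire input "bc"
bc ~= "abcd"   // matches the "bc" in the middle of "abcd"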

The statement Lit("bc") !~~= "abcd" is also true; !~~= means "not ~~=", which is to say, "does not match the input string exactly". There is also !~= meaning, "the pattern does not occur in the input string."

There are much more sophisticated ways of matching against input strings, but the match and occurs operators, and their negations, give us convenient ways of showing how Rex patterns match against input strings.
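If you know java.util.regex, the match and occurs operators correspond closely to Matcher.matches() and Matcher.find(). The sketch below uses only the Java engine, not Rex; that Rex delegates to exactly these two calls is my assumption, and MatchVsOccurs is a hypothetical name.

```scala
import java.util.regex.Pattern

object MatchVsOccurs {
  // Pattern.quote gives a literal pattern, comparable to Lit("bc").
  private val bc = Pattern.compile(Pattern.quote("bc"))

  // Analogue of ~~= : the pattern must cover the entire input.
  def matchesExactly(input: String): Boolean = bc.matcher(input).matches()

  // Analogue of ~= : the pattern must match somewhere in the input.
  def occursIn(input: String): Boolean = bc.matcher(input).find()
}
```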

Building Rex Patterns

Now that you have an idea of what Rex can do and the basic terminology and operations, it's time to go into Rex in more detail. This section discusses in detail how to build Rex patterns from the ground up, how to access predefined Rex patterns, and how to combine matchers to produce a new matcher. Note that we use the terms 'pattern' and 'matcher' interchangeably.

Rex Literals

You've already seen Rex literal matchers: Lit("123") produces a literal pattern that matches the string "123". However, there are a few more points to be made, especially if you have experience with a "standard" regular expression engine.

Characters that have special meanings in normal regex engines do not need to be escaped. In most regular expression packages, including the Java regex engine that Rex uses, there are a large number of characters (including +, *, (, ), and others) that need to be escaped (usually with a preceding backslash) to be used in a literal pattern. Rex does that for you automatically.
Conversely, characters that have special meanings in standard regexes do not have those meanings in Rex. For example, '.' in most regex engines will match (almost) any single character, but in Rex, Lit(".") simply matches a period.
You still need to be aware of string escaping. Lit("\n") will not match a backslash followed by an n, it will match a newline because the character sequence "\n" has a special meaning. To match a backslash followed by an n, you can either do Lit("\\n") or Lit("""\n""").
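The three points above can be tried out directly against the Java engine. In this sketch, Pattern.quote stands in for the escaping that Lit performs; whether Rex uses Pattern.quote or escapes characters one at a time is an assumption on my part, and LiteralSketch is a hypothetical helper.

```scala
import java.util.regex.Pattern

object LiteralSketch {
  // Compile a string so that every character in it is matched literally.
  def lit(s: String): Pattern = Pattern.compile(Pattern.quote(s))

  def matchesExactly(literal: String, input: String): Boolean =
    lit(literal).matcher(input).matches()

  // The two spellings of "backslash followed by n" are the same two-character string.
  val sameString: Boolean = "\\n" == """\n"""
}
```

Note that string escaping happens in the Scala compiler, before the regex engine ever sees the text, which is why "\n" is a one-character newline while "\\n" is two characters.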

Character Classes

A character class is a pattern that matches a single character so long as it is in the set of characters recognized by the character class. Rex provides two basic ways to define character classes, provides a number of predefined character classes, and gives you a number of ways to modify or combine existing character classes.

Basic Ways to Define Character Classes

There are two ways to define new character classes:

The CharSet(string) construct defines a character class that contains the characters given in the string argument.
Note: A CharSet cannot take an empty string as an argument. If you pass in an empty string, you will get a runtime error.
The CharRange(char1, char2) construct defines a character class that contains all characters between and inclusive of two character arguments that give the endpoints of a range of characters. Note that the arguments are of type Char, not of type String.

Predefined Character Classes

The Chars object provides a number of predefined character classes, such as Chars.Lower (lowercase characters), Chars.Digit (0-9), Chars.Punctuation (punctuation characters), and many others.

You can browse the API docs for a full list of values available in Chars, and use content assist in most IDEs to help in completing the name for one of these values.

As well, the subpackage com.digitaldoodles.rex.patterns provides a number of predefined patterns, and some of the character classes defined in Chars are aliased there, for consistency within the patterns offerings.

Character Class Extensions

You can form new character classes by extending an existing class with new characters. This is done simply by invoking appropriate methods on an existing character class, in one of two ways:

The charSet and charRange methods are analogous to the CharSet and CharRange functions, except that instead of creating from scratch a new class that encompasses a given set or range of characters, they extend an existing class with a set or range of characters, to form a new class that will match a character if it was in the invoking character class, or in the set or range of characters defined by the methods.
It's really much easier to illustrate this by example than to explain it in prose.
// Match a lower-case vowel or any upper-case letter.
CharSet("aeiou").charRange('A', 'Z')
// Match digits, or mathematical symbols, or a few variable names.
CharRange('0', '9').charSet("+-*/^").charSet("xyz")
For each character class defined in Chars, there is an identically named method available for Rex character class patterns that produces a new character class which will match anything the original character class would match, or anything the character class identified by the method name would match. For example:
// Matches any single digit.
Chars.Digit
// Matches any single digit or lowercase letter.
Chars.Digit.Lowercase
// Matches any single digit or lowercase letter or punctuation mark.
Chars.Digit.Lowercase.Punctuation
// Matches digits, lowercase letters, or the basic arithmetic signs
Chars.Digit.Lowercase.charSet("+-*/")

These calls can be chained to any amount required, and of course you do not have to assemble a character class "all at one go". With a couple of special exceptions we'll discuss later, you can always extend an existing character class as shown above.

Note: "Extending" a character class is purely functional; existing character classes are never modified, but rather new ones are created.
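In standard regex terms, extending a character class just adds more members inside the same pair of square brackets. The sketch below shows hand-written Java regex equivalents of the first two extension examples; the exact regex text Rex generates may differ, and CharClassSketch is a hypothetical name.

```scala
import java.util.regex.Pattern

object CharClassSketch {
  // Roughly CharSet("aeiou").charRange('A', 'Z')
  val vowelOrUpper: Pattern = Pattern.compile("[aeiouA-Z]")

  // Roughly CharRange('0', '9').charSet("+-*/^").charSet("xyz").
  // The hyphen is escaped so that it is not read as a range operator.
  val digitOrMath: Pattern = Pattern.compile("[0-9+\\-*/^xyz]")

  // True if the class accepts the single character c.
  def accepts(p: Pattern, c: Char): Boolean = p.matcher(c.toString).matches()
}
```

Having to remember which characters need escaping inside brackets is exactly the kind of detail the charSet method is meant to take care of.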

Negated Character Classes

You can use the unary negation operator "!" to obtain a character class that matches anything that is not matched by the original class, and doesn't match anything that is matched by the original class. For example:

// Matches anything that isn't a punctuation mark.
!Chars.Punctuation
// Match anything that isn't a letter or a period.
!Chars.Lowercase.Uppercase.charSet(".")

It's important to note that a negated character class actually has an internal flag set that says, "invert your matching behavior with respect to the characters you are defined over". This means that if you extend a negated character class, you will get a character class that doesn't match the characters you use in the extension. For example:

// Match anything except a digit.
val noNumber = !Chars.Digit
// We can't "add a digit back into the match".
noNumber.charSet("0") !~~= "0"
// If we extend with something new, that will be included in the set of things that don't match.
noNumber.Uppercase !~~= "A"

You can negate a negated character class, in which case you get back a regular, non-negated character class. In fact, for any character class C, C and !!C will perform identically.
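The behavior of extending a negated class is easier to see in bracket notation: the negation marker stays at the front, so anything added afterwards lands inside the excluded set. This is a sketch using hand-written Java regexes, under the assumption that Rex renders a negated class as a single bracket expression starting with ^; NegatedClassSketch is a hypothetical name.

```scala
import java.util.regex.Pattern

object NegatedClassSketch {
  // Roughly !Chars.Digit
  val noNumber: Pattern = Pattern.compile("[^0-9]")

  // Roughly (!Chars.Digit).Uppercase: the upper-case letters join the excluded set.
  val noNumberOrUpper: Pattern = Pattern.compile("[^0-9A-Z]")

  def accepts(p: Pattern, c: Char): Boolean = p.matcher(c.toString).matches()
}
```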

Character Class Intersection and Subtraction

The final thing you can do with character classes is to take the intersection of two of them (/\), or to subtract one from another (-). These are really almost the same operation, as we'll see in a minute. They are somewhat different from other character class operations, in that their result cannot be further operated on with character class operations; it is actually an instance of the FinalCharClass Scala class. This is due to the way intersection and subtraction are implemented in the underlying Java regex engine.

Character class subtraction has probably the most obvious use case. Let's say you need to be able to match any punctuation character except a period or comma. This will do it:

val restrictedPunct: FinalCharClass = Chars.Punctuation - CharSet(",.")
// Note that we cannot perform any more character class operations on the result.
restrictedPunct.Digit // COMPILE-TIME ERROR

More formally, C1 - C2 matches a character if it is in C1 and not in C2.

On the other hand, character class intersection (C1 /\ C2) matches a character only if that character is in both C1 and C2. This is probably more useful when dealing with things like categories of Unicode characters, something which I believe can be done with Java regexes, but which has not yet been implemented in Rex.

We said before that intersection and subtraction were really almost the same operation. Here's why (where C1 and C2 are character classes):

C1 - C2 is the same as C1 /\ !C2
C1 /\ C2 is the same as C1 - !C2
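The Java regex engine has only one operator for this, the && intersection inside brackets, and subtraction is written as intersection with a negated class, which is the first identity above. The sketch below uses hand-written Java regexes; that \p{Punct} corresponds to Chars.Punctuation is my assumption, and IntersectionSketch is a hypothetical name.

```scala
import java.util.regex.Pattern

object IntersectionSketch {
  // Punctuation minus comma and period: C1 - C2 written as C1 /\ !C2.
  val restrictedPunct: Pattern = Pattern.compile("""[\p{Punct}&&[^,.]]""")

  // Plain intersection: lower-case letters that are also vowels.
  val lowerVowel: Pattern = Pattern.compile("[a-z&&[aeiou]]")

  def accepts(p: Pattern, c: Char): Boolean = p.matcher(c.toString).matches()
}
```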

Repetition Operators

Repetition Operator Basics

So far, the only things we can do are match a particular sequence of characters, or a single character that can be any of a set of characters. Not very exciting. With repetition operators, we start to encounter the real power of regular expressions and Rex patterns.

We'll begin with an example of the most common regular expression repetition operator:

"a" *> 0 ~~= "aaaa"

The binary operator *> is what is of interest here. It tries to match the input string by repeating its first argument at least as many times as the number given in the second argument. The ">" sign indicates this is a greedy operator, which is something we'll get to in a bit. The "*" is used in all repetition operators; partly because it is the basic repetition operator in standard regexes, and partly because * is the multiplication sign, and repetition is simply matching multiple copies of something.

Here are some more examples of simple repetition:

"a" *> 3 !~~= "aa" // Fails because the pattern must match at least three a's.
"ab" *> 0 ~~= "ababab"
"ab" *> 0 !~~= "abababa" // Fails because the last "a" in the input cannot be matched by an "ab" in the pattern.

With repetition and character classes, we now have some real power; for example, Chars.Digit *> 1 will match any unsigned integer.

A slight variant allows the second argument of *> to be a 2-tuple of integers: "a" *> (3, 5) will match a sequence of from three to five a's.
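For comparison, here is how the same repetitions look as standard Java regexes: a minimum count m becomes {m,} and a pair (m, n) becomes {m,n}, with multi-character literals wrapped in a non-capturing group. These strings are written by hand and are only my guess at what Rex produces; RepetitionSketch is a hypothetical name.

```scala
import java.util.regex.Pattern

object RepetitionSketch {
  def matchesExactly(regex: String, input: String): Boolean =
    Pattern.compile(regex).matcher(input).matches()

  val atLeastThreeAs = "a{3,}"   // roughly "a" *> 3
  val anyNumberOfAb = "(?:ab)*"  // roughly "ab" *> 0
  val threeToFiveAs = "a{3,5}"   // roughly "a" *> (3, 5)
}
```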

Repetition Strategies

If you ask someone to choose a piece of pie from a plate, they might choose different pieces for different reasons. A polite person might choose the smallest piece, while a greedy (or simply hungry) person might choose the largest piece. Similarly, repetition operators may choose to match with a smaller or larger part of the input string. Below is a brief description of the different strategies, followed by examples.

Greedy: The greedy repetition operator is *>, where the > sign is supposed to indicate more. The greedy operator will match as many copies of its pattern as it can in the input string, subject to the caveat that it will not go beyond the point that would cause the match to fail. Examples are coming up shortly.
Non-Greedy: This is also sometimes called the minimal repetition strategy, and is implemented with the *< operator, where the < sign is supposed to indicate the fact that this is a minimal match. A minimal repetition will match as little of the input as possible, but will enlarge the amount it matches if necessary to make the match succeed.
Possessive: The possessive operator is *!; the "!" is used because it is used in some languages to indicate a "no-backtracking" cutpoint, which is more or less what possessive repetition is. Basically, the possessive operator is like the greedy operator in that it will match as much of the input as possible, but unlike the greedy operator, it will not relinquish any part of the input if that is necessary for the match to succeed; instead, it will let the match fail.
The primary purpose of the possessive operator is to limit backtracking in patterns that would otherwise perform unacceptably slowly. Unless you have a speed problem, you should probably avoid the possessive operator, as it can make patterns more difficult to understand and debug.

All of the three repetition operators *>, *<, and *! can take either a single integer as argument, meaning, "you must match at least this many copies of your pattern", or a 2-tuple (m, n), meaning "you must match at least m but no more than n copies of your pattern".

Here are some examples to make obvious how the different repetition operators work.

Examples of Different Types of Repetition (Greedy, Non-Greedy, and Possessive).

// Notice that even though it is greedy, the left side of the pattern leaves the final "b" of the
// input for the right side of the pattern to match, so that the overall pattern can succeed.
CharSet("ab")*>1 +~ CharSet("ab")*>1 ~~= "ababab"

// The left side of the pattern is non-greedy, and would prefer to match none of the input at all; however,
// the right side of the pattern can match at most two characters, so the left side matches the
// first part of the input, up to the last two characters, so that the overall match can succeed.
CharSet("ab")*<0 +~ CharSet("ab")*>(1, 2) ~~= "abababab"

// The left side of the pattern is possessive, and matches as much of the input as it can, which
// in this case is all of it; however, this leaves nothing for the right side of the pattern to match,
// and it must match at least one character, so the match fails. Contrast this with the first example, where the
// left side of the pattern was greedy, and the match succeeded.
CharSet("ab")*!1 +~ CharSet("ab")*>1 !~~= "ababab"
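To see exactly how the input is divided between the two halves of each pattern, the three strategies can be run on the Java engine with each half wrapped in a capturing group. This is a sketch with hand-written regexes (+, *? and ++ being the standard greedy, non-greedy and possessive forms); the numbered groups exist only for this illustration, and RepetitionStrategies is a hypothetical name.

```scala
import java.util.regex.Pattern

object RepetitionStrategies {
  // Returns the text captured by groups 1 and 2 if the regex matches the whole input.
  def split(regex: String, input: String): Option[(String, String)] = {
    val m = Pattern.compile(regex).matcher(input)
    if (m.matches()) Some((m.group(1), m.group(2))) else None
  }

  // Greedy left side: takes everything it can, then gives back one character.
  val greedy: Option[(String, String)] = split("([ab]+)([ab]+)", "ababab")

  // Non-greedy left side: grows only until the right side can finish the match.
  val nonGreedy: Option[(String, String)] = split("([ab]*?)([ab]{1,2})", "abababab")

  // Possessive left side: takes everything and never gives anything back.
  val possessive: Option[(String, String)] = split("([ab]++)([ab]+)", "ababab")
}
```

Here greedy splits the input into "ababa" and "b", nonGreedy into "ababab" and "ab", and possessive fails to match at all.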

The Optional Operator

The "optional" operator makes a pattern optional; it is written as A.?. It is really just syntactic shorthand for A *> (0, 1), but is useful not only because it cuts down some typing in a common case, but also because it is much easier to read the meaning of a pattern when the optional operator is used.

Concatenation Operators

Concatenation operators simply assemble patterns into a sequence, which matches if each part of the sequence matches the input, one after the other. The basic concatenation operator is +~; A +~ B matches an input string if A matches from the beginning of the string to some point in the string, and B matches from where A left off to the end of the string.

Why '+~'? I originally used '&' as the concatenation operator, but its precedence led to too many parentheses. Next was '+', but since that's already used for string concatenation, it caused various problems. I finally settled on '+~' because it is of appropriate precedence and '~' is (via Perl) associated with regexes.

Using concatenation plus what we already know from above, we can finally start doing some interesting things. Here's how to match an integer with an optional sign.

val SignedInt = CharSet("+-").? +~ Chars.Digit*>1
SignedInt ~~= "-1234"

With that defined, it's quite easy to build on it to obtain a pattern that matches numbers with a decimal component.

val SignedFloat = SignedInt +~ ("." +~ Chars.Digit*>1).?
SignedFloat ~~= "3.14159"
SignedFloat ~~= "-100"

This is a big advantage Rex has over standard regexes—you can build complex patterns up bit by bit, making them both more readable and more testable.
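To make the contrast concrete, here is the kind of mistake that creeps in when textual regexes are glued together by hand. The sketch uses plain Java regex strings, not Rex, and the names in it are hypothetical; it shows that naive string concatenation silently changes the meaning of an alternation unless you remember to add a group.

```scala
import java.util.regex.Pattern

object ComposeSketch {
  val sign = """\+|-"""   // a plus or a minus
  val digits = "[0-9]+"

  // Naive concatenation yields \+|-[0-9]+ : "a plus, OR a minus followed by digits".
  val naive: Pattern = Pattern.compile(sign + digits)

  // The intended meaning needs an explicit non-capturing group around the sign.
  val grouped: Pattern = Pattern.compile("(?:" + sign + ")" + digits)

  def matchesExactly(p: Pattern, input: String): Boolean = p.matcher(input).matches()
}
```

With the naive version, "+12" is rejected while a bare "+" is accepted, which is almost certainly not what was meant.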

Convenience Concatenation Operators

In addition to the standard concatenation operator described above, Rex defines two "convenience" concatenation operators, which can help a bit in the common case of matching strings separated by whitespace.

A +~~ B matches an input string if A matches the start of the input, there is at least some whitespace, and then B matches the rest of the input. So:

"a" +~~ "b" ~~= "a    b"
"a" +~~ "b" !~~= "ab"

By contrast, A +~~? B matches an input string if A matches the start of the input, there might be some whitespace (but it isn't required), and then B matches the rest of the input. Thus:

"a" +~~? "b" ~~= "a    b"
"a" +~~? "b" ~~= "ab"

The Alternation Operator

A simple but extremely important operator for building patterns is the alternation (|) operator. The expression A | B | C will match an input string if A matches it, or if B matches it, or if C matches it, and the alternatives will be tried in that order.

For example, the following matches numbers or letter sequences or "line noise".

val threeWay = Chars.Digit*>1 | Chars.Alphabetic*>1 | Chars.Punctuation*>1
threeWay ~~= "12345"
threeWay ~~= "Hello"
threeWay ~~= "#&%^*^"
threeWay !~~= "Hello, Number 1"
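The fact that alternatives are tried in order matters mostly when searching rather than when matching a whole string: the first alternative that succeeds at a position wins, even if a later one would match more. The sketch below shows this on the Java engine; the regexes are hand-written, and treating \p{Alpha} and \p{Punct} as the equivalents of Chars.Alphabetic and Chars.Punctuation is my assumption. AlternationSketch is a hypothetical name.

```scala
import java.util.regex.Pattern

object AlternationSketch {
  // Hand-written approximation of the threeWay pattern.
  val threeWay: Pattern = Pattern.compile("""[0-9]+|\p{Alpha}+|\p{Punct}+""")

  def matchesExactly(p: Pattern, input: String): Boolean = p.matcher(input).matches()

  // Returns the text of the first occurrence of regex in input, if any.
  def firstMatch(regex: String, input: String): Option[String] = {
    val m = Pattern.compile(regex).matcher(input)
    if (m.find()) Some(m.group()) else None
  }
}
```

For instance, firstMatch("a|ab", "ab") finds only "a", whereas firstMatch("ab|a", "ab") finds "ab".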

Lookahead and Lookback Assertions

We're now getting to some less-used, but still very useful, parts of regular expression construction: lookahead and lookback assertions. They're called assertions because, unlike other pattern-matching constructs, they don't advance the match position (see below) nor are they considered part of a match when extracting subparts of a match (which will be discussed in a later section).

Let's explore what this means in a bit more detail. At the bottom, all regular expression matching is done one character at a time; a character from a pattern is matched against a character of the input, then a character from a pattern is matched against the next character in the input, and so on. This means that, as a match is being calculated, there is a part of the input that has been matched, and following that, a part of the input that has not yet been matched. The boundary between the two of these is called the match position. The match position generally moves forward, but can move backwards; this happens often with the alternation operator, when one alternative fails and the match position is "backed up" so the next alternative can be tried. In general, a pattern starts matching at the current match position and, if successful, advances the match position so that the next pattern starts matching further along the string.

The lookahead and lookback assertions break this pattern somewhat. They do, of course, calculate matches one character at a time. They also start matching at the current match position. However, they do not change the match position. This is the reason for their names; they look, but don't actually change anything. However, if they fail, then the match they are part of also fails.

In the following examples, keep in mind that the lookahead/lookback assertions are not really "part of" the match, even though they must succeed for the match to succeed.

// Fails: the "." in "x.y" is neither preceded nor followed by a character in the range a-b.
CharRange('a', 'b').<< +~ Lit(".") +~ CharRange('a', 'b').>> !~= "x.y"
// Succeeds: only the "." in "a.b" is matched; the surrounding "a" and "b" are checked but not consumed.
CharRange('a', 'b').<< +~ Lit(".") +~ CharRange('a', 'b').>> ~= "x.y a.b"
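That the assertions do not consume input can be checked by looking at the boundaries of the match on the Java engine. In this sketch, (?<=...) and (?=...) are the standard Java lookbehind and lookahead forms, which I assume are what .<< and .>> translate to; LookaroundSketch is a hypothetical name.

```scala
import java.util.regex.Pattern

object LookaroundSketch {
  // A period that is preceded and followed by a character in the range a-b.
  private val dot = Pattern.compile("""(?<=[a-b])\.(?=[a-b])""")

  // Returns (start, end, matched text) of the first occurrence, if any.
  def firstOccurrence(input: String): Option[(Int, Int, String)] = {
    val m = dot.matcher(input)
    if (m.find()) Some((m.start(), m.end(), m.group())) else None
  }
}
```

In "x.y a.b" the match covers only the single character at index 5, the period, even though the characters on either side of it had to be inspected.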

Negative Lookahead/Lookback Assertions

In addition to the positive assertions shown above (.>> and .<<), there are negative lookahead/lookback assertions .!>> and .!<<, which succeed if they cannot match after or before the match position. This is important because, until now, the only way we've been able to say "succeed in matching if something doesn't match" has been with negated character classes, which are only good for a single character. Such logic can't be implemented with "normal" constructs such as literals, because "normal" pattern constructs advance the match position--and where would you move the match position to for a construct that succeeded if it didn't match? However, since lookahead and lookback assertions do not change the match position, they are free to implement negative logic.

The example in the section on back-references uses this ability, albeit in a simplistic manner.
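As a smaller illustration first, here are a negative lookahead and a negative lookback on the Java engine. The regexes use the standard (?!...) and (?<!...) forms, which I assume are what .!>> and .!<< translate to; the Rex spellings mentioned in the comments are my reading of the operators described above rather than tested code.

```scala
import java.util.regex.Pattern

object NegativeAssertionSketch {
  def occursIn(regex: String, input: String): Boolean =
    Pattern.compile(regex).matcher(input).find()

  // A "q" that is not followed by "u"; presumably Lit("q") +~ Lit("u").!>> in Rex.
  val qWithoutU = "q(?!u)"

  // A "b" that is not preceded by "a"; presumably Lit("a").!<< +~ Lit("b") in Rex.
  val bWithoutA = "(?<!a)b"
}
```

Note that a negative assertion also succeeds at the very edge of the input, where there is nothing at all to look at.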

Restrictions on Lookahead/Lookback Assertions

Some regular expression engines place restrictions on what patterns can be turned into (especially) a lookback assertion. This is due to the fact that lookback assertions must, in effect, make their patterns run in reverse. My understanding is that the Java regex engine is fairly general and does not put too many restrictions on lookback assertions. However, if you run into problems that seem to be related to this, you can do a number of things:

Isolate the lookback portion of your pattern as much as possible, and run it against simplified input to see if the problem recurs.
Do a Google search for something like "Java regular expression lookback" and see what you can find.
Use the .pattern method on your pattern to get the Java regex, and post it with a question to stackoverflow.com or somewhere similar.

Back-References

The final tool for building patterns is the back-reference. Before getting into what back-references are and how to use them, we must first touch on a subject that will be discussed in greater detail in a later section: named groups.

A named group is simply a part of a pattern that has been assigned a name using the name method. When the pattern participates in a match, the name given to the named part of the pattern may then be used to refer to whatever in the input was matched by that part of the pattern. For example, in the pattern val complexMatcher = Number.SignedFloat.name("re") +~ ("-"|"+").name("sign") +~ Number.SignedFloat.name("im") +~ "i", the various parts of a complex number have been named so that if a complex number is found by the pattern, its components may be extracted. Exactly how to do this is discussed in a later section.

Most regular expression engines enclose groups in parentheses, and identify a particular group by counting left parentheses from the start of the regex. This is extremely error prone, and one of the primary reasons why standard regexes cannot easily be composed into larger regexes. Python is one of the few languages that offers named groups, and my experience with them was so positive that I don't even allow the creation of non-named groups in Rex. Rex has a feature for eliminating name clashes, and keeps track of the correspondence between group name and group number automatically.

To understand back-references, we'll look at a slightly more involved example, that also has the advantage of showing off a major use of lookahead assertions. In most computer languages, strings are quoted with double-quotes ("), but in some they can be quoted with single quotes ('). We want to build a single pattern that will match either type of string, including handling the case of backslash-escaped quote characters within the string. The solution is below.

Note: The hinky backslashing is necessary because " and \ themselves have special meanings within strings; so, for example, we need to write '\\\"' in a string so that the pattern will see '\"'.

// A quote mark is just a " or a '; we assign it the name "quote".
val quoteMark = CharSet("\"'").name("quote")
// For the string body, we will take a backslash followed by any character; if we don't find that,
// we use a negative lookahead assertion to verify that the next character is not the same
// as the quote character, and then take that.
val stringBody = ("\\" +~ Chars.Any | SameAs("quote").!>> +~ Chars.Any) *>0
// A complete quoted string is just a quote mark, a string body, and an ending quote that is the
// same as the starting quote.
val quotedString = quoteMark +~ stringBody +~ SameAs("quote")
quotedString ~~= "\"Don't say \\\"No\\\"!\""
quotedString ~~= "'Don\\'t say No!'"
// This fails because the start quote mark is ' and the end quote mark is ".
quotedString !~~= "\"Don\\'t say No!'"
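
For comparison, the same idea can be sketched in raw java.util.regex syntax, where the back-reference is the numbered \1 and the negative lookahead is (?!\1). This is only an approximation written for illustration: the QuotedStringRegex object is a made-up name, and I am assuming whole-string matching via matches(), which may or may not correspond to the Rex operator used above.

```scala
import java.util.regex.Pattern

object QuotedStringRegex {
  // Group 1 captures the opening quote; \1 refers back to it.
  // Body: either a backslash plus any character, or any character that is
  // not the opening quote (checked with the negative lookahead (?!\1)).
  val quoted: Pattern = Pattern.compile("""(["'])(?:\\.|(?!\1).)*\1""")

  def isQuoted(s: String): Boolean = quoted.matcher(s).matches()

  def main(args: Array[String]): Unit = {
    println(isQuoted("\"Don't say \\\"No\\\"!\"")) // true
    println(isQuoted("'Don\\'t say No!'"))          // true
    println(isQuoted("\"Don\\'t say No!'"))         // false: opens with " but ends with '
  }
}
```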

Flags

Regular expression engines typically have a number of flags that can be used to change details of how a pattern or subpattern matches. During a career that has used regexes quite a bit, I've come to the conclusion that flags are error-prone and should be avoided. The problem is that they can make the same regex behave in completely different ways, and if the regex is constructed in one place but compiled (with flags) in another area of the code, it will not be at all obvious what is causing the unexpected behavior.

I believe that Rex is sufficiently flexible and powerful that there is simply no need for most of the flags found in regex engines. The exception is with case sensitivity; it's easy to build a character class that matches both upper and lower case, but not so easy to do the same with literals; for example, if you're processing HTML, you probably want the literal "span" to match both "span" and "SPAN".

As a result Rex provides the methods ASCIICaseSensitive, ASCIICaseInsensitive, UnicodeCaseSensitive, and UnicodeCaseInsensitive. When invoked on a Rex pattern, they make that pattern sensitive or insensitive to either just the ASCII characters or to all Unicode characters.

The default is case-sensitive. You can nest or combine patterns with different case sensitivities, for example:

("a"|"b".UnicodeCaseSensitive).UnicodeCaseInsensitive *>0

This matches sequences such as "aAbabAAAbbAaAab". The rule is that the behavior of a pattern is determined by the innermost flag declaration containing that pattern; so the "b" literal is case sensitive.

This also means that in the pattern A.ASCIICaseInsensitive.ASCIICaseSensitive, the pattern match of A will be case insensitive, as determined by the innermost flag.
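
The innermost-flag rule resembles the way inline flag groups nest in java.util.regex, so a rough analogue of the example above can be sketched with the standard library. This is an analogy only; I am not claiming this is how Rex compiles the pattern, and the NestedCaseFlags object is a hypothetical name. Note that Java's plain i flag covers ASCII letters only unless UNICODE_CASE is also set.

```scala
import java.util.regex.Pattern

object NestedCaseFlags {
  // (?i: ... ) turns case-insensitivity on for its contents;
  // the inner (?-i: ... ) turns it back off for the literal b only.
  val pattern: Pattern = Pattern.compile("(?i:a|(?-i:b))*")

  def matchesAll(s: String): Boolean = pattern.matcher(s).matches()

  def main(args: Array[String]): Unit = {
    println(matchesAll("aAbabAAAbbAaAab")) // true: "a" in either case, "b" in lower case
    println(matchesAll("aB"))              // false: upper-case "B" is rejected by the inner flag
  }
}
```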

Processing Text with Rex Patterns

The prior sections have mostly talked about how to construct Rex patterns. The only ways we've seen to use them are with the matching and occurs operators, ~=, !~=, ~~=, and !~~=. While that kind of testing can certainly be useful, it is limited. Fortunately, patterns permit much more powerful text processing; you can search for text, extract parts of that text, replace that text with different text or a modified version of the same text, and so on.

Extracting Sections of Input Using Named Patterns

A major part of this utility comes from the ability to assign names to parts of a Rex pattern and then use those names to extract the sections of the input matched by those parts of the pattern. We've seen the use of named subpatterns (or groups, as they are commonly called) in the section on back-references above, but now it's time to take a look at the more general use of named groups.

The code below does the following:

Creates a pattern to match complex numbers, with named groups. It uses the SignedFloat pattern that is predefined in rex.patterns to make this job easier.
Uses that pattern and the findFirstIn method to find the first complex number in an input string and match against it.
Processes the Option[MatchResult] object produced by the last step to extract the parts of the complex number that was found.

The MatchResult type is the primary type for reporting the results of a match, but notice that findFirstIn returns an Option[MatchResult]; this is necessary because there is no guarantee that findFirstIn can find a match so as to produce a MatchResult. We'll discuss findAllIn shortly, which is more general.

// A complex is a float followed by a + or - followed by a float, followed by an "i"
// The two numeric parts and the sign are named for access.
val complexMatcher = Number.SignedFloat.name("re") +~ ("-"|"+").name("sign") +~ Number.SignedFloat.name("im") +~ "i"
// Find the first complex number in the input and format its parts.
val found: Option[MatchResult] = complexMatcher.findFirstIn("3.2+4.5i")
val complex = found match {
	case None => "" // no complex number found; keeps the result a String
	case Some(mr) => mr("re") + " " + mr("sign") + " " + mr("im") + "i"
}
assert(complex === "3.2 + 4.5i")

Hopefully this code is fairly easy to read and understand. It may be helpful to clarify or repeat a few points:

Objects of type MatchResult are what contain information about a match.
Sections of a pattern are named using the name(nameString) method.
To extract parts of the input corresponding to named portions of the pattern, treat the MatchResult as a Map[String, String] instance, and simply say someMatchResult(patternSectionName).
Although not shown above, use the string method to get the entire portion of the input that was matched by the pattern.
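
If you know java.util.regex better, the points above correspond roughly to its named-group syntax, (?<name>...), and Matcher.group(name). The sketch below is a standard-library comparison, not Rex code: the simplified float regex is my own assumption rather than the definition of Number.SignedFloat, and ComplexParts is a hypothetical name.

```scala
import java.util.regex.Pattern

object ComplexParts {
  // Simplified float (optional sign, digits, optional fraction); an assumption,
  // not the definition of Rex's Number.SignedFloat.
  private val float = """[+-]?\d+(?:\.\d+)?"""

  val complex: Pattern =
    Pattern.compile(s"(?<re>$float)(?<sign>[+-])(?<im>$float)i")

  // Option models "no match", much like the Option[MatchResult] from findFirstIn.
  def describe(input: String): Option[String] = {
    val m = complex.matcher(input)
    if (m.find()) Some(s"${m.group("re")} ${m.group("sign")} ${m.group("im")}i")
    else None
  }

  def main(args: Array[String]): Unit = {
    println(describe("3.2+4.5i"))   // Some(3.2 + 4.5i)
    println(describe("no numbers")) // None
  }
}
```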

Avoiding Name Clashes with Hierarchical Names

One problem with naming subgroups in Rex patterns is that you may run into a name clash when combining two patterns that both use the same name somewhere within them. Rex will throw a runtime error if you do this. Fortunately, Rex provides an easy way to fix this, via hierarchical (or dotted) names.

Let's say A is a Rex pattern containing several named sections. If we now produce a new, named version of A by saying A.name("A."), then A will receive the name "A", and all names of subpatterns of A will receive the prefix "A.". This only occurs when we pass to the name method a name ending with a ".".

To see this in action, let's continue with the complex number example shown previously, and allow it to match two complex numbers at once. Since the complex number pattern has named subpatterns, we'll need to use the hierarchical naming trick to avoid name clashes.

val doubleMatcher = complexMatcher.name("num1.") +~~ complexMatcher.name("num2.")
val doubleResult = doubleMatcher.findFirstIn("1+2i 3+4i").get
assert(doubleResult("num1.re") === "1")
assert(doubleResult("num2.im") === "4")
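
To show what the prefix is saving you from, here is a sketch of the same situation in plain java.util.regex, where reusing a group name is rejected at compile time and the prefixing has to be done by hand. Everything here (the PrefixedGroups object, the simplified float regex, and the prefix scheme) is an assumption for illustration. Java group names may contain only letters and digits, so the prefix is concatenated directly instead of being joined with a dot.

```scala
import java.util.regex.Pattern
import scala.util.Try

object PrefixedGroups {
  // Simplified float; an assumption, not Rex's Number.SignedFloat.
  private val float = """[+-]?\d+(?:\.\d+)?"""

  // Builds the complex-number regex with every group name prefixed.
  def complex(prefix: String): String =
    s"(?<${prefix}re>$float)(?<${prefix}sign>[+-])(?<${prefix}im>$float)i"

  // Using the same names twice makes Pattern.compile throw.
  def clashes: Boolean =
    Try(Pattern.compile(complex("") + " " + complex(""))).isFailure

  val twoComplex: Pattern =
    Pattern.compile(complex("num1") + " " + complex("num2"))

  // Extracts the real part of the first number and the imaginary part of the second.
  def parts(input: String): Option[(String, String)] = {
    val m = twoComplex.matcher(input)
    if (m.find()) Some((m.group("num1re"), m.group("num2im"))) else None
  }

  def main(args: Array[String]): Unit = {
    println(clashes)            // true
    println(parts("1+2i 3+4i")) // Some((1,4))
  }
}
```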

Processing All Matches in an Input String

Most regex packages give you some way of iterating through all sections of an input string that a given pattern matches. However, I've often found this frustrating, because in my experience you usually want information about both the matched and the non-matched portions of the input. Rex provides this by default.

The key method for iterating through matches in a string is findAllIn, which returns an object of type Iterator[MatchResult]. All MatchResult instances have a boolean value matched, which is true if the MatchResult represents a section of input that was matched by the invoking pattern, and false if the MatchResult represents a section of the input that could not be matched by the invoking pattern. How this works is made clear by the example below.

assert( (for(m <- Lit("a").findAllIn("aabbabb")) yield m.string).mkString("") === "aabbabb")
assert( (for(m <- Lit("a").findAllIn("aabbabb") if (m.matched)) yield m.string).mkString("") === "aaa")
assert( (for(m <- Lit("a").findAllIn("aabbabb") if (!m.matched)) yield m.string).mkString("") === "bbbb")

For any given MatchResult M, if M.matched is true then sections of input matching named subpatterns may be extracted using the M(groupName) convention, whereas if M.matched is false, the only operation you can perform to obtain the (non-)matching input is M.string.
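
With a standard regex package you have to reconstruct the non-matched sections yourself from match offsets. The sketch below shows one way to do that with java.util.regex. It is not Rex's implementation, and the Segment and Segments names are hypothetical; it simply produces the same interleaving of matched and non-matched pieces as the example above.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

// One piece of the input, flagged as matched or not (a stand-in for MatchResult).
final case class Segment(text: String, matched: Boolean)

object Segments {
  // Splits the input into alternating matched and non-matched segments, in order.
  def split(regex: String, input: String): List[Segment] = {
    val m = Pattern.compile(regex).matcher(input)
    val out = ListBuffer.empty[Segment]
    var pos = 0
    while (m.find()) {
      // Emit the gap between the previous match and this one, if any.
      if (m.start() > pos) out += Segment(input.substring(pos, m.start()), matched = false)
      out += Segment(m.group(), matched = true)
      pos = m.end()
    }
    // Emit whatever is left after the last match.
    if (pos < input.length) out += Segment(input.substring(pos), matched = false)
    out.toList
  }

  def main(args: Array[String]): Unit =
    Segments.split("a", "aabbabb").foreach(println)
}
```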

Simple String Replacement

findAllIn allows you to perform very complex string manipulation. However, if you simply need to replace parts of the input string with a constant string, see the replaceAllIn API, as it will be significantly simpler.

The Tokenizer Class

Rex provides a Tokenizer class, so-called because it can be used to process tokens from computer code in different ways--in fact, I wrote it primarily so I could provide bolding and colorization to the Scala code in this document. However, I've certainly found the general concept useful in the past, so included it in Rex.

A Tokenizer operates by taking a number of Rex patterns, each with an associated function, and iterating through an input string. Each subsection of the input matched by one of the provided patterns is passed to the function associated with that pattern. A default function is used to process sections of the input that are not matched by any of the provided patterns.
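
The general idea can be sketched with nothing but java.util.regex. This is not the Rex Tokenizer; the MiniTokenizer object and its signature are assumptions for illustration. It joins the rule regexes into one alternation with a capturing group per rule, so it assumes the rule regexes contain no capturing groups of their own, and when two rules could match at the same position the first one listed wins (I don't know whether Rex resolves ties the same way).

```scala
import java.util.regex.Pattern

object MiniTokenizer {
  type Handler = String => String

  def tokenize(input: String, default: Handler, rules: Seq[(String, Handler)]): Seq[String] = {
    require(rules.nonEmpty, "at least one rule is needed")
    // One capturing group per rule, so group i + 1 belongs to rules(i).
    val combined = Pattern.compile(rules.map { case (re, _) => s"($re)" }.mkString("|"))
    val m = combined.matcher(input)
    val out = Seq.newBuilder[String]
    var pos = 0
    while (m.find()) {
      // Text between matches goes to the default handler.
      if (m.start() > pos) out += default(input.substring(pos, m.start()))
      // Exactly one alternative took part in the match; find which one.
      val ruleIndex = rules.indices.find(i => m.group(i + 1) != null).get
      out += rules(ruleIndex)._2(m.group())
      pos = m.end()
    }
    if (pos < input.length) out += default(input.substring(pos))
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val rules: Seq[(String, Handler)] = Seq(
      ("a", (_: String) => "1"),
      ("b", (_: String) => "2")
    )
    println(tokenize("fabaabbc", (_: String) => "?", rules).mkString) // ?121122?
  }
}
```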

The API docs have more information. As I want to get this version of Rex out before the weekend, I'm simply going to show first the tokenizer in the test suite, and then the tokenizer I constructed to process Scala code for this document.

val t = new Tokenizer(
	(mr: MatchResult) => "?",
	Seq(
		Lit("a") -> ((mr: MatchResult) => "1"),
		Lit("b") -> ((mr: MatchResult) => "2")
	)
)
assert(t.tokenize("fabaabbc").mkString === "?121122?")

val tripleQuotedString = "\"\"\"" +~ Chars.Any*<0 +~ "\"\"\""

val htmlTag = "<" +~ Chars.Any*<0 +~ ">"
val singleQuotedString = "\"" +~ (("\\" +~ Chars.Any) | !CharSet("\"") | htmlTag)*<0 +~ "\""
val keyword = Word.Boundary +~
	("if"|"then"|"else"|"def"|"class"|"public"|"private"|"implicit"|"lazy"|
	"extends"|"with"|"case"|"final"|"sealed"|"while"|"repeat"|"until"|"import"|"package"|
	"new"|"override"|"try"|"catch"|"finally"|"throw"|"match"|"val"|"var") +~ Word.Boundary
val lineComment = "//" +~ Chars.Any*<0 +~ CharSet("\n\r")
val blockComment = "/*" +~ Chars.Any*<0 +~ "*/"
val character = "'" +~ (!CharSet("'") | """\'""") +~ "'"

def wrapWithClass(clss: String, content: String) = """<span class="%s">%s</span>""" format (clss, content)

val scalaHighlighter = new Tokenizer[String](
	(mr: MatchResult) => mr.string,
	Seq(
		htmlTag -> (mr => mr.string),
		tripleQuotedString -> (mr => wrapWithClass("scalaString", mr.string)),
		singleQuotedString -> (mr => wrapWithClass("scalaString", mr.string)),
		keyword -> (mr => wrapWithClass("scalaKeyword", mr.string)),
		blockComment -> (mr => wrapWithClass("scalaBlockComment", mr.string)),
		lineComment -> (mr => wrapWithClass("scalaLineComment", mr.string)),
		character -> (mr => wrapWithClass("scalaCharacter", mr.string))
	)
)