In this chapter, we will show you the ways to overcome such situations.
An Example of a Regexp
For a better understanding of the issue, let’s start at examples.
Imagine having a string and intending to check whether it involves words with an optional \s? space after each of them. You can use the ^(\w+\s?)*$ regexp, specifying 0 or more words, like this:
This code works, and the result is correct. But, it takes too much time on certain things.
Not all regular expression engines can handle it.
Words and Strings
So, the reason of the hanging is that a word may be represented as one \w+ or multiple:
As the string ends with an exclamation sign !, it’s obvious that there is no match. But, at the end, the regular expression considers a \w character or a space \s. But the engine is not aware of that.
As there are too many combinations to search for, it takes too much time.
There exist two primary solutions to the problem.
The first solution is to lower the number of possible combinations. For that purpose, you can just rewrite the regexp, using its equivalent, like in the example below:
As you can see in the example above, the * goes after \w+\s instead of \w+\s?.
Representing one word of the string with multiple successive \w+ becomes impossible.
The (\w+\s?)* pattern can match the word string as two \w+, like this:
In this case, there may be \w+\s or \w+\s\w+\s, and not \w+\w+.
The second solution is forbidding to backtrack for the quantifier.
The engine of the regexp tries different combinations that can be wrong for a human.
For instance, it’s obvious that in the (\d+)*$ regexp + couldn’t backtrack.
Replacing one \d+ with two separate \d+\d+ will not change anything:
To prevent the problems, modern engines use greedy quantifiers that don’t backtrack.
Lookahead is used for preventing backtracking.
Let’s interpret how it works:
- Lookahead ?= searches for the longest word \w+, beginning at the current position.
- The parentheses’ contents with ?=... is not remembered by the engine. So, wrapping into parentheses will make the engine memorize the contents.
- It is referenced in the pattern as \1 by all of us.
So, when there is a word \w+, you should match it as \1.
The reason is that a lookahead detects a word \w+ as a whole and you capture it in the pattern using \1. Essentially, a possessive plus + quantifier is implemented, capturing the whole word and not a part of it.
Let’s see the following example:
If you put a more complicated regexp to (?=(\w+))\1 rather than \w, you should prevent backtracking for + after that.
In the example below, lookahead is used for preventing backtracking:
In the example above, instead of \1 is used \2 as additional outer parentheses are there. To prevent a mess with numbers, you can name the parentheses, like here:
Here we overviewed the two main solutions to that problem: lowering the possible combinations count and preventing backtracking.
The most common and efficient way to prevent backtracking is using lookahead.