In this chapter, we will see how the search works with greedy and lazy quantifiers.
Imagine you have a text and need to replace all the quotes "..." with guillemet marks «...». For instance, "Welcome to W3Docs", must become «Welcome to W3Docs».
Here, the first step should be locating the quoted strings and then replacing them. At first sight, the regular expression /".+"/g can seem appropriate but it is not.
Let’s see an example:
So, you can see that is doesn’t work properly, in the example above.
For the proper work, it is necessary to use the greedy search.
A regular expression engine uses the algorithm below for finding a match:
- For each position in the string, the pattern is matched at that position.
- In case there is not any match, it is necessary to go to the next position.
Let’s see how the search will work for the ".+" pattern.
- The initial pattern character is the " quote.
- The quote is found. Afterward, the engine attempts to find a match for the rest of the pattern. As in this case, the character is a dot, the string letter will be 'J'. The process is demonstrated in the picture below:
- Because of the quantifier .+, the dot repeats. The regexp engine adds one character after another to the match. It stops when the end of the string is reached:
- In this stage, the engine ends repeating .+, trying to find the next character of the pattern. It will be the " quote. But a problem has occurred: the string has added, and no more characters have remained. The regexp engine understands that has got too many .+ and starts backtracking. That is, it cuts the match for the qualifier by a single character, like in the picture below:
So, it assumes that .+ finishes one character before the string end, trying to match the rest of the pattern from that position. In case there was a quote there, the search would finish but 's' is the last character.
- Therefore, the engine reduces the number of the repetitions .+ by another character. It looks like this:
So, the '"' quote doesn’t correspond to 'k'.
- The engine continues to backtrack, reducing the count of repetition for '.' till the rest of the pattern matches.
- The full match is there.
So, it can be assumed that in the greedy mode, the quantifier replays as much as possible.
Now, let’s see what a lazy mode can do.
Now, let’s investigate it step-by-step.
- The initial step is finding the pattern start '"' in the third position:
- The second step is finding a match for the dot '.', like this:
- As we have a lazy mode for +?, the engine doesn’t attempt to match a dot once more but stops and tries to match the rest of the pattern '"':
- then the regexp engine raises the number of repetitions for the dot attempting once more:
Again there is a failure. And the number of repetitions raises again and again.
- So, it happens until the rest of the pattern is found.
- The following search starts begins from the end of the current match, yielding another result:
So, the lazy mode is only enabled for the quantifier with ?.
The rest of the quantifiers stay greedy.
Modern regexp engines are capable of optimizing internal algorithms for faster work. So, it is likely that they will work differently from the algorithm, described above. As a rule, they are used internally for optimizing things. Complex regular expressions are not easy to optimize, so for them, the search can work exactly as described.
An Alternative Approach
Regular expressions usually provide several ways to do the same thing. In the example below, quoted strings are found without using the lazy mode, like this:
The regexp "[^"]+" leads to correct results as it searches for a quote '"' followed by a single or more non-quotes [^"], and, finally, the closing quote. Once the regexp engine searches for [^"]+, it stops the repetitions when it comes across the closing quote.
Please, take into account that this logic will not replace lazy quantifiers. In another example, the lazy quantifiers fail and this option works properly. For example, if you intend to find links of the form <a href="..." class="doc"> along with any href.
Let’s consider an example of using /<a href=".*" class="doc">/g:
This example works. But let’s see what will happen if lots of links exist in the text:
So, in this case, the quantifier .* again received too many characters. The match will be as follows:
<a href="....................................." class="className"> <a href="link1" class="className">... <a href="link2" class="className">
Let’s see what will happen if we try to make the quantifier .*? lazy:
Now, two matches appeared:
<a href="....." class="className"> <a href="....." class="className"> <a href="link1" class="className">... <a href="link2" class="className">
Two modes of work exist for quantifiers: greedy and lazy. In the greedy mode, the regular expression engine attempts to repeat the quantifier as many times as possible. For example, \d+ will consume all the possible digits. When consuming more is impossible, it keeps on matching the rest of the pattern. If no match is found, it decreases the number of repetitions. The lazy mode is activated by the ? mark after the quantifier. The regular expression engine attempts to match the rest of the pattern before every repetition.