Placing several characters or character classes inside square brackets […] tells the regexp to search for any one of the given characters.
To be precise, let’s consider an example. Here, [lam] means any of the three characters 'l', 'a', or 'm'. This is known as a “set”. Sets can be used in a regexp together with regular characters.
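A minimal sketch of a set used next to regular characters. The test string is only an illustration and is not taken from the text above.

```javascript
// [lam] stands for exactly one of "l", "a" or "m"; "op" must follow literally.
const setPattern = /[lam]op/g;
const setMatches = "lop mop top".match(setPattern);

console.log(setMatches); // [ 'lop', 'mop' ]
```

"top" is skipped because "t" is not one of the characters listed in the set.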
Although a set lists several characters, it corresponds to exactly one character in the match.
So, the pattern W[3D]ocs finds no match in a string such as W3Docs.
The pattern looks for W, then one of the characters in [3D], and, finally, ocs.
So, it could match W3ocs or WDocs, but not W3Docs, where two characters stand between W and ocs.
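The behaviour described above can be checked with a short sketch. The input strings are chosen for illustration only.

```javascript
// "3" and "D" both appear in the input, but the set stands for only ONE character.
const noMatch = "W3Docs".match(/W[3D]ocs/);
console.log(noMatch); // null

// With a single character between "W" and "ocs", the pattern matches.
const oneChar = "W3ocs WDocs".match(/W[3D]ocs/g);
console.log(oneChar); // [ 'W3ocs', 'WDocs' ]
```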
Square brackets can also include the so-called character ranges.
For example, [a-m] is a character in the range from “a” to “m”, and [0-7] is a digit from “0” to “7”.
Let’s take the pattern x[0-9A-F][0-9A-F], where “x” is followed by two digits or letters from A to F.
Here, [0-9A-F] includes two ranges: it looks for a character that is either a digit from “0” to “9” or a letter from “A” to “F”.
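As a sketch, the two-range set can be tried on a string that contains a hexadecimal number. The string itself is an assumed example.

```javascript
// "x" followed by two characters, each a digit 0-9 or a letter A-F.
const hexPattern = /x[0-9A-F][0-9A-F]/g;
const hexMatches = "Exception 0xAF".match(hexPattern);

console.log(hexMatches); // [ 'xAF' ]
```

The "x" inside "Exception" does not match, because the two characters after it are lowercase letters outside both ranges.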
If you also want to find lowercase letters, you can either add the a-f range, giving [0-9A-Fa-f], or add the i flag.
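Both options can be compared side by side. This is a small sketch with an assumed lowercase input string.

```javascript
const lowerInput = "color 0xaf";

// The uppercase-only set misses lowercase hex digits.
const strict = lowerInput.match(/x[0-9A-F][0-9A-F]/g);
console.log(strict); // null

// Option 1: add the a-f range to the set.
const withRange = lowerInput.match(/x[0-9A-Fa-f][0-9A-Fa-f]/g);
console.log(withRange); // [ 'xaf' ]

// Option 2: keep the set and add the case-insensitive i flag.
const withFlag = lowerInput.match(/x[0-9A-F][0-9A-F]/gi);
console.log(withFlag); // [ 'xaf' ]
```

Note that the i flag applies to the whole pattern, so an uppercase “X” would match as well.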
Inside […], you can also use character classes.
For example, to search for a wordy character \w or a hyphen -, the set is [\w-]. Different classes can also be combined: [\s\d] means a space or a digit.
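A brief sketch of both sets in use. The input strings are illustrative assumptions.

```javascript
// [\w-]: one wordy character or a hyphen (the hyphen comes last, so it is literal).
const wordOrHyphen = "a-b c".match(/[\w-]/g);
console.log(wordOrHyphen); // [ 'a', '-', 'b', 'c' ]

// [\s\d]: one whitespace character or one digit.
const spaceOrDigit = "a1 b2".match(/[\s\d]/g);
console.log(spaceOrDigit); // [ '1', ' ', '2' ]
```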
As \w is a shorthand for [a-zA-Z0-9_], it’s not capable of finding Cyrillic letters, Chinese hieroglyphs, and so on.
A more universal pattern can be written, one that finds wordy characters in any language. With Unicode properties, it’s quite easy: [\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}].
Let’s interpret it. Like \w, the set covers several kinds of characters, each given by a Unicode property:
- for letters: Alphabetic (Alpha).
- for accents: Mark (M).
- for digits: Decimal_Number (Nd).
- for the underscore and similar characters: Connector_Punctuation (Pc).
- for the two special codes 200c and 200d, used in ligatures such as those in Arabic: Join_Control (Join_C).
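The properties listed above can be combined into a single set. The sketch below assumes a JavaScript engine that supports Unicode property escapes, which need the u flag. The sample string is an illustration.

```javascript
// Unicode property escapes (\p{…}) work only when the u flag is set.
const wordy = /[\p{Alpha}\p{M}\p{Nd}\p{Pc}\p{Join_C}]/gu;
const sample = "Hi 你好 12";

const unicodeMatches = sample.match(wordy);
console.log(unicodeMatches); // [ 'H', 'i', '你', '好', '1', '2' ]

// For comparison, \w only covers Latin letters, digits and the underscore.
const asciiMatches = sample.match(/\w/g);
console.log(asciiMatches); // [ 'H', 'i', '1', '2' ]
```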
There is another type of range besides normal ranges: excluding ranges, which look like [^…]. They are marked by a caret character ^ at the start and match any character except the given ones.
For example, [^\d\sA-Z] used with the i flag searches for any character except letters, spaces, and digits.
Escaping in […]
As a rule, when a special character needs to be found, it should be escaped, like \. for a dot. If a backslash is needed, then \\ is used.
The vast majority of special characters can be used in square brackets without escaping:
- the . + ( ) symbols don’t need escaping.
- a hyphen - is not escaped at the start or the end, where it does not define a range.
- a caret ^ is escaped only at the start, where it means exclusion.
- the closing square bracket ] is always escaped.
In other words, you can use all the special characters without escaping, except when they mean something for square brackets.
A dot . inside square brackets means just a literal dot.
The [.,] pattern searches for one of two characters: a dot or a comma.
Here is an example:
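The following is a minimal sketch. The arithmetic string is an assumed input, and the second pattern is there only to show that extra escaping does no harm when the u flag is absent.

```javascript
// None of these characters needs a backslash inside the brackets.
const special = /[*().^+]/g;
const plainResult = "1 + 2 * 3".match(special);
console.log(plainResult); // [ '+', '*' ]

// Escaping them "just in case" gives the same result here.
const escaped = /[\*\(\)\.\^\+]/g;
const escapedResult = "1 + 2 * 3".match(escaped);
console.log(escapedResult); // [ '+', '*' ]
```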
In the example above, the search is for one of the following characters: *().^+.
Ranges and Flag “u”
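If the set contains surrogate pairs, the u flag is required for them to work properly.
For instance, take a search for [𝒳𝒴] in the string 𝒳 without the flag. Both 𝒳 (U+1D4B3) and 𝒴 (U+1D4B4) are stored as surrogate pairs, and the search returns a broken character instead of 𝒳.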
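A small sketch of the faulty search. It uses the characters 𝒳 (U+1D4B3) and 𝒴 (U+1D4B4), chosen here because each is stored as a surrogate pair.

```javascript
// No u flag: the set is read as four separate 16-bit units.
const withoutU = "𝒳".match(/[𝒳𝒴]/);

// Only the first half of the pair is matched, not the whole character.
console.log(withoutU[0].length);       // 1
console.log(withoutU[0] === "𝒳");      // false
console.log(withoutU[0] === "𝒳"[0]);   // true
```

Printing withoutU[0] directly usually shows a replacement glyph, since half of a surrogate pair is not a displayable character on its own.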
The result is not correct because regular expressions by default don’t recognize surrogate pairs.
The regexp engine thinks that [𝒳𝒴] contains four characters, not two:
- the left half of 𝒳 (1).
- the right half of 𝒳 (2).
- the left half of 𝒴 (3).
- the right half of 𝒴 (4).
Each half is a separate 16-bit unit with its own code.
So, only the left half of 𝒳 is found and shown.
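The codes of the halves can be inspected with charCodeAt. This sketch again uses 𝒳 (U+1D4B3) and 𝒴 (U+1D4B4) as example characters.

```javascript
// Each character occupies two 16-bit units, so charCodeAt(0) and charCodeAt(1)
// return the codes of its left and right halves.
const codes = ["𝒳", "𝒴"].map((ch) => [ch.charCodeAt(0), ch.charCodeAt(1)]);

console.log(codes); // [ [ 55349, 56499 ], [ 55349, 56500 ] ]
```

Both characters share the same left half (55349), which is why the search without the u flag stops at that unit.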
Adding the u flag fixes this: the engine then treats each surrogate pair as a single character, so the whole character is found.
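With the u flag the same kind of search behaves as expected. A sketch, assuming the example characters 𝒳 (U+1D4B3) and 𝒴 (U+1D4B4):

```javascript
// The u flag makes the engine read each surrogate pair as one character.
const withU = "𝒳".match(/[𝒳𝒴]/u);

console.log(withU[0]);        // 𝒳
console.log(withU[0].length); // 2 (both halves of the pair)
```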
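A similar situation occurs when searching for a range such as [𝒳-𝒴].
Forgetting to add the u flag leads to an “Invalid regular expression” error.
With the u flag, the pattern works properly.
The error happens because without the u flag each surrogate pair is treated as two characters. The range then runs from the right half of 𝒳 (code 56499) to the left half of 𝒴 (code 55349), and a range whose start is greater than its end is invalid.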
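Both cases can be reproduced in one sketch. It builds the patterns with new RegExp so the invalid one fails at run time and can be caught, whereas a regexp literal would be rejected when the script is parsed. The characters 𝒳 (U+1D4B3) and 𝒴 (U+1D4B4) are example choices.

```javascript
let rangeError = null;
try {
  // Without the u flag the range is read as 56499-55349, which is out of order.
  new RegExp("[𝒳-𝒴]");
} catch (err) {
  rangeError = err;
}
console.log(rangeError instanceof SyntaxError); // true

// With the u flag the range spans whole characters and works.
const rangeWithU = new RegExp("[𝒳-𝒴]", "u");
const rangeMatch = "𝒴".match(rangeWithU);
console.log(rangeMatch[0]); // 𝒴
```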