Character classes let a regular expression distinguish between kinds of characters, for example between digits and letters.
Let’s start from a practical case. Imagine you have a phone number like +3(522) -865-42-76, and wish to turn it into pure digits (35228654276). To meet that goal, it is necessary to find and remove everything that’s not a digit. Character classes are there to help you with that.
So, a character class is a special notation that matches any symbol from a certain set.
We will start from the “digit” class. It should be written as \d and matches any single digit.
In the example below, let’s find the first digit:
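A minimal sketch of that search, assuming the phone number from the introduction as sample input (the variable names are illustrative):

```javascript
// Sample input: the phone number from the introduction
const phone = "+3(522) -865-42-76";

// Without the g flag, match() returns only the first match
const firstDigit = phone.match(/\d/);

console.log(firstDigit[0]); // 3
```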
Without the g flag, the regular expression looks only for the first match, that is, the first digit \d.
Adding the g flag will enable finding all the digits, like this:
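A short sketch with the same sample number; joining the matches with join("") is one possible way to assemble the result, not the only one:

```javascript
const phoneNumber = "+3(522) -865-42-76";

// With the g flag, match() returns an array of all matches
const allDigits = phoneNumber.match(/\d/g);

// Glue the separate digits back into a single string
console.log(allDigits.join("")); // 35228654276
```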
So, it is a character class for digits. But there exist other character classes, too.
The most used character classes are as follows:
- \d (comes from digit): a digit (a character from 0 to 9).
- \s (comes from space): a whitespace character. This includes spaces, tabs (\t), newlines (\n), and a few rarer characters (\v, \f, \r).
- \w (comes from word): either a letter of the Latin alphabet, a digit, or an underscore (_). Non-Latin letters don't belong to this class.
A regular expression can include regular symbols, as well as character classes.
Let’s see an example where CSS\d corresponds to a string CSS with a digit following it:
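A small sketch, assuming an arbitrary sample sentence that contains "CSS4":

```javascript
// Sample sentence chosen only for illustration
const cssText = "Is there CSS4?";

// The literal characters "CSS" followed by any single digit
const cssMatch = cssText.match(/CSS\d/);

console.log(cssMatch[0]); // CSS4
```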
Multiple character classes can be used, like this:
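A sketch with an illustrative sample string; each class in the pattern consumes exactly one character:

```javascript
const loveText = "I love HTML5!";

// \s matches the space, \w\w\w\w matches "HTML", \d matches "5"
const htmlMatch = loveText.match(/\s\w\w\w\w\d/);

console.log(JSON.stringify(htmlMatch[0])); // " HTML5"
```

Note that the match starts with the space before "HTML5": the earlier space before "love" does not qualify, because "love" is not followed by a digit.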
There is an “inverse class” for every character class, denoted by the same letter in uppercase.
“Inverse” means that it corresponds to all other characters:
- \D - non-digit. Matches any character except \d (for instance, a letter).
- \S - non-space. Matches any character except \s (for instance, a letter).
- \W - non-word character. Matches anything except \w (for instance, a non-Latin letter or a space).
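Inverse classes give a shorter route to the phone number task from the introduction: rather than collecting the digits, remove every non-digit. A sketch, assuming the standard String.prototype.replace method with the g flag:

```javascript
const rawPhone = "+3(522) -865-42-76";

// Replace every non-digit character with an empty string
const digitsOnly = rawPhone.replace(/\D/g, "");

console.log(digitsOnly); // 35228654276
```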
A dot (.) is considered a special character class corresponding to “any character except a newline”.
The example will look like this:
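A one-line sketch with an arbitrary single-character string:

```javascript
// The dot matches any single character except a newline
const dotMatch = "Z".match(/./);

console.log(dotMatch[0]); // Z
```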
In the example below, the dot is in the middle of a regexp:
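A sketch using an illustrative pattern, CS.4, against three sample strings:

```javascript
// "CS", then any single character, then "4"
const dotRegexp = /CS.4/;

const matchLetter = "CSS4".match(dotRegexp);
const matchHyphen = "CS-4".match(dotRegexp);
const matchSpace = "CS 4".match(dotRegexp);

console.log(matchLetter[0]); // CSS4
console.log(matchHyphen[0]); // CS-4
console.log(matchSpace[0]);  // CS 4 (a space is a character too)
```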
So, the dot means “any character”, but not the “absence of a character”.
There should be a character for matching it, like here:
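A sketch with the same illustrative pattern, this time applied to a string that has nothing between "CS" and "4":

```javascript
// No character sits between "CS" and "4", so the dot has nothing to match
const noCharMatch = "CS4".match(/CS.4/);

console.log(noCharMatch); // null
```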
A dot doesn’t correspond to the newline character \n by default.
For example, the regexp A.B corresponds to A, and then B with any character between them, except for an \n newline, like this:
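A sketch comparing two sample strings, one with a hyphen and one with a newline between the letters:

```javascript
// Any ordinary character between A and B is accepted
const hyphenBetween = "A-B".match(/A.B/);

// A newline between A and B is not matched by the dot
const newlineBetween = "A\nB".match(/A.B/);

console.log(hyphenBetween[0]); // A-B
console.log(newlineBetween);   // null
```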
There are circumstances when one wants a dot to mean “any character”, including newline.
The s flag is used for that. If a regexp has it, a dot matches literally any character, like this:
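A sketch, assuming a JavaScript engine that supports the s flag (it was added in ES2018, so very old environments may reject it):

```javascript
// With the s flag, the dot also matches the newline
const dotAllMatch = "A\nB".match(/A.B/s);

// JSON.stringify makes the newline visible in the output
console.log(JSON.stringify(dotAllMatch[0])); // "A\nB"
```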
It is important to pay special attention to spaces. To a human reader, the strings 1-5 and 1 - 5 are nearly identical. But if a regexp doesn’t take the spaces into account, it may fail to match.
Here is an attempt to find digits separated by a hyphen:
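A sketch of the pattern \d-\d applied to both spellings; it works for the compact form and fails for the spaced one:

```javascript
const compactMatch = "1-5".match(/\d-\d/);
const spacedMatch = "1 - 5".match(/\d-\d/);

console.log(compactMatch[0]); // 1-5
console.log(spacedMatch);     // null, because the pattern has no spaces
```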
Now, let’s fix it by adding spaces in the regular expression \d - \d, like here:
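A sketch of the corrected pattern; the \s variant shown next to it is an alternative that accepts any whitespace character rather than only a literal space:

```javascript
// Literal spaces in the pattern match literal spaces in the string
const withSpaces = "1 - 5".match(/\d - \d/);

// Alternative: \s matches a space, tab, or other whitespace character
const withClass = "1 - 5".match(/\d\s-\s\d/);

console.log(withSpaces[0]); // 1 - 5
console.log(withClass[0]);  // 1 - 5
```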
A space is a character, equal in importance to any other character. You can’t add or remove spaces from a regexp and expect it to work the same way. In other words, every character in a regexp matters.