Java Regex Syntax
Java regular expression syntax — characters, classes, anchors, quantifiers, and special constructs.
A regular expression is a small pattern language for describing text. Java implements it in the java.util.regex package, and the syntax is the Perl-style dialect shared by most modern languages — with one Java-specific wrinkle: every backslash in a pattern must be doubled in a Java string literal, because the compiler eats one before the regex engine ever sees it. This chapter is a reference to that syntax: the building blocks you combine to match, search, and validate text.
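The doubled-backslash rule is easiest to see by inspecting the string itself. The following is a minimal sketch; the class name BackslashDemo and its members are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: BackslashDemo and its members are hypothetical names.
final class BackslashDemo {
    // The Java literal "\\d+" holds three characters: backslash, 'd', '+'.
    static final String DIGITS = "\\d+";

    // True when the whole input consists of one or more digits.
    static boolean allDigits(String input) {
        return Pattern.matches(DIGITS, input);
    }

    public static void main(String[] args) {
        System.out.println(DIGITS);             // prints \d+ (what the regex engine sees)
        System.out.println(DIGITS.length());    // prints 3
        System.out.println(allDigits("2024"));  // prints true
        System.out.println(allDigits("20x4"));  // prints false
    }
}
```

Printing the string shows the single backslash that survives compilation, which is the form the regex engine actually receives.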
Literals and metacharacters
Most characters in a pattern match themselves: cat matches the three letters c, a, t. The power comes from metacharacters, characters with special meaning that you combine into rules. The fourteen that the engine treats specially are:
. ^ $ * + ? ( ) [ ] { } | \
To match one of these literally, escape it with a backslash. Remember the double-backslash rule for Java source: a regex \. is written "\\." in code.
Pattern.matches("a.c", "abc"); // true — '.' matches any char
Pattern.matches("a.c", "a.c"); // true — '.' also matches a literal dot
Pattern.matches("a\\.c", "abc"); // false — '\.' matches ONLY a literal dot
Pattern.matches("a\\.c", "a.c"); // true
Character classes
A character class in square brackets matches any one character from a set. Ranges use a hyphen, and a leading ^ negates the set.
"[aeiou]" // any one lowercase vowel
"[a-z]" // any one lowercase letter
"[A-Za-z0-9]" // any letter or digit
"[^0-9]" // any character that is NOT a digit
Java also offers predefined classes as shorthand. These are the ones you reach for constantly:
| Shorthand | Equivalent | Matches |
|---|---|---|
| . | (none) | Any character except a line terminator |
| \d | [0-9] | A digit |
| \D | [^0-9] | A non-digit |
| \w | [a-zA-Z0-9_] | A word character |
| \W | [^a-zA-Z0-9_] | A non-word character |
| \s | [ \t\n\x0B\f\r] | A whitespace character |
| \S | [^\s] | A non-whitespace character |
The uppercase form is always the negation of the lowercase form.
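As a rough illustration of bracketed and predefined classes in use, the sketch below strips non-digits with \D and counts vowels with [aeiou]. The class CharClassDemo, its method names, and the sample inputs are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: CharClassDemo and its methods are hypothetical names.
final class CharClassDemo {
    private static final Pattern NON_DIGIT = Pattern.compile("\\D");
    private static final Pattern VOWEL = Pattern.compile("[aeiou]");

    // Removes every character that is not a digit.
    static String digitsOnly(String input) {
        return NON_DIGIT.matcher(input).replaceAll("");
    }

    // Counts lowercase vowels by calling find() until it runs out of matches.
    static int countVowels(String input) {
        Matcher m = VOWEL.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("(555) 010-4477"));         // prints 5550104477
        System.out.println(countVowels("regular expression"));    // prints 7
        System.out.println(Pattern.matches("[A-Za-z0-9]", "_"));  // prints false
        System.out.println(Pattern.matches("\\w", "_"));          // prints true: \w includes '_'
        System.out.println(Pattern.matches("\\s", "\t"));         // prints true
    }
}
```

The last three lines show the one place where \w differs from [A-Za-z0-9]: the underscore counts as a word character.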
Quantifiers: greedy, reluctant, possessive
A quantifier says how many times the preceding element may repeat. By default quantifiers are greedy — they grab as much as possible, then back off if the rest of the pattern needs it. Add ? to make a quantifier reluctant (match as little as possible), or + to make it possessive (grab and never give back).
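The three modes can be compared side by side on one input. This is a minimal sketch; the class QuantifierModes and the helper firstMatch are hypothetical names, not part of java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: QuantifierModes and firstMatch are hypothetical names.
final class QuantifierModes {
    // Returns the first match of regex in input, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: .+ runs to the end, then backs off one character to find the last '>'.
        System.out.println(firstMatch("<.+>", input));   // prints <a><b>
        // Reluctant: .+? takes one character at a time and stops at the first '>'.
        System.out.println(firstMatch("<.+?>", input));  // prints <a>
        // Possessive: .++ runs to the end and refuses to back off, so '>' can never match.
        System.out.println(firstMatch("<.++>", input));  // prints null
    }
}
```

The possessive result is the one that tends to surprise: because .++ never gives back the final >, the pattern fails outright instead of falling back to a shorter match.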
| Quantifier | Meaning |
|---|---|
| * | Zero or more |
| + | One or more |
| ? | Zero or one (optional) |
| {n} | Exactly n |
| {n,} | At least n |
| {n,m} | Between n and m |

"\\d{3}" // exactly three digits
"\\d{2,4}" // two to four digits
"a+" // one or more 'a'
"colou?r" // matches "color" and "colour"
"<.+>" // greedy: on "<a><b>" matches the whole "<a><b>"
"<.+?>" // reluctant: on "<a><b>" matches just "<a>"
Anchors, boundaries, and alternation
Anchors match a position, not a character. ^ is the start of input (or of each line, in multiline mode), $ is the end, and \b is a word boundary: the zero-width spot between a \w and a \W, or between a \w and the start or end of the input. Alternation with | matches either side.
"^Hello" // "Hello" only at the start
"\\.txt$" // ".txt" only at the end
"\\bcat\\b" // "cat" as a whole word, not inside "category"
"cat|dog" // "cat" or "dog"
"^(cat|dog)$" // the whole string is exactly "cat" or "dog"
Note that | has very low precedence: ^cat|dog$ means (^cat)|(dog$), not ^(cat|dog)$. Wrap alternatives in a group when you want anchors to apply to both.
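The precedence rule, along with word boundaries and multiline mode, can be checked with a few calls to find(). This is a sketch under assumptions: the class AnchorDemo, the helper found, and the sample inputs are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative only: AnchorDemo and found are hypothetical names.
final class AnchorDemo {
    // True when the regex matches anywhere in the input (find, not matches).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Without a group, the anchors bind to one alternative each.
        System.out.println(found("^cat|dog$", "catalog"));          // prints true (^cat matches)
        System.out.println(found("^(cat|dog)$", "catalog"));        // prints false
        System.out.println(found("^(cat|dog)$", "dog"));            // prints true
        // \b needs a word character on one side only.
        System.out.println(found("\\bcat\\b", "a cat nap"));        // prints true
        System.out.println(found("\\bcat\\b", "category"));         // prints false
        // ^ matches after a line terminator only in multiline mode.
        System.out.println(found("(?m)^line", "first\nline two"));  // prints true
        System.out.println(found("^line", "first\nline two"));      // prints false
    }
}
```

The helper uses find() rather than matches() on purpose: matches() anchors the whole input implicitly, which would hide the effect of the explicit ^ and $.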
Groups, backreferences, and inline flags
Parentheses create a capturing group — the engine remembers what each group matched, numbered left to right starting at 1. (?:...) is a non-capturing group when you only need to apply a quantifier. A backreference \1 matches the same text the first group captured. Inline flags like (?i) change matching behavior without a separate Pattern.compile flag.
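Reading captured groups back goes through Matcher.group(int). The sketch below is one way to do that; the class GroupDemo, its method names, and the sample inputs are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: GroupDemo and its methods are hypothetical names.
final class GroupDemo {
    private static final Pattern YEAR_MONTH = Pattern.compile("(\\d{4})-(\\d{2})");
    private static final Pattern REPEATED_WORD = Pattern.compile("(\\w+) \\1");

    // Returns {year, month} from the first yyyy-mm occurrence, or null if none.
    static String[] yearAndMonth(String input) {
        Matcher m = YEAR_MONTH.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) }; // group(0) would be the whole match
    }

    // Returns the first word that is immediately repeated, or null if none.
    static String repeatedWord(String input) {
        Matcher m = REPEATED_WORD.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] ym = yearAndMonth("due 2024-03");
        System.out.println(ym[0] + " / " + ym[1]);                    // prints 2024 / 03
        System.out.println(repeatedWord("paris in the the spring"));  // prints the
        // A non-capturing group still groups, but adds no numbered group.
        System.out.println(Pattern.compile("(?:ab)+").matcher("").groupCount()); // prints 0
        System.out.println(Pattern.matches("(?:ab)+", "ababab"));     // prints true
        System.out.println(Pattern.matches("(?i)java", "JaVa"));      // prints true
    }
}
```

Group numbers follow the order of the opening parentheses, so group(1) is the year and group(2) the month regardless of what surrounds the match.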
"(\\d{4})-(\\d{2})" // group 1 = year, group 2 = month
"(?:ab)+" // repeats "ab" without capturing it
"(\\w+) \\1" // a word followed by itself ("the the")
"(?i)java" // case-insensitive: matches "Java", "JAVA"
"(?m)^line" // multiline: ^ matches at each line start
A worked example: the constructs in action
This program exercises a digit class with a quantifier, anchored alternation, a backreference, greedy versus reluctant matching, the \w+ shorthand, and an inline case-insensitive flag, using nothing beyond java.util.regex.
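A minimal sketch of such a program follows. It is a reconstruction, not the original listing: the class name RegexSyntaxDemo, the helper findAll, and the input strings used for the \d{4} and backreference checks are assumptions, chosen to be consistent with the observations that follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Reconstruction for illustration: names and sample inputs are assumptions.
final class RegexSyntaxDemo {
    // Collects the given group (0 = whole match) from every match, via repeated find().
    static List<String> findAll(String regex, String input, int group) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        // Digit class with a quantifier.
        System.out.println(findAll("\\d{4}", "released 1995, revised 2011", 0)); // [1995, 2011]
        // Alternation under matches(), which anchors the whole input.
        System.out.println(Pattern.matches("cat|dog", "dog"));     // true
        System.out.println(Pattern.matches("cat|dog", "catnap"));  // false
        // Backreference: report each immediately repeated word.
        System.out.println(findAll("(\\w+) \\1", "the the cat is is here", 1)); // [the, is]
        // Greedy versus reluctant, first match only.
        System.out.println(findAll("<.+>", "<a><b>", 0).get(0));   // <a><b>
        System.out.println(findAll("<.+?>", "<a><b>", 0).get(0));  // <a>
        // \w+ tokens, then an inline case-insensitive flag.
        System.out.println(findAll("\\w+", "ab, cd-ef!", 0).size()); // 3
        System.out.println(Pattern.matches("(?i)java", "JAVA"));   // true
    }
}
```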
What to take from the run:
- \d{4} found both 1995 and 2011 because find() scans for every match in the input, while a class-plus-quantifier (\d repeated {4} times) is the canonical way to match a fixed-width field. The doubled backslash in "\\d{4}" is the Java string literal producing the single-backslash regex the engine wants.
- Pattern.matches("cat|dog", "dog") returned true but the same pattern on "catnap" returned false: matches() implicitly anchors the whole input, so even though cat appears in catnap, the trailing nap is left unmatched and the overall match fails.
- The backreference \1 turned (\w+) \1 into "a word followed by the identical word," which is why it reported "the" and "is" (the two stutters) and ignored every word that was not immediately repeated. Backreferences match captured text, not the pattern again.
- On the same <a><b> input, greedy <.+> swallowed the entire string while reluctant <.+?> stopped at the first >, yielding just <a>. This single contrast is the most common regex bug fix you will ever make: add ? to a quantifier when it grabs too much.
- \w+ counted 3 tokens in "ab, cd-ef!" (ab, cd, and ef) because the comma, hyphen, and exclamation mark are all \W (non-word) characters that break a run of word characters. The inline (?i) flag then matched java against JAVA, showing flags can live inside the pattern itself rather than only in Pattern.compile.
Practice
On the input '<a><b>', why does the regex '<.+>' match the whole string '<a><b>' while '<.+?>' matches only '<a>'?