W3docs

Java Regex Syntax

Java regular expression syntax — characters, classes, anchors, quantifiers, and special constructs.

A regular expression is a small pattern language for describing text. Java implements it in the java.util.regex package, and the syntax is the Perl-style dialect shared by most modern languages — with one Java-specific wrinkle: every backslash in a pattern must be doubled in a Java string literal, because the compiler eats one before the regex engine ever sees it. This chapter is a reference to that syntax: the building blocks you combine to match, search, and validate text.

Literals and metacharacters

Most characters in a pattern match themselves: cat matches the three letters c, a, t. The power comes from metacharacters, characters with special meaning that you combine into rules. The fourteen that the engine treats specially are:

. ^ $ * + ? ( ) [ ] { } | \

To match one of these literally, escape it with a backslash. Remember the double-backslash rule for Java source: a regex \. is written "\\." in code.

Pattern.matches("a.c", "abc");   // true  — '.' matches any char
Pattern.matches("a.c", "a.c");   // true  — '.' also matches a literal dot
Pattern.matches("a\\.c", "abc"); // false — '\.' matches ONLY a literal dot
Pattern.matches("a\\.c", "a.c"); // true
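When the text to match arrives in a variable, escaping each metacharacter by hand is easy to get wrong. The following is a minimal sketch using Pattern.quote from java.util.regex, which makes every character in a string literal; the class name LiteralMatchDemo, the helper matchesLiterally, and the sample strings are illustrative assumptions.

```java
import java.util.regex.Pattern;

class LiteralMatchDemo {
    // Returns true if `input` is exactly the text `literal`, with every
    // metacharacter in `literal` treated as an ordinary character.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unquoted, '+' is a quantifier, so the pattern expects "11=2", "111=2", ...
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));   // false
        // Quoted, '+' is just a plus sign.
        System.out.println(matchesLiterally("1+1=2", "1+1=2"));  // true
        // Quoted, '.' no longer stands for "any character".
        System.out.println(matchesLiterally("a.c", "abc"));      // false
    }
}
```

Quoting suits user-supplied search terms; hand-written patterns are usually clearer with explicit backslash escapes.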

Character classes

A character class in square brackets matches any one character from a set. Ranges use a hyphen, and a leading ^ negates the set.

"[aeiou]"     // any one lowercase vowel
"[a-z]"       // any one lowercase letter
"[A-Za-z0-9]" // any letter or digit
"[^0-9]"      // any character that is NOT a digit

Java also offers predefined classes as shorthand. These are the ones you reach for constantly:

| Shorthand | Equivalent | Matches |
|---|---|---|
| . | | Any character except a line terminator |
| \d | [0-9] | A digit |
| \D | [^0-9] | A non-digit |
| \w | [a-zA-Z0-9_] | A word character |
| \W | [^a-zA-Z0-9_] | A non-word character |
| \s | [ \t\n\x0B\f\r] | A whitespace character |
| \S | [^\s] | A non-whitespace character |

The uppercase form is always the negation of the lowercase form.
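One way to see what a class covers is to delete everything it matches and look at what is left. This is a small sketch under that idea; the class name CharClassDemo, the helper strip, and the sample string are assumptions made for illustration.

```java
class CharClassDemo {
    // Removes every character matched by `charClass`, leaving only the
    // characters the class does NOT match.
    static String strip(String charClass, String input) {
        return input.replaceAll(charClass, "");
    }

    public static void main(String[] args) {
        String sample = "Room_42 is\tfree!";
        System.out.println(strip("[^0-9]", sample)); // 42             (only digits survive)
        System.out.println(strip("\\W", sample));    // Room_42isfree  (underscore counts as \w)
        System.out.println(strip("\\s", sample));    // Room_42isfree! (space and tab removed)
    }
}
```

Note that the underscore survives the \W pass: it belongs to \w even though it is neither a letter nor a digit.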

Quantifiers: greedy, reluctant, possessive

A quantifier says how many times the preceding element may repeat. By default quantifiers are greedy — they grab as much as possible, then back off if the rest of the pattern needs it. Add ? to make a quantifier reluctant (match as little as possible), or + to make it possessive (grab and never give back).

| Quantifier | Meaning |
|---|---|
| * | Zero or more |
| + | One or more |
| ? | Zero or one (optional) |
| {n} | Exactly n |
| {n,} | At least n |
| {n,m} | Between n and m |

"\\d{3}"     // exactly three digits
"\\d{2,4}"   // two to four digits
"a+"         // one or more 'a'
"colou?r"    // matches "color" and "colour"
"<.+>"       // greedy:    on "<a><b>" matches the whole "<a><b>"
"<.+?>"      // reluctant: on "<a><b>" matches just "<a>"
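The possessive form can be compared with the other two modes on the same input. This is a minimal sketch; the class name QuantifierModesDemo and the helper firstMatch are illustrative assumptions, and only standard java.util.regex calls are used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModesDemo {
    // Returns the first substring of `input` matched by `regex`, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        System.out.println(firstMatch("<.+>", input));     // <a><b>  greedy: grabs all, backs off one step
        System.out.println(firstMatch("<.+?>", input));    // <a>     reluctant: stops at the first '>'
        System.out.println(firstMatch("<.++>", input));    // null    possessive: .++ keeps the final '>'
        System.out.println(firstMatch("<[^>]++>", input)); // <a>     possessive on a class that excludes '>'
    }
}
```

The possessive .++ consumes the closing > and refuses to give it back, so the pattern fails outright. Possessive quantifiers pay off when the repeated element cannot overlap what follows it, as in [^>]++.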

Anchors, boundaries, and alternation

Anchors match a position, not a character. ^ is the start of input (or of each line, in multiline mode), $ is the end, and \b is a word boundary: the zero-width spot where a \w character sits next to a \W character or next to the start or end of the input. Alternation with | matches either side.

"^Hello"      // "Hello" only at the start
"\\.txt$"     // ".txt" only at the end
"\\bcat\\b"   // "cat" as a whole word, not inside "category"
"cat|dog"     // "cat" or "dog"
"^(cat|dog)$" // the whole string is exactly "cat" or "dog"

Note that | has very low precedence: ^cat|dog$ means (^cat)|(dog$), not ^(cat|dog)$. Wrap alternatives in a group when you want anchors to apply to both.
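The precedence rule is easiest to see with find(), which does not anchor the pattern implicitly. A short sketch, where the class name AlternationPrecedenceDemo, the helper found, and the test strings are assumptions:

```java
import java.util.regex.Pattern;

class AlternationPrecedenceDemo {
    // True if `regex` matches anywhere in `input` (find semantics, no implicit anchoring).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^cat|dog$", "catnap"));   // true:  the "^cat" branch matches
        System.out.println(found("^cat|dog$", "hotdog"));   // true:  the "dog$" branch matches
        System.out.println(found("^(cat|dog)$", "catnap")); // false: both anchors apply to each word
        System.out.println(found("^(cat|dog)$", "dog"));    // true
    }
}
```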

Groups, backreferences, and inline flags

Parentheses create a capturing group — the engine remembers what each group matched, numbered left to right starting at 1. (?:...) is a non-capturing group when you only need to apply a quantifier. A backreference \1 matches the same text the first group captured. Inline flags like (?i) change matching behavior without a separate Pattern.compile flag.

"(\\d{4})-(\\d{2})"   // group 1 = year, group 2 = month
"(?:ab)+"             // repeats "ab" without capturing it
"(\\w+) \\1"          // a word followed by itself ("the the")
"(?i)java"            // case-insensitive: matches "Java", "JAVA"
"(?m)^line"           // multiline: ^ matches at each line start
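The patterns above only become useful once the captured text is read back. The sketch below does that through Matcher.group, uses a backreference in a replacement, and counts matches under the (?m) flag; the class name GroupDemo, its helper methods, and the sample inputs are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns {year, month} from text such as "2024-03", or null if it does not match.
    static String[] yearMonth(String input) {
        Matcher m = Pattern.compile("(\\d{4})-(\\d{2})").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    // Collapses an immediately repeated word ("the the") into one occurrence.
    // In the replacement string, $1 refers to the text captured by group 1.
    static String dedupe(String input) {
        return input.replaceAll("\\b(\\w+) \\1\\b", "$1");
    }

    // Counts how many times `regex` matches in `input`.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] parts = yearMonth("2024-03");
        System.out.println(parts[0] + " / " + parts[1]);  // 2024 / 03
        System.out.println(dedupe("the the cat"));        // the cat
        String text = "line one\nother\nline two";
        System.out.println(countMatches("^line", text));     // 1: ^ is start of input only
        System.out.println(countMatches("(?m)^line", text)); // 2: ^ is start of each line
    }
}
```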

A worked example: the constructs in action

This program exercises a digit class with a quantifier, anchored alternation, a backreference, greedy versus reluctant matching, the \w+ shorthand, and an inline case-insensitive flag — all against java.util.regex only.
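A minimal sketch of a program that exercises those constructs follows. The class name RegexSyntaxDemo, the helper findAll, and most of the input strings are assumptions chosen to produce the results discussed in the notes; only the patterns themselves come from this chapter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSyntaxDemo {
    // Collects capture group `group` of every match of `regex` in `input`, in order.
    // Group 0 is the whole match.
    static List<String> findAll(String regex, String input, int group) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        // Digit class with a quantifier; find() reports every match.
        System.out.println(findAll("\\d{4}", "Released 1995, updated 2011", 0)); // [1995, 2011]

        // matches() must consume the whole input.
        System.out.println(Pattern.matches("cat|dog", "dog"));    // true
        System.out.println(Pattern.matches("cat|dog", "catnap")); // false

        // Backreference: a word followed by the identical word.
        System.out.println(findAll("(\\w+) \\1", "the the cat is is here", 1)); // [the, is]

        // Greedy versus reluctant, first match only.
        System.out.println(findAll("<.+>", "<a><b>", 0).get(0));  // <a><b>
        System.out.println(findAll("<.+?>", "<a><b>", 0).get(0)); // <a>

        // Runs of word characters separated by non-word characters.
        System.out.println(findAll("\\w+", "ab, cd-ef!", 0).size()); // 3

        // Inline case-insensitive flag.
        System.out.println(Pattern.matches("(?i)java", "JAVA")); // true
    }
}
```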

What to take from the run:

  • \d{4} found both 1995 and 2011 because find() scans for every match in the input. A class plus a quantifier (\d repeated {4} times) is the canonical way to match a fixed-width field. The doubled backslash in "\\d{4}" is the Java string literal producing the single-backslash regex the engine wants.
  • Pattern.matches("cat|dog", "dog") returned true but the same pattern on "catnap" returned false. matches() implicitly anchors the whole input, so even though cat appears in catnap, the trailing nap is left unmatched and the overall match fails.
  • The backreference \1 turned (\w+) \1 into "a word followed by the identical word," which is why it reported the and is — the two stutters — and ignored every word that was not immediately repeated. Backreferences match captured text, not the pattern again.
  • On the same <a><b> input, greedy <.+> swallowed the entire string while reluctant <.+?> stopped at the first >, yielding just <a>. This contrast underlies one of the most common regex fixes: add ? to a quantifier when it grabs too much.
  • \w+ counted 3 tokens in "ab, cd-ef!" (ab, cd, and ef) because ,, -, and ! are all \W (non-word) characters that break a run of word characters. The inline (?i) flag then matched java against JAVA, showing flags can live inside the pattern itself rather than only in Pattern.compile.

Practice

On the input '<a><b>', why does the regex '<.+>' match the whole string '<a><b>' while '<.+?>' matches only '<a>'?