← Learn··Updated 23 May 2026·4 min read

What is a regular expression?

A short reference on regular expressions — what a pattern actually describes, the building blocks (character classes, quantifiers, anchors, groups, lookarounds), the flags, how flavours differ across languages, and the catastrophic-backtracking footgun to avoid.

Programming
#regex
#regular-expression
#pattern-matching

The three-line definition

A regular expression (regex) is a small pattern that describes a set of strings. You hand the pattern and some text to a regex engine, and the engine reports where in the text the pattern matches — and, optionally, which sub-parts it captured.

That is the whole idea. Everything below is the vocabulary you use to build the pattern, plus the two things that bite people: flavour differences between languages, and one performance footgun.

flowchart LR
    P["pattern<br/>\d+"] --> E(("regex<br/>engine"))
    T["text<br/>order 42, qty 7"] --> E
    E --> M["matches<br/>42 and 7"]

In code, that looks like:

const text = "order 42, qty 7"

// \d+ means "one or more digits"; the g flag finds every match
const matches = text.match(/\d+/g)
// → ["42", "7"]
Open the regex tester — paste any pattern from this page and watch it match live, with a token-by-token breakdown.

The building blocks

Literals and the dot

Most characters match themselves. cat matches the three letters c, a, t in order. The dot . is the wildcard — it matches any single character except a line break.

Character classes

Square brackets match any one character from a set: [aeiou] matches a single vowel. A dash makes a range — [a-z], [0-9]. A leading caret negates the set — [^0-9] matches any character that is not a digit (character-class reference).

Three shorthands cover the common sets:

  • \d — any digit (same as [0-9])
  • \w — any word character (letter, digit, or underscore)
  • \s — any whitespace (space, tab, newline)

Their uppercase forms \D \W \S negate each.

Quantifiers

Quantifiers say how many of the preceding token to match:

  • * — zero or more
  • + — one or more
  • ? — zero or one (optional)
  • {3} — exactly three; {3,} — three or more; {3,5} — between three and five

Quantifiers are greedy by default: they grab as much as they can. Add a ? to make them lazy so they stop at the first opportunity. On <b>hi</b>, the pattern <.+> greedily matches the whole string, while <.+?> lazily matches just <b>.

Anchors

Anchors match a position, not a character. ^ is the start of the string, $ is the end, and \b is a word boundary. \bcat\b matches the word cat but not the cat inside category.

Groups and capture

Parentheses group tokens and capture what they matched. (\d{4}) captures a four-digit run so you can read it back. Name a group with (?<name>...) to read it back by name instead of by number.

Alternation

A pipe | means OR. (cat|dog|fish) matches any one of the three. Wrap it in a group to bound the choice to part of the pattern.

Lookarounds

A lookahead (?=...) asserts what must follow without consuming it; a lookbehind (?<=...) asserts what must precede. \d+(?=€) matches the digits in 40€ but leaves the out of the match. Negative forms (?!...) and (?<!...) assert the opposite.

Flags

Flags change how the whole pattern is applied:

Flag Effect
g global — find all matches, not just the first
i case-insensitive
m multiline — ^ and $ match at line breaks
s dotall — . also matches newlines
u unicode mode
y sticky — match only from the current position

A worked example

(?<year>\d{4})-(?<month>\d{2}), read left to right:

  • (?<year> — start a named capturing group called year
  • \d{4} — exactly four digits
  • ) — end the group
  • - — a literal hyphen
  • (?<month> — start a named group called month
  • \d{2} — exactly two digits
  • ) — end the group
flowchart LR
    A["named group<br/>year"] --> B["\d{4}<br/>four digits"]
    B --> C["literal<br/>hyphen"]
    C --> D["named group<br/>month"]
    D --> E["\d{2}<br/>two digits"]

Against shipped 2026-05, refunded 2025-12 it matches twice, capturing year and month each time.

Flavours differ between languages

The core syntax above is universal, but the edges vary by engine. The most common surprises when moving a pattern between languages:

Feature JavaScript PCRE / Perl Python re Go RE2
Lookbehind yes (2018+) yes yes (fixed-width) no
Named group syntax (?<name>) (?<name>) (?P<name>) (?P<name>)
Backreferences yes yes yes no
Atomic groups (?>...) no yes yes (3.11+) no
Worst-case time exponential exponential exponential linear

The reason there are so many flavours — and why the spelling drifts — is historical: regex started as a 1951 mathematical notation, became a Unix tool in the 1960s, was standardised by POSIX, extended by Perl, and split into the backtracking (PCRE) and linear-time (RE2) schools. The full story is in the history of the regular expression.

The footgun: catastrophic backtracking

Most regex engines (JavaScript, PCRE, Python, Java, .NET) use backtracking. On a well-formed pattern this is fine. On a pattern with nested quantifiers, a crafted input can force the engine to explore an exponential number of paths — the pattern hangs and burns CPU. The classic example is ^(a+)*b run against a long string of as with no trailing b: the engine tries every way of splitting the as before giving up (regular-expressions.info).

The reason is that a backtracking engine tries every possible way to split the input across the nested quantifiers, and the number of ways grows exponentially with the input length:

flowchart TD
    S["match (a+)* against aaaa…"] --> P1["first a+ takes 4"]
    S --> P2["first a+ takes 3"]
    S --> P3["first a+ takes 2"]
    P2 --> Q1["next a+ takes 1"]
    P3 --> Q2["next a+ takes 2"]
    P3 --> Q3["next a+ takes 1, repeat"]
    Q3 --> R1["…and so on, exponentially"]

A linear-time engine like RE2 explores the equivalent of a single path regardless of input, which is why it cannot blow up. When this happens on a server processing untrusted input, it is a denial-of-service vulnerability — ReDoS. It caused a 30-minute global Cloudflare outage in July 2019.

Two rules of thumb:

  1. Never run a backtracking regex with nested quantifiers ((a+)*, (.*)*) against untrusted input. Engines like Go's RE2 guarantee linear time and are immune.
  2. Do not parse JSON, CSV, HTML, or any nested format with regex. Use a real parser. Regex describes regular languages; nested structures are not regular.

Try it

The companion regex tester & explainer runs patterns live in your browser: type a pattern, see every match highlighted, read the captured groups, and get a token-by-token explanation of what the pattern does. It includes short lessons for each category above and a searchable quick reference. It uses the JavaScript engine, so the flavour notes in the table apply.