What is a regular expression?

The three-line definition

A regular expression (regex) is a small pattern that describes a set of strings. You hand the pattern and some text to a regex engine, and the engine reports where in the text the pattern matches — and, optionally, which sub-parts it captured.

That is the whole idea. Everything below is the vocabulary you use to build the pattern, plus the two things that bite people: flavour differences between languages, and one performance footgun.

flowchart LR
    P["pattern<br/>\d+"] --> E(("regex<br/>engine"))
    T["text<br/>order 42, qty 7"] --> E
    E --> M["matches<br/>42 and 7"]

In code, that looks like:

const text = "order 42, qty 7"

// \d+ means "one or more digits"; the g flag finds every match
const matches = text.match(/\d+/g)
// → ["42", "7"]

▶ Open the regex tester — paste any pattern from this page and watch it match live, with a token-by-token breakdown.

The building blocks

Literals and the dot

Most characters match themselves. cat matches the three letters c, a, t in order. The dot . is the wildcard — it matches any single character except a line break.
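A minimal sketch of both ideas, using made-up sample strings (the variable names are illustrative only):

```javascript
// A literal pattern matches the same characters in the same order,
// wherever they appear in the text.
const hasCat = /cat/.test("concatenate"); // "cat" sits inside the word

// The dot stands for any one character except a line break,
// so "c\nt" is not matched by c.t
const dotMatches = "cat cot c\nt".match(/c.t/g);
// → ["cat", "cot"]
```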

Character classes

Square brackets match any one character from a set: [aeiou] matches a single vowel. A dash makes a range — [a-z], [0-9]. A leading caret negates the set — [^0-9] matches any character that is not a digit (character-class reference).

Three shorthands cover the common sets:

\d — any digit (same as [0-9])
\w — any word character (letter, digit, or underscore)
\s — any whitespace (space, tab, newline)

Their uppercase forms \D \W \S negate each.
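As a rough illustration, here are the sets and shorthands applied to one invented sample string:

```javascript
const sample = "Room 12b, floor 3";

// A set: any one lowercase vowel
const vowels = sample.match(/[aeiou]/g);       // → ["o", "o", "o", "o"]

// A shorthand: any one digit
const digits = sample.match(/\d/g);            // → ["1", "2", "3"]

// A negated set, repeated: runs of characters that are not digits
const nonDigitRuns = sample.match(/[^0-9]+/g); // → ["Room ", "b, floor "]

// Word characters, repeated: letters, digits, and underscores
const words = sample.match(/\w+/g);            // → ["Room", "12b", "floor", "3"]
```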

Quantifiers

Quantifiers say how many of the preceding token to match:

* — zero or more
+ — one or more
? — zero or one (optional)
{3} — exactly three; {3,} — three or more; {3,5} — between three and five

Quantifiers are greedy by default: they grab as much as they can. Add a ? to make them lazy so they stop at the first opportunity. On <b>hi</b>, the pattern <.+> greedily matches the whole string, while <.+?> lazily matches just <b>.
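The greedy and lazy behaviour described above can be checked directly. The extra strings below are invented for the example:

```javascript
const html = "<b>hi</b>";

// Greedy: .+ runs to the end, then backs up to the last ">"
const greedy = html.match(/<.+>/)[0];   // → "<b>hi</b>"

// Lazy: .+? stops at the first ">" it can reach
const lazy = html.match(/<.+?>/)[0];    // → "<b>"

// ? makes the preceding token optional
const spellings = "color colour".match(/colou?r/g); // → ["color", "colour"]

// {3,5} accepts runs of three to five digits; \b keeps longer runs out
const midSized = "7 42 123 2026 99999".match(/\b\d{3,5}\b/g);
// → ["123", "2026", "99999"]
```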

Anchors

Anchors match a position, not a character. ^ is the start of the string, $ is the end, and \b is a word boundary. \bcat\b matches the word cat but not the cat inside category.
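A short sketch of the anchors in use. The onlyDigits pattern is an added illustration, not part of the text above:

```javascript
const wholeWord = /\bcat\b/;
const inSentence = wholeWord.test("the cat sat"); // true: "cat" stands alone
const inCategory = wholeWord.test("category");    // false: no boundary after "cat"

// ^ and $ together force the pattern to cover the entire string
const onlyDigits = /^\d+$/;
const allDigits = onlyDigits.test("12345");   // true
const withSpace = onlyDigits.test("123 45");  // false: the space breaks the run
```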

Groups and capture

Parentheses group tokens and capture what they matched. (\d{4}) captures a four-digit run so you can read it back. Name a group with (?<name>...) to read it back by name instead of by number.
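Reading captures back might look like this in JavaScript. The address and the group names user and host are invented for the example:

```javascript
const line = "contact: ana@example.com";

// Numbered groups: index 0 is the whole match, 1 and 2 are the captures
const byNumber = line.match(/(\w+)@(\w+)\.com/);
const whole = byNumber[0];      // "ana@example.com"
const firstGroup = byNumber[1]; // "ana"
const secondGroup = byNumber[2]; // "example"

// Named groups: the same captures, read from .groups by name
const byName = line.match(/(?<user>\w+)@(?<host>\w+)\.com/);
const user = byName.groups.user; // "ana"
const host = byName.groups.host; // "example"
```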

Alternation

A pipe | means OR. cat|dog|fish matches any one of the three words. Alternation has the lowest precedence of any operator, so wrap it in a group, as in (cat|dog|fish), to bound the choice to part of the pattern.
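The effect of grouping is easiest to see next to anchors. A small sketch with invented inputs:

```javascript
// Grouped alternation between word boundaries: whole words only
const pets = "a cat, a dogfish, a fish".match(/\b(cat|dog|fish)\b/g);
// → ["cat", "fish"]

// Without a group, the pipe splits the whole pattern:
// this reads as "^cat" OR "dog$"
const loose = /^cat|dog$/;
const looseHit = loose.test("catalog");   // true: it starts with "cat"

// With a group, the anchors apply to both alternatives
const strict = /^(cat|dog)$/;
const strictMiss = strict.test("catalog"); // false
const strictHit = strict.test("dog");      // true
```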

Lookarounds

A lookahead (?=...) asserts what must follow without consuming it; a lookbehind (?<=...) asserts what must precede. \d+(?=€) matches the digits in 40€ but leaves the € out of the match. Negative forms (?!...) and (?<!...) assert the opposite.
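A sketch of all four forms, using invented price strings. The \b anchors in the negative examples are an addition: without them the engine can satisfy a negative lookahead by matching only part of a number (the 4 of 40€, say).

```javascript
const prices = "40€ and 15$";

// Lookahead: digits that are followed by €, with € left out of the match
const euros = prices.match(/\d+(?=€)/g);            // → ["40"]

// Negative lookahead: whole numbers NOT followed by €
const notEuros = prices.match(/\b\d+\b(?!€)/g);     // → ["15"]

const order = "cost: $25, qty 3";

// Lookbehind: digits that are preceded by a dollar sign
const dollars = order.match(/(?<=\$)\d+/g);         // → ["25"]

// Negative lookbehind: whole numbers NOT preceded by a dollar sign
const plain = order.match(/(?<!\$)\b\d+\b/g);       // → ["3"]
```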

Flags

Flags change how the whole pattern is applied:

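| Flag | Effect |
| --- | --- |
| `g` | global: find all matches, not just the first |
| `i` | case-insensitive |
| `m` | multiline: `^` and `$` also match at the start and end of each line |
| `s` | dotall: `.` also matches newlines |
| `u` | unicode mode: match by code point and enable `\u{...}` and `\p{...}` |
| `y` | sticky: match only at the regex's current `lastIndex` position |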
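The flags are easiest to compare side by side. A rough sketch using JavaScript's flag letters and an invented two-line string:

```javascript
const log = "Error: disk\nerror: net";

const caseSensitive = log.match(/error/g);   // → ["error"]
const ignoreCase = log.match(/error/gi);     // → ["Error", "error"]

// Without m, ^ matches only at the very start of the string
const stringStart = log.match(/^error/gi);   // → ["Error"]
const lineStarts = log.match(/^error/gim);   // → ["Error", "error"]

// Without s, the dot refuses to cross the line break
const dotDefault = /disk.error/.test(log);   // false
const dotAll = /disk.error/s.test(log);      // true

// Without u, an emoji is two UTF-16 code units, so one dot cannot cover it
const emoji = "\u{1F600}";
const dotNoUnicode = /^.$/.test(emoji);      // false
const dotUnicode = /^.$/u.test(emoji);       // true

// Sticky: the match must begin exactly at lastIndex
const sticky = /\d+/y;
const stickyAtStart = sticky.test("order 42"); // false: index 0 is "o"
sticky.lastIndex = 6;                          // index of the "4"
const stickyAtDigits = sticky.test("order 42"); // true
```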

A worked example

(?<year>\d{4})-(?<month>\d{2}), read left to right:

(?<year> — start a named capturing group called year
\d{4} — exactly four digits
) — end the group
- — a literal hyphen
(?<month> — start a named group called month
\d{2} — exactly two digits
) — end the group

flowchart LR
    A["named group<br/>year"] --> B["\d{4}<br/>four digits"]
    B --> C["literal<br/>hyphen"]
    C --> D["named group<br/>month"]
    D --> E["\d{2}<br/>two digits"]

Against shipped 2026-05, refunded 2025-12 it matches twice, capturing year and month each time.
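In JavaScript, one way to collect both matches is matchAll with the g flag. The variable names here are illustrative:

```javascript
const datePattern = /(?<year>\d{4})-(?<month>\d{2})/g;
const input = "shipped 2026-05, refunded 2025-12";

// matchAll yields one match object per hit; .groups holds the named captures
const dates = [...input.matchAll(datePattern)].map((m) => ({
  year: m.groups.year,
  month: m.groups.month,
}));
// → [{ year: "2026", month: "05" }, { year: "2025", month: "12" }]
```

The captures come back as strings, so "05" keeps its leading zero until you convert it with Number.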

Flavours differ between languages

Most of the core syntax above is shared across engines, but the edges vary. The most common surprises when moving a pattern between languages:

| Feature | JavaScript | PCRE / Perl | Python `re` | Go RE2 |
| --- | --- | --- | --- | --- |
| Lookbehind | yes (2018+) | yes | yes (fixed-width) | no |
| Named group syntax | `(?<name>)` | `(?<name>)` | `(?P<name>)` | `(?P<name>)` |
| Backreferences | yes | yes | yes | no |
| Atomic groups `(?>...)` | no | yes | yes (3.11+) | no |
| Worst-case time | exponential | exponential | exponential | linear |
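To make the named-group row concrete, here is a deliberately naive sketch of porting a JavaScript pattern to Python's spelling. The helper toPythonNamedGroups is hypothetical, not a library function. It ignores escapes and character classes, and it does not translate \k<name> backreferences.

```javascript
// Hypothetical helper: rewrite "(?<name>" as "(?P<name>" for Python's re module.
// Naive by design: a literal "\(?<" in the pattern would be rewritten too.
function toPythonNamedGroups(jsSource) {
  // "(?<=" and "(?<!" are lookbehinds, so the lookahead skips those.
  return jsSource.replace(/\(\?<(?![=!])/g, "(?P<");
}

const jsDate = /(?<year>\d{4})-(?<month>\d{2})/;
const pyDate = toPythonNamedGroups(jsDate.source);
// → "(?P<year>\d{4})-(?P<month>\d{2})"

// Lookbehinds pass through untouched
const pyLookbehind = toPythonNamedGroups(/(?<=\$)\d+/.source);
// → "(?<=\$)\d+"
```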

The reason there are so many flavours, and why the spelling drifts, is historical: regex started as a 1951 mathematical notation, was built into text editors in the late 1960s and into Unix tools such as ed and grep in the early 1970s, was standardised by POSIX, extended by Perl, and split into the backtracking (PCRE) and linear-time (RE2) schools. The full story is in the history of the regular expression.

The footgun: catastrophic backtracking

Most regex engines (JavaScript, PCRE, Python, Java, .NET) use backtracking. On a well-formed pattern this is fine. On a pattern with nested quantifiers, a crafted input can force the engine to explore an exponential number of paths, so the match hangs and burns CPU. The classic example is ^(a+)*b run against a long string of as with no trailing b (regular-expressions.info).

With no b to find, the engine must try every way of splitting the run of as between the inner a+ and the outer * before it can report failure. A run of n as can be split in 2^(n-1) ways, so each extra character roughly doubles the work:

flowchart TD
    S["match (a+)* against aaaa…"] --> P1["first a+ takes 4"]
    S --> P2["first a+ takes 3"]
    S --> P3["first a+ takes 2"]
    P2 --> Q1["next a+ takes 1"]
    P3 --> Q2["next a+ takes 2"]
    P3 --> Q3["next a+ takes 1, repeat"]
    Q3 --> R1["…and so on, exponentially"]

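A linear-time engine like RE2 never backtracks: it reads each character once while tracking every state the pattern could be in, so the work grows in proportion to the input and cannot blow up. When catastrophic backtracking hits a server processing untrusted input, it is a denial-of-service vulnerability known as ReDoS. A single backtracking-heavy regex in a firewall rule caused a global Cloudflare outage of roughly 27 minutes on 2 July 2019.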
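To see the growth first-hand on a backtracking engine such as the one in Node.js or a browser, a small timing harness is enough. This is a sketch: the helper name timeFailure and the chosen lengths are arbitrary, and the lengths are kept short on purpose, because pushing n much higher can freeze the process.

```javascript
const nested = /^(a+)*b/;

// Time one failing match against n copies of "a" (no trailing "b").
function timeFailure(n) {
  const input = "a".repeat(n);
  const start = Date.now();
  const matched = nested.test(input); // must fail, but only after backtracking
  return { n, matched, ms: Date.now() - start };
}

// Each step adds four characters, so expect the time to climb steeply.
const timings = [12, 16, 20].map(timeFailure);
console.table(timings);
```

Exact timings depend on the machine and engine version, so none are shown here. The point is the trend from row to row.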

Two rules of thumb:

Never run a backtracking regex with nested quantifiers ((a+)*, (.*)*) against untrusted input. Rewrite the pattern so that no quantifier wraps another, cap the input length, or switch to an engine that guarantees linear time, such as RE2 or Go's regexp package.
Do not parse JSON, CSV, HTML, or any other nested or quoted format with regex. Use a real parser. Regex describes regular languages; arbitrarily nested structures are not regular.
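Both rules can be sketched in a few lines. The length limit and the helper name safeTest are assumptions made for illustration; pick a limit that fits your own input.

```javascript
// Rule 1: ^(a+)*b accepts the same strings as ^a*b,
// and the rewrite has no nested quantifier left to backtrack through.
const safePattern = /^a*b/;

// Assumed limit for illustration; reject oversized input before matching.
const MAX_INPUT_LENGTH = 1000;

function safeTest(input) {
  if (input.length > MAX_INPUT_LENGTH) return false;
  return safePattern.test(input);
}

const shortHit = safeTest("aaab");                  // true
const longMiss = safeTest("a".repeat(500));         // false, and fails quickly
const tooLong = safeTest("a".repeat(5000) + "b");   // false: over the limit

// Rule 2: for nested data, reach for the parser, not a pattern.
const parsed = JSON.parse('{"order": {"items": [42, 7]}}');
const secondItem = parsed.order.items[1];           // 7
```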

Try it

The companion regex tester & explainer runs patterns live in your browser: type a pattern, see every match highlighted, read the captured groups, and get a token-by-token explanation of what the pattern does. It includes short lessons for each category above and a searchable quick reference. It uses the JavaScript engine, so the flavour notes in the table apply.