What is a regular expression?
A short reference on regular expressions — what a pattern actually describes, the building blocks (character classes, quantifiers, anchors, groups, lookarounds), the flags, how flavours differ across languages, and the catastrophic-backtracking footgun to avoid.
The three-line definition
A regular expression (regex) is a small pattern that describes a set of strings. You hand the pattern and some text to a regex engine, and the engine reports where in the text the pattern matches — and, optionally, which sub-parts it captured.
That is the whole idea. Everything below is the vocabulary you use to build the pattern, plus the two things that bite people: flavour differences between languages, and one performance footgun.
flowchart LR
P["pattern<br/>\d+"] --> E(("regex<br/>engine"))
T["text<br/>order 42, qty 7"] --> E
E --> M["matches<br/>42 and 7"]
In code, that looks like:
const text = "order 42, qty 7"
// \d+ means "one or more digits"; the g flag finds every match
const matches = text.match(/\d+/g)
// → ["42", "7"]
▶ Open the regex tester — paste any pattern from this page and watch it match live, with a token-by-token breakdown.
The building blocks
Literals and the dot
Most characters match themselves. cat matches the three letters c, a, t in order. The dot . is the wildcard — it matches any single character except a line break.
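A minimal sketch of literals and the dot in JavaScript. The sample strings and variable names are illustrative assumptions, not taken from any particular codebase.

```javascript
// A literal pattern matches its characters in order, anywhere in the text.
const hasCat = /cat/.test("concatenate"); // true: c, a, t appear in sequence

// The dot matches any single character except a line break.
const dotMatches = "cat cot c\nt".match(/c.t/g);

console.log(hasCat);     // true
console.log(dotMatches); // [ 'cat', 'cot' ]  ("c\nt" is skipped)
```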
Character classes
Square brackets match any one character from a set: [aeiou] matches a single vowel. A dash makes a range — [a-z], [0-9]. A leading caret negates the set — [^0-9] matches any character that is not a digit (character-class reference).
Three shorthands cover the common sets:
- \d matches any digit (same as [0-9])
- \w matches any word character (letter, digit, or underscore)
- \s matches any whitespace (space, tab, newline)
Their uppercase forms \D \W \S negate each.
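The sets and shorthands above can be tried side by side. This is a small sketch; the sample string is made up for illustration.

```javascript
const sample = "Unit 7, bay_3";

// A bracketed set matches one character from the set (case matters).
const vowels = sample.match(/[aeiou]/g);

// A negated set matches one character that is NOT in the set.
const nonDigitRuns = sample.match(/[^0-9]+/g);

// Shorthands: \d digit, \w word character, \S anything that is not whitespace.
const digits = sample.match(/\d/g);
const words = sample.match(/\w+/g);
const chunks = sample.match(/\S+/g);

console.log(vowels);       // [ 'i', 'a' ]
console.log(nonDigitRuns); // [ 'Unit ', ', bay_' ]
console.log(digits);       // [ '7', '3' ]
console.log(words);        // [ 'Unit', '7', 'bay_3' ]
console.log(chunks);       // [ 'Unit', '7,', 'bay_3' ]
```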
Quantifiers
Quantifiers say how many of the preceding token to match:
- * means zero or more
- + means one or more
- ? means zero or one (optional)
- {3} means exactly three
- {3,} means three or more
- {3,5} means between three and five
Quantifiers are greedy by default: they grab as much as they can. Add a ? to make them lazy so they stop at the first opportunity. On <b>hi</b>, the pattern <.+> greedily matches the whole string, while <.+?> lazily matches just <b>.
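A short sketch of the quantifiers and of the greedy versus lazy behaviour described above. The inputs other than the <b>hi</b> string are illustrative assumptions.

```javascript
// ? makes the preceding token optional.
const optional = "color colour".match(/colou?r/g);

// {3,5} takes between three and five digits, as many as it can.
const bounded = "12 123 123456".match(/\d{3,5}/g);

// Greedy .+ runs to the last >, lazy .+? stops at the first one.
const html = "<b>hi</b>";
const greedy = html.match(/<.+>/)[0];
const lazy = html.match(/<.+?>/)[0];

console.log(optional); // [ 'color', 'colour' ]
console.log(bounded);  // [ '123', '12345' ]
console.log(greedy);   // <b>hi</b>
console.log(lazy);     // <b>
```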
Anchors
Anchors match a position, not a character. ^ is the start of the string, $ is the end, and \b is a word boundary. \bcat\b matches the word cat but not the cat inside category.
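As a quick sketch of anchors in JavaScript (variable names and the digit-only check are illustrative):

```javascript
// \b pins the match to word boundaries on both sides.
const wholeWord = /\bcat\b/;
const inSentence = wholeWord.test("the cat sat");
const inCategory = wholeWord.test("category");

// ^ and $ together force the pattern to cover the whole string.
const allDigits = /^\d+$/;
const pureNumber = allDigits.test("12345");
const mixed = allDigits.test("123a");

console.log(inSentence, inCategory); // true false
console.log(pureNumber, mixed);      // true false
```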
Groups and capture
Parentheses group tokens and capture what they matched. (\d{4}) captures a four-digit run so you can read it back. Name a group with (?<name>...) to read it back by name instead of by number.
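Reading a capture back by number and by name might look like the sketch below. The input string is an assumption chosen for illustration.

```javascript
const built = "built in 1998";

// Numbered capture: index 0 is the whole match, index 1 is the first group.
const byNumber = built.match(/(\d{4})/)[1];

// Named capture: the same group, read back through .groups.
const byName = built.match(/(?<year>\d{4})/).groups.year;

console.log(byNumber); // 1998
console.log(byName);   // 1998
```

Note that match returns null when nothing matches, so production code would check the result before indexing into it.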
Alternation
A pipe | means OR. (cat|dog|fish) matches any one of the three. Wrap it in a group to bound the choice to part of the pattern.
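The effect of grouping on alternation is easiest to see with a pattern that goes wrong without it. The grey/gray patterns here are an illustrative assumption, not from the text above.

```javascript
// Any one of three whole words.
const pets = "a cat, a dog, a bird".match(/\b(cat|dog|fish)\b/g);

// Grouped: the choice is only between a and e.
const grouped = /^gr(a|e)y$/;
// Ungrouped: the pipe splits the whole pattern into "^gra" OR "ey$".
const ungrouped = /^gra|ey$/;

const groupedHoney = grouped.test("honey");
const ungroupedHoney = ungrouped.test("honey");

console.log(pets);           // [ 'cat', 'dog' ]
console.log(groupedHoney);   // false
console.log(ungroupedHoney); // true, because "honey" ends in "ey"
```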
Lookarounds
A lookahead (?=...) asserts what must follow without consuming it; a lookbehind (?<=...) asserts what must precede. \d+(?=€) matches the digits in 40€ but leaves the € out of the match. Negative forms (?!...) and (?<!...) assert the opposite.
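A sketch of all four lookaround forms, assuming an engine with lookbehind support (current Node.js and browsers). Inputs other than 40€ are made up for illustration.

```javascript
// Lookahead: digits that are followed by €, without including the €.
const euros = "40€".match(/\d+(?=€)/)[0];

// Lookbehind: digits that are preceded by $.
const dollars = "cost: $25".match(/(?<=\$)\d+/)[0];

// Negative lookahead: whole numbers NOT followed by €.
// The \b anchors stop the engine from backtracking to a partial number like "4".
const notEuros = "40€ 30$".match(/\b\d+\b(?!€)/g);

// Negative lookbehind: a digit NOT preceded by x.
const notAfterX = "x1 y2".match(/(?<!x)\d/g);

console.log(euros);     // 40
console.log(dollars);   // 25
console.log(notEuros);  // [ '30' ]
console.log(notAfterX); // [ '2' ]
```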
Flags
Flags change how the whole pattern is applied:
| Flag | Effect |
|---|---|
| g | global: find all matches, not just the first |
| i | case-insensitive |
| m | multiline: ^ and $ match at line breaks |
| s | dotall: . also matches newlines |
| u | unicode mode |
| y | sticky: match only from the current position |
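Each flag can be seen by running the same pattern with and without it. This is a sketch using JavaScript's flag letters; the test strings are illustrative assumptions.

```javascript
// g: every match instead of only the first.
const firstOnly = "a1 b2 c3".match(/\d/)[0];
const allDigitsFound = "a1 b2 c3".match(/\d/g);

// i: ignore case.
const caseInsensitive = /alpha/i.test("ALPHA");

// m: ^ and $ also match at each line break.
const perLine = "one\ntwo".match(/^\w+$/gm);

// s: the dot also matches a newline.
const dotWithoutS = /a.b/.test("a\nb");
const dotWithS = /a.b/s.test("a\nb");

// u: treat the text as code points, so one emoji counts as one character.
const emojiWithoutU = /^.$/.test("😀");
const emojiWithU = /^.$/u.test("😀");

// y: match only at lastIndex, with no scanning ahead.
const sticky = /\d/y;
sticky.lastIndex = 1;
const stickyAtOne = sticky.test("a1");
sticky.lastIndex = 0;
const stickyAtZero = sticky.test("a1");

console.log(firstOnly, allDigitsFound);  // 1 [ '1', '2', '3' ]
console.log(caseInsensitive, perLine);   // true [ 'one', 'two' ]
console.log(dotWithoutS, dotWithS);      // false true
console.log(emojiWithoutU, emojiWithU);  // false true
console.log(stickyAtOne, stickyAtZero);  // true false
```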
A worked example
(?<year>\d{4})-(?<month>\d{2}), read left to right:
- (?<year> starts a named capturing group called year
- \d{4} matches exactly four digits
- ) ends the group
- - is a literal hyphen
- (?<month> starts a named group called month
- \d{2} matches exactly two digits
- ) ends the group
flowchart LR
A["named group<br/>year"] --> B["\d{4}<br/>four digits"]
B --> C["literal<br/>hyphen"]
C --> D["named group<br/>month"]
D --> E["\d{2}<br/>two digits"]
Run with the g flag against the text "shipped 2026-05, refunded 2025-12", it matches twice, capturing year and month each time.
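In JavaScript, one way to collect both matches with their named groups is matchAll, which requires the g flag. The variable names below are illustrative.

```javascript
const datePattern = /(?<year>\d{4})-(?<month>\d{2})/g;
const log = "shipped 2026-05, refunded 2025-12";

// matchAll yields one match object per hit; .groups holds the named captures.
const dates = [...log.matchAll(datePattern)].map((m) => ({
  year: m.groups.year,
  month: m.groups.month,
}));

console.log(dates);
// [ { year: '2026', month: '05' }, { year: '2025', month: '12' } ]
```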
Flavours differ between languages
The core syntax above is shared by nearly every engine, but the edges vary. The most common surprises when moving a pattern between languages:
| Feature | JavaScript | PCRE / Perl | Python re | Go RE2 |
|---|---|---|---|---|
| Lookbehind | yes (2018+) | yes | yes (fixed-width) | no |
| Named group syntax | (?<name>) | (?<name>) | (?P<name>) | (?P<name>) |
| Backreferences | yes | yes | yes | no |
| Atomic groups (?>...) | no | yes | yes (3.11+) | no |
| Worst-case time | exponential | exponential | exponential | linear |
The reason there are so many flavours — and why the spelling drifts — is historical: regex started as a 1951 mathematical notation, became a Unix tool in the 1960s, was standardised by POSIX, extended by Perl, and split into the backtracking (PCRE) and linear-time (RE2) schools. The full story is in the history of the regular expression.
The footgun: catastrophic backtracking
Most regex engines (JavaScript, PCRE, Python, Java, .NET) use backtracking. On most patterns this is fine. On a pattern with nested quantifiers, a crafted input can force the engine to explore an exponential number of paths, so the match hangs and burns CPU. The classic example is ^(a+)*b run against a long string of a characters with no trailing b: the engine tries every way of splitting the run of a characters before giving up (regular-expressions.info).
The reason is that a backtracking engine tries every possible way to split the input across the nested quantifiers, and the number of ways grows exponentially with the input length:
flowchart TD
S["match (a+)* against aaaa…"] --> P1["first a+ takes 4"]
S --> P2["first a+ takes 3"]
S --> P3["first a+ takes 2"]
P2 --> Q1["next a+ takes 1"]
P3 --> Q2["next a+ takes 2"]
P3 --> Q3["next a+ takes 1, repeat"]
Q3 --> R1["…and so on, exponentially"]
When this happens on a server processing untrusted input, it is a denial-of-service vulnerability known as ReDoS. Catastrophic backtracking in a single firewall rule caused a roughly 30-minute global Cloudflare outage in July 2019. A linear-time engine like RE2 instead tracks every possible state in a single pass over the input, which is why it cannot blow up.
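To make the exponential growth concrete, the sketch below counts how many ways a run of n characters can be cut into consecutive non-empty pieces, which is the set of splits a backtracking engine may have to try for (a+)*. It is a simplified model, not a trace of any real engine, and countSplits is a hypothetical helper name.

```javascript
// Count the ways to cut a run of n characters into consecutive non-empty pieces.
function countSplits(n) {
  if (n === 0) return 1; // nothing left to split: one complete arrangement
  let ways = 0;
  // Let the first a+ take 1..n characters, then split the remainder.
  for (let first = 1; first <= n; first++) {
    ways += countSplits(n - first);
  }
  return ways;
}

const growth = [1, 2, 4, 8, 16].map(countSplits);
console.log(growth); // [ 1, 2, 8, 128, 32768 ]
```

Each extra character doubles the count, so a run of a few dozen characters is already out of reach.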
Two rules of thumb:
- Never run a backtracking regex with nested quantifiers such as (a+)* or (.*)* against untrusted input. Engines like Go's RE2 guarantee linear time and are immune.
- Do not parse JSON, CSV, HTML, or any nested format with regex. Use a real parser. Regex describes regular languages; nested structures are not regular.
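Both rules can be sketched briefly. A nested quantifier can often be rewritten without the nesting: ^(a+)*b accepts the same strings as ^a*b, which leaves the engine only one way to consume the run. For nested data, JavaScript's built-in JSON.parse does the job. The sample inputs and the order object are illustrative assumptions.

```javascript
// Rule 1: remove the nesting. Both patterns accept the same strings.
const risky = /^(a+)*b/;
const safe = /^a*b/;
const inputs = ["", "b", "ab", "aaab", "aaa", "ba", "aabx"];
const agree = inputs.every((s) => risky.test(s) === safe.test(s));

// Rule 2: use a real parser for nested formats.
const order = JSON.parse('{"id": 42, "items": [{"sku": "A1", "qty": 7}]}');
const firstQty = order.items[0].qty;

console.log(agree);    // true
console.log(firstQty); // 7
```

The inputs here are kept short on purpose: the risky pattern is only safe to run on strings this small.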
Try it
The companion regex tester & explainer runs patterns live in your browser: type a pattern, see every match highlighted, read the captured groups, and get a token-by-token explanation of what the pattern does. It includes short lessons for each category above and a searchable quick reference. It uses the JavaScript engine, so the flavour notes in the table apply.