Regex Basics for Real Work
Regular expressions are not useful because they look clever. They are useful because they help you find, extract, validate, or clean text patterns quickly. The trick is to keep the pattern as small and explicit as possible for the job you actually need to do.
Three Core Building Blocks
| Concept | Example | What it does |
|---|---|---|
| Character class | [a-z] | Matches one lowercase letter |
| Quantifier | \d{4} | Matches exactly four digits |
| Anchor | ^error | Matches "error" only at the start of the line |
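As a rough illustration, the three building blocks can be tried with Python's standard-library re module. The sample strings here are made up for the example.

```python
import re

# Character class: [a-z] matches exactly one lowercase ASCII letter.
first_lower = re.search(r"[a-z]", "ABCdEF").group()

# Quantifier: \d{4} matches exactly four consecutive digits.
year = re.search(r"\d{4}", "Released in 2024").group()

# Anchor: ^error matches only at the start of the string
# (or at the start of each line when re.MULTILINE is set).
starts_with_error = re.search(r"^error", "error: disk full") is not None
mid_line_error = re.search(r"^error", "fatal error") is not None

print(first_lower, year, starts_with_error, mid_line_error)
# d 2024 True False
```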
Character Classes in Detail
Character classes define which characters are allowed at a given position. Regex engines provide predefined shorthand classes for the most common sets, and you can build custom ranges for anything else.
Predefined Character Classes
| Shorthand | Equivalent | Matches |
|---|---|---|
| \d | [0-9] | Any digit |
| \D | [^0-9] | Any non-digit |
| \w | [a-zA-Z0-9_] | Word character (letter, digit, underscore) |
| \W | [^a-zA-Z0-9_] | Non-word character |
| \s | [ \t\n\r\f\v] | Any whitespace character |
| \S | [^ \t\n\r\f\v] | Any non-whitespace character |
| . | Almost everything | Any character except newline (unless dotall mode is on) |
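A short Python sketch of the shorthand classes, assuming the standard-library re module and an invented sample line. One engine-specific caveat: for str patterns in Python 3, \d, \w, and \s also match non-ASCII digits, letters, and whitespace, so the ASCII equivalents in the table hold exactly only when the re.ASCII flag is passed.

```python
import re

line = "user_42 logged in\tat 09:15"

digits = re.findall(r"\d+", line)   # runs of digits
words = re.findall(r"\w+", line)    # runs of word characters
fields = re.split(r"\s+", line)     # split on any whitespace run

# The dot skips newlines unless re.DOTALL is passed.
dot_default = re.findall(r"a.b", "a\nb a-b")
dot_all = re.findall(r"a.b", "a\nb a-b", flags=re.DOTALL)

# U+0663 is an Arabic-Indic digit: matched by \d by default, not with re.ASCII.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_digit = re.fullmatch(r"\d", "\u0663", flags=re.ASCII) is not None

print(digits)  # ['42', '09', '15']
print(fields)  # ['user_42', 'logged', 'in', 'at', '09:15']
```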
Custom Ranges
Square brackets let you define exactly which characters to allow. A caret inside the brackets negates the set.
- [aeiou] matches any lowercase vowel
- [A-Fa-f0-9] matches a hexadecimal digit
- [^0-9] matches anything that is not a digit
- [a-zA-Z] matches any ASCII letter regardless of case
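The custom ranges above can be exercised in Python as follows. This is a minimal sketch; the hex-colour pattern and the sample text are illustrative assumptions, not part of the list above.

```python
import re

# [A-Fa-f0-9] repeated six times after a literal '#': a hex colour shape.
HEX_COLOR = re.compile(r"#[A-Fa-f0-9]{6}")
text = "Primary #1A2b3C, accent #ff00aa, invalid #12G45Z"
colors = HEX_COLOR.findall(text)

# [aeiou] picks out each lowercase vowel.
vowels = re.findall(r"[aeiou]", "regular expression")

# [^0-9] is the negated set: remove everything that is not a digit.
digits_only = re.sub(r"[^0-9]", "", "(555) 010-4477")

print(colors)       # ['#1A2b3C', '#ff00aa']
print(digits_only)  # 5550104477
```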
Quantifier Cheat Sheet
Quantifiers control how many times the preceding element must appear. By default they are greedy, meaning they match as much text as possible. Append ? to make any quantifier lazy (match as little as possible).
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| * | Zero or more | go*d | "gd", "god", "good", "goood" |
| + | One or more | go+d | "god", "good", "goood" (not "gd") |
| ? | Zero or one | colou?r | "color", "colour" |
| {n} | Exactly n | \d{4} | Exactly four digits like "2024" |
| {n,} | n or more | \w{3,} | Words with three or more characters |
| {n,m} | Between n and m | [a-z]{2,5} | Two to five lowercase letters |
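To check the table rows against real input, here is a small Python sketch. It uses re.fullmatch so that each candidate string must match in its entirety; the candidate lists are invented for the example.

```python
import re

candidates = ["gd", "god", "good", "goood"]

# * allows zero o's, + requires at least one.
star = [s for s in candidates if re.fullmatch(r"go*d", s)]
plus = [s for s in candidates if re.fullmatch(r"go+d", s)]

# ? makes the preceding 'u' optional, but allows it at most once.
spellings = ["color", "colour", "colouur"]
optional = [s for s in spellings if re.fullmatch(r"colou?r", s)]

# {2,5} bounded by \b: whole lowercase words of two to five letters.
short_words = re.findall(r"\b[a-z]{2,5}\b", "a to the regex engine")

print(star)         # ['gd', 'god', 'good', 'goood']
print(plus)         # ['god', 'good', 'goood']
print(short_words)  # ['to', 'the', 'regex']
```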
Grouping and Capturing
Parentheses serve two purposes: they group elements for quantifiers, and they capture the matched text so you can reference it later (in replacements or in code).
Capturing Groups
// Pattern: (\d{4})-(\d{2})-(\d{2})
// Input: 2024-03-15
// Group 1: 2024
// Group 2: 03
// Group 3: 15
Each pair of parentheses creates a numbered group. Group 0 is always the entire match. In a replacement string, you reference groups with $1, $2, etc. (or \1, \2 depending on the engine).
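In Python, the same date pattern might be used as in the sketch below. Python's re.sub is one of the engines that writes group references as \1, \2 (or \g<1>) rather than $1, $2. The sample sentence is an assumption for illustration.

```python
import re

DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

m = DATE.search("Shipped on 2024-03-15")
whole = m.group(0)             # group 0 is the entire match
year, month, day = m.groups()  # groups 1..3 as a tuple

# Reorder the captured parts in a replacement string.
us_style = DATE.sub(r"\2/\3/\1", "Shipped on 2024-03-15")

print(whole)     # 2024-03-15
print(us_style)  # Shipped on 03/15/2024
```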
Non-Capturing Groups
When you need grouping for alternation or quantifiers but do not need the captured value, use (?:...). This keeps your group numbering uncluttered and, in engines that store captured text, can be slightly faster.
// Capturing: (https?|ftp)://
// Non-capturing: (?:https?|ftp)://
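The effect on group numbering is easiest to see side by side. In this Python sketch a second group, (\S+), is added to both patterns to capture the rest of the URL; that group and the sample URL are assumptions made for the example.

```python
import re

url = "ftp://files.example.com"

capturing = re.match(r"(https?|ftp)://(\S+)", url)
non_capturing = re.match(r"(?:https?|ftp)://(\S+)", url)

# With a capturing scheme group, the host is group 2.
print(capturing.groups())      # ('ftp', 'files.example.com')
# With (?:...), the scheme is not stored and the host becomes group 1.
print(non_capturing.groups())  # ('files.example.com',)
```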
Lookahead and Lookbehind
Lookarounds assert that something exists before or after the current position without including it in the match. They are zero-width: they check a condition but consume no characters.
| Type | Syntax | Meaning |
|---|---|---|
| Positive lookahead | X(?=Y) | Match X only if followed by Y |
| Negative lookahead | X(?!Y) | Match X only if NOT followed by Y |
| Positive lookbehind | (?<=Y)X | Match X only if preceded by Y |
| Negative lookbehind | (?<!Y)X | Match X only if NOT preceded by Y |
A practical example: \d+(?= USD) matches a number only when it is immediately followed by " USD". In "Price: 150 USD" it matches "150"; in "150 items" it matches nothing.
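A Python sketch of lookarounds in use. The lookbehind examples and sample strings are illustrative additions; note that Python's re module requires lookbehind patterns to have a fixed width, which these do.

```python
import re

# Positive lookahead: digits only when followed by " USD".
usd_amounts = re.findall(r"\d+(?= USD)", "Price: 150 USD for 150 items")

# Positive lookbehind: digits only when preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", "Cost $40, qty 12")

# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
# The \b stops the engine from matching the tail of "40" as "0".
not_dollar = re.findall(r"(?<!\$)\b\d+", "Cost $40, qty 12")

print(usd_amounts)   # ['150']
print(after_dollar)  # ['40']
print(not_dollar)    # ['12']
```

In each case the lookaround text (" USD" or "$") is checked but left out of the match.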
Useful Beginner Patterns
- Email-like match: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
- ISO date: \d{4}-\d{2}-\d{2}
- Simple HTML tag match: <[^>]+>
- US phone number: \(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}
- IPv4 address: \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
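These patterns match shapes, not valid values: the IPv4 pattern, for instance, also accepts 999.1.1.1. One way to handle that, sketched below in Python, is to let the regex find candidates and check the numeric range in code. The helper name find_ipv4 and the sample log line are hypothetical.

```python
import re

IPV4_SHAPE = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

def find_ipv4(text):
    """Return dotted quads whose four octets are all in the range 0..255."""
    return [
        candidate
        for candidate in IPV4_SHAPE.findall(text)
        if all(int(octet) <= 255 for octet in candidate.split("."))
    ]

log = "ok 192.168.1.10, bad 999.1.1.1, ok 10.0.0.1"
shape_matches = IPV4_SHAPE.findall(log)
valid = find_ipv4(log)

print(shape_matches)  # ['192.168.1.10', '999.1.1.1', '10.0.0.1']
print(valid)          # ['192.168.1.10', '10.0.0.1']
```

Keeping the pattern simple and doing the range check in code is usually easier to read than encoding 0 to 255 in the regex itself.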
How to Practice Safely
Test the pattern against a tiny sample first, then expand to more realistic input. Use the Regex Tester Online to see what matches and how the pattern behaves before you run it on full logs, exports, or copied content.
Practical Exercises
Try these challenges in the Regex Tester. Each one is followed by a solution.
Challenge 1: Extract Prices
Given the text "Items cost $12.99, $3.50, and $149.00", write a pattern that captures all dollar amounts including the decimal portion.
Solution
\$\d+\.\d{2}
The escaped \$ matches a literal dollar sign. \d+ matches one or more digits before the decimal, and \.\d{2} matches the period plus exactly two decimal digits.
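The solution can be checked in Python with re.findall. Converting the matches to floats is an extra step added for illustration.

```python
import re

text = "Items cost $12.99, $3.50, and $149.00"

prices = re.findall(r"\$\d+\.\d{2}", text)

# Drop the leading "$" to get numeric values.
amounts = [float(price[1:]) for price in prices]

print(prices)   # ['$12.99', '$3.50', '$149.00']
print(amounts)  # [12.99, 3.5, 149.0]
```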
Challenge 2: Validate a Username
Match usernames that are 3 to 16 characters long and contain only letters, digits, and underscores. The first character must be a letter.
Solution
^[a-zA-Z]\w{2,15}$
The anchor ^ ensures we start at the beginning. [a-zA-Z] forces a letter first. \w{2,15} allows 2 to 15 more word characters, giving a total length of 3 to 16. The $ anchor prevents trailing characters.
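A possible Python check for this solution is sketched below. Two choices are assumptions specific to Python: re.ASCII restricts \w to [a-zA-Z0-9_], and fullmatch is used because Python's $ also matches just before a trailing newline. The function name is_valid_username is hypothetical.

```python
import re

# re.ASCII keeps \w to letters, digits, and underscore only.
USERNAME = re.compile(r"^[a-zA-Z]\w{2,15}$", re.ASCII)

def is_valid_username(name):
    """True if name is 3 to 16 word characters and starts with a letter."""
    return USERNAME.fullmatch(name) is not None

samples = ["jane_doe", "al", "1abc", "a" * 16, "a" * 17, "abc\n"]
results = {name: is_valid_username(name) for name in samples}

for name, ok in results.items():
    print(repr(name), ok)
```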
Challenge 3: Find Lines Without a Keyword
Match entire lines that do NOT contain the word "error" (case-insensitive).
Solution
^(?!.*error).*$
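The negative lookahead (?!.*error) at the start of the line asserts that "error" does not appear anywhere on the line. If the assertion passes, .*$ matches the full line. Use the i flag for case insensitivity, and the m (multiline) flag so that ^ and $ anchor at every line rather than only at the start and end of the input. Note that the pattern rejects any line containing the letters "error", including inside a longer word such as "terror"; write \berror\b inside the lookahead to reject only the whole word.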
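In Python, the solution needs both re.IGNORECASE and re.MULTILINE to work line by line on a multi-line string. A minimal sketch with an invented log:

```python
import re

log = "INFO start\nError: disk full\nWARN slow\nfatal ERROR\nINFO done"

# MULTILINE makes ^ and $ match at every line boundary;
# IGNORECASE makes "error" also cover "Error" and "ERROR".
NO_ERROR = re.compile(r"^(?!.*error).*$", re.IGNORECASE | re.MULTILINE)

clean_lines = NO_ERROR.findall(log)

print(clean_lines)  # ['INFO start', 'WARN slow', 'INFO done']
```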
Challenge 4: Swap First and Last Name
Given "Doe, Jane", rearrange to "Jane Doe" using a replacement pattern.
Solution
// Pattern: (\w+),\s*(\w+)
// Replacement: $2 $1
Group 1 captures the last name, group 2 captures the first name. The replacement reverses the order and drops the comma.
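Applied to several names at once in Python, the swap might look like this. Python writes the replacement as \2 \1 instead of $2 $1; the list of names is made up for the example.

```python
import re

LAST_FIRST = re.compile(r"(\w+),\s*(\w+)")

names = ["Doe, Jane", "Lovelace,Ada", "Turing,   Alan"]

# \s* absorbs any spacing after the comma, so all three normalise the same way.
swapped = [LAST_FIRST.sub(r"\2 \1", name) for name in names]

print(swapped)  # ['Jane Doe', 'Ada Lovelace', 'Alan Turing']
```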
Common Beginner Mistakes
- Using .* when a tighter class would be safer
- Forgetting anchors and matching more than expected
- Escaping too little or too much
- Testing only on perfect input instead of real-world messy text
- Using greedy quantifiers when lazy ones would prevent over-matching
- Forgetting that . does not match newlines by default
Frequently Asked Questions
What is the difference between greedy and lazy matching?
A greedy quantifier like .* consumes as much text as possible while still allowing the overall pattern to match. A lazy quantifier like .*? consumes as little as possible. For example, given <b>one</b> and <b>two</b>, the greedy pattern <b>.*</b> matches everything from the first <b> to the last </b>, while the lazy <b>.*?</b> matches each tag pair individually.
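The difference is easy to confirm in Python using the same input as in the answer above:

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the end, then backs off to the LAST </b>.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? stops at the FIRST </b> it can reach.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```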
When should I use non-capturing groups?
Use (?:...) when you need to group for alternation or repetition but do not need the matched text for back-references or replacements. This keeps your group numbering clean and avoids a minor performance cost in engines that store captured content.
Are lookaheads supported in all regex engines?
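Positive and negative lookaheads are supported in JavaScript, Python, Java, .NET, and most other backtracking engines. Lookbehinds are now widely supported as well, but some engines lack them (JavaScript before ES2018, for example) and some restrict them to fixed-width patterns (Python's re module, for example). Engines designed for linear-time matching, such as RE2 and Go's regexp package, support neither. Always test in your target environment.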
How do I match a literal special character like a dot or bracket?
Precede the character with a backslash: \. matches a literal period, \[ matches a literal bracket. Inside a character class, most special characters lose their meaning, so [.] also matches a literal dot.
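When the literal text comes from a variable, escaping by hand is error-prone. Python's standard library offers re.escape for that case; the sketch below compares an unescaped and an escaped pattern. The sample strings are assumptions.

```python
import re

# Both forms match a literal dot.
backslash_dots = re.findall(r"\.", "v1.2.3")
class_dots = re.findall(r"[.]", "v1.2.3")

literal = "price[1].usd"

# Unescaped, [1] is a character class and . is a wildcard,
# so the pattern wrongly matches unrelated text.
unescaped_hit = re.search(literal, "price1Xusd") is not None

# re.escape backslash-escapes the special characters for us.
escaped = re.escape(literal)
escaped_false_hit = re.search(escaped, "price1Xusd") is not None
escaped_true_hit = re.search(escaped, "see price[1].usd here") is not None

print(unescaped_hit, escaped_false_hit, escaped_true_hit)  # True False True
```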
Can I use regex to parse full HTML or XML documents?
No. Regex cannot handle nested, recursive structures reliably. For quick extraction of a single tag or attribute in known-clean markup, a simple pattern works. For anything more complex, use a proper parser. The HTML Cleaner handles stripping and sanitization without regex.
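As one example of the parser route, Python's standard library includes html.parser. The sketch below strips tags from nested markup by collecting only the text nodes; the class name TextExtractor and the helper strip_tags are hypothetical, and this is not a sanitizer for untrusted input.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside a tag.
        self.chunks.append(data)

def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)

nested = "<div><p>Hello <b>nested <i>world</i></b></p></div>"
plain = strip_tags(nested)

print(plain)  # Hello nested world
```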
Next Step
Once you understand the basics, move to the regex debugging guide for greedy matches, multiline input, escaping problems, and performance issues. For a broader theory overview, see the regular expressions explained guide.
Related Tools
- Regex Tester Online for live matching and plain-language explanations
- Remove Line Breaks to clean copied text before testing patterns
- Remove Duplicate Lines for post-match cleanup workflows
- HTML Cleaner when regex is overkill for tag stripping
Related Guides
- Regex Debugging — fixing greedy matches, escaping, and performance
- Regular Expressions Explained — deeper theory and advanced features
- Text Cleaning — the broader workflow where regex plays a key role
- Data Cleaning Best Practices — applying regex in data pipelines