Is it worth validating every RFC 5322 edge case?

No. Edge cases such as quoted strings, parenthesised comments, and folded whitespace are theoretically valid but occur with infinitesimal frequency in the wild. In a sample of 5.28 million addresses, exactly one contained Unicode characters. Scoping the regex to real-world inputs is the simplest thing that could possibly work.

Email Validation Regex That Works in the Real World

Name: Character distribution of real-world email mailbox names
Creator: Adam Z. Wasserman

Live tester

Type or paste an address. The matched substring is split into its capture groups, and the verdict updates as you type. Flip the speed: Realistic runs the practical pattern that covers 99.999% of real addresses; Ludicrous runs the full RFC 5321 compliance attempt, IPv4 and IPv6 address literals included. This switch drives the tester and the test suite below.


The email regex in every language

The realistic pattern, ready to paste, in the language you are working in. Same rule everywhere: ASCII alphanumeric, single dots, and dashes in the mailbox, then a DNS domain. A few languages need real care with anchoring or lookahead, and the snippets below handle it.

JavaScript / TypeScript

const EMAIL = /^[A-Za-z0-9](?:[A-Za-z0-9-]|\.(?!\.)){0,62}[A-Za-z0-9]@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}$/;

EMAIL.test(value);   // true / false

Python (use fullmatch, not match)

import re

EMAIL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]|\.(?!\.)){0,62}[A-Za-z0-9]@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}")

bool(EMAIL.fullmatch(value))   # fullmatch anchors both ends

Java (note the doubled backslashes)

import java.util.regex.Pattern;

static final Pattern EMAIL = Pattern.compile(
  "^[A-Za-z0-9](?:[A-Za-z0-9-]|\\.(?!\\.)){0,62}[A-Za-z0-9]@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\\.)+[A-Za-z]{2,63}$");

EMAIL.matcher(value).matches();

Go (RE2 has no lookahead, so this is a structural variant)

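import "regexp"

// Go's regexp (RE2) does not support (?!...). Forbid double dots by
// structure instead: alphanumeric runs separated by single . or -.
// Unlike the lookahead version, this also rejects doubled dashes and
// does not cap the mailbox or label length.
var email = regexp.MustCompile(`^[A-Za-z0-9]+(?:[.-][A-Za-z0-9]+)*@(?:[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*\.)+[A-Za-z]{2,63}$`)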

email.MatchString(value)

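PHP (PCRE, lookahead supported; anchor the end with \z)

// In PCRE, $ also matches just before a trailing newline, so "a@b.com\n"
// would pass. \z matches only at the very end of the subject.
$re = '/^[A-Za-z0-9](?:[A-Za-z0-9-]|\.(?!\.)){0,62}[A-Za-z0-9]@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}\z/';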

(bool) preg_match($re, $value);

C# / .NET (verbatim string; anchor the end with \z)

using System.Text.RegularExpressions;

// In .NET, $ also matches just before a final newline; \z does not.
Regex.IsMatch(value,
  @"^[A-Za-z0-9](?:[A-Za-z0-9-]|\.(?!\.)){0,62}[A-Za-z0-9]@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}\z");

Ruby (use \A and \z, never ^ and $)

# In Ruby ^ and $ match line boundaries, so a newline can sneak an
# injection past them. Anchor with \A and \z instead.
EMAIL = /\A[A-Za-z0-9](?:[A-Za-z0-9-]|\.(?!\.)){0,62}[A-Za-z0-9]@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}\z/

!!(value =~ EMAIL)

Whichever language you use, the regex proves shape, not existence. Pair it with a verification email before you trust an address.
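The anchoring notes above are easier to trust once you see the failure. The following is a minimal Python sketch, not part of the original page: it reuses the realistic pattern and compares a `^...$` match against `fullmatch` on an input with a trailing newline. The names `BODY`, `LOOSE`, `STRICT`, and `is_email_shape` are illustrative.

```python
import re

# The realistic pattern from the snippets above, without anchors.
BODY = (
    r"[A-Za-z0-9](?:[A-Za-z0-9-]|\.(?!\.)){0,62}[A-Za-z0-9]"
    r"@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}"
)

LOOSE = re.compile("^" + BODY + "$")  # in Python, $ also matches before a final "\n"
STRICT = re.compile(BODY)             # meant to be used with fullmatch()


def is_email_shape(value: str) -> bool:
    """Return True when the whole string has the realistic email shape."""
    return STRICT.fullmatch(value) is not None


tricky = "jane.doe@example.com\n"
print(LOOSE.match(tricky) is not None)  # the trailing newline slips through
print(is_email_shape(tricky))           # fullmatch rejects it
```

One property worth knowing before you paste: as written, the pattern needs a first and a last mailbox character, so it expects a mailbox of at least two characters.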

RFC 5321 takes precedence over RFC 5322

The famous argument that email validation by regex is futile rests on RFC 5322, which treats an email address as a generic message header. Under that reading, parenthesised comments and folded whitespace are legal even inside a domain name, and the canonical "futility" test suite marks addresses with comments and Unicode as valid.

This conflates two standards. Because SMTP is universal for email transmission, no examination of address formatting is complete without RFC 5321, and RFC 5321 is explicit: the domain portion of an address is restricted, for SMTP purposes, to a sequence of letters, digits, and hyphens drawn from the ASCII character set. RFC 5321 has the final word, and it makes the problem far simpler, especially for bounding string length.

What each standard governs
Aspect	RFC 5322 (message header)	RFC 5321 (SMTP address)
Domain characters	Comments, whitespace, broad character set	ASCII letters, digits, hyphens only
Comments in parentheses	Legal, even in the domain	Not part of the deliverable address
Folded whitespace	Legal (multi-line headers)	Not applicable to delivery
Practical relevance	Theoretical edge cases	What actually gets delivered
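Because RFC 5321 bounds string length, the size limits can be checked without any regex at all. The sketch below assumes the 64-octet local-part and 255-octet domain limits from RFC 5321 section 4.5.3.1 and the 63-octet DNS label limit, and assumes ASCII input so that characters equal octets. The function name `within_smtp_limits` is hypothetical.

```python
MAX_LOCAL_PART = 64  # RFC 5321 section 4.5.3.1.1
MAX_DOMAIN = 255     # RFC 5321 section 4.5.3.1.2
MAX_LABEL = 63       # DNS label limit


def within_smtp_limits(address: str) -> bool:
    """Check only the length bounds; character rules are left to the regex."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return False
    if len(local) > MAX_LOCAL_PART or len(domain) > MAX_DOMAIN:
        return False
    # Every dot-separated label must be non-empty and at most 63 characters.
    return all(0 < len(label) <= MAX_LABEL for label in domain.split("."))
```

Running this before the regex gives a cheap early exit on oversized input.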

The data: email addresses in the wild

The right measure for the practical programmer is not the standard, it is what real mailbox providers actually permit. Here is what the data says.

82% → 97% → ~100%

According to Adam Z. Wasserman's analysis of 5,280,739 real email addresses, 82% contain only ASCII alphanumeric characters, 97% contain only ASCII alphanumeric characters plus dots, and approximately 100% contain only ASCII alphanumeric characters, dots, and dashes. In other words, a further 15% add only dots, and a final 3% add only dashes.

Sample drawn from 115 million accounts. 99% confidence level, 0.055% margin of error for the general population of internet email. Source: primary analysis by Adam Z. Wasserman.

Consumer email: character distribution

ASCII alphanumeric only 82%

+ dots 97%

+ dashes ~100%

Rare features in the same sample: underscores 0.00072% (38 addresses), plus signs 0.00051% (27 addresses), Unicode 0.00002% (one single address). Assuming ASCII alphanumeric, dots, and dashes gives better than five-nines accuracy for consumer email.
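The percentages follow directly from the counts, and the arithmetic is quick to reproduce. The sketch below uses only the figures quoted above. The coverage line assumes that no address uses two of the rare features at once, which the source does not state, so treat it as a lower bound on coverage for these three features.

```python
SAMPLE_SIZE = 5_280_739
RARE = {"underscore": 38, "plus sign": 27, "Unicode": 1}


def share_percent(count: int, total: int = SAMPLE_SIZE) -> float:
    """Share of the sample, expressed as a percentage."""
    return 100 * count / total


for feature, count in RARE.items():
    print(f"{feature}: {share_percent(count):.5f}%")

# Assumption: the three groups do not overlap, so their counts simply add.
excluded = sum(RARE.values())
print(f"covered by alphanumeric, dots, dashes: {100 - share_percent(excluded):.5f}%")
```

Under that no-overlap assumption the covered share comes out at about 99.99875%, which is five nines once rounded to three decimal places.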

Business email: the top ten hosts

Of 6,771,269 companies using 91 hosting solutions, the Pareto distribution holds: 95.19% of mailboxes sit with just ten providers. The largest three all permit only ASCII letters, numbers, and dots when creating a mailbox.

Gmail for Business 34.35%

Microsoft Exchange Online 33.60%

GoDaddy (Microsoft 365) 14.71%

7 further providers 12.53%

Business-host data from Datanyze.

When do you actually need this?

Pick what you are building. The honest answer is often "do not lean on the regex alone."

Log-redaction playground

The real use case for email regex is mining and anonymising large volumes of unstructured text: referrer logs, exports, dumps. Paste some text below and redact every address in one pass. This is the kind of job an in-house team can declare impossible and a single compiled regex can finish on a laptop in minutes.


The browser version above is a JavaScript adaptation. It also handles the URL-encoded %40 form of @ while skipping image filenames such as imagefile@2x.png, the same collision the original log-mining job had to dodge.
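For readers without the interactive page, here is a rough Python sketch of the same idea. It is not the page's actual implementation: the boundary lookarounds, the list of image extensions, and the `[redacted]` placeholder are all assumptions chosen for illustration.

```python
import re

# Realistic pattern, unanchored, accepting either "@" or its URL-encoded form.
FINDER = re.compile(
    r"(?<![A-Za-z0-9.-])"                                    # not mid-token
    r"[A-Za-z0-9](?:[A-Za-z0-9-]|\.(?!\.)){0,62}[A-Za-z0-9]"  # mailbox
    r"(?:@|%40)"                                              # delimiter
    r"(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+"   # DNS labels
    r"(?P<tld>[A-Za-z]{2,63})"                                # last label
    r"(?![A-Za-z0-9-])"                                       # token ends here
)

# Assumed list: "domains" ending in these are treated as image filenames.
IMAGE_EXTENSIONS = {"png", "jpg", "jpeg", "gif", "webp", "svg"}


def redact(text: str, placeholder: str = "[redacted]") -> str:
    """Replace every address-shaped token, leaving name@2x.png style files alone."""
    def replace(match: re.Match) -> str:
        if match.group("tld").lower() in IMAGE_EXTENSIONS:
            return match.group(0)
        return placeholder

    return FINDER.sub(replace, text)


line = "GET /?u=jane.doe%40example.com&img=logo@2x.png from bob@mail.example.org"
print(redact(line))
```

Filtering on the captured last label keeps the regex itself simple, at the cost of maintaining the extension list by hand.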

The annotated cookbook

An email address is a mailbox, an @ delimiter, and a domain. Below are composable chunks for each part. Toggle the ones you want and watch the assembled pattern build up. Each chunk carries the annotation explaining what rule it enforces.

Assembled pattern
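Outside the interactive page, the same build-up can be approximated in a few lines. This is a sketch under assumptions: the chunk names and annotations below are invented for illustration, and only the fragments themselves come from the realistic pattern.

```python
import re

# Hypothetical chunk table: name -> (regex fragment, rule it enforces).
CHUNKS = {
    "first": (r"[A-Za-z0-9]", "mailbox starts with a letter or digit"),
    "body": (r"(?:[A-Za-z0-9-]|\.(?!\.)){0,62}", "letters, digits, dashes, single dots"),
    "body_plus": (r"(?:[A-Za-z0-9+-]|\.(?!\.)){0,62}", "as body, plus-addressing allowed"),
    "last": (r"[A-Za-z0-9]", "mailbox ends with a letter or digit"),
    "at": (r"@", "the delimiter"),
    "labels": (r"(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+", "one or more DNS labels"),
    "tld": (r"[A-Za-z]{2,63}", "alphabetic top-level domain"),
}


def assemble(names):
    """Concatenate the chosen chunks, in order, into one pattern string."""
    return "".join(CHUNKS[name][0] for name in names)


REALISTIC = ["first", "body", "last", "at", "labels", "tld"]
WITH_PLUS = ["first", "body_plus", "last", "at", "labels", "tld"]

for name in REALISTIC:
    fragment, rule = CHUNKS[name]
    print(f"{fragment:60} # {rule}")

realistic = re.compile(assemble(REALISTIC))
with_plus = re.compile(assemble(WITH_PLUS))
```

Swapping `body` for `body_plus` is the code equivalent of toggling one chunk: everything else in the assembled pattern stays the same.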

Balancing parentheses: the open challenge

The cookbook leaves one rule unsolved: parenthesised comments are legal only when the parentheses are balanced, and so are quoted strings. Can standard regex enforce that? The honest answer is layered, because it depends entirely on what you count as "standard regex."

Pure regular expressions provably cannot

Balanced parentheses to arbitrary depth is the Dyck language, which is context-free but not regular. The pumping lemma proves no finite automaton can do it: a true regular expression has no memory to count how deep it has gone. If "standard" means formally regular, this is a hard no, and it is a theorem rather than a failure of cleverness.

But real engines go beyond regular, and there are three ways to do it

1. Recursion (PCRE, Perl, Ruby, Python regex module)

(?<paren>\((?:[^()]|(?&paren))*\))

(?&paren) recurses into the named group, so it matches nesting to any depth. Ruby spells the same recursive call \g<paren>. Python's standard-library re does not support recursion; the third-party regex module does.

2. .NET balancing groups (matches and validates in one pattern)

^(?:[^()]|\((?<d>)|\)(?<-d>))*(?(d)(?!))$

(?<d>) pushes on an open paren, (?<-d>) pops on a close (and fails on an unmatched closer), and (?(d)(?!)) fails at the end if anything is still open.

3. Bounded depth, works in any engine including JavaScript (depth ≤ 2)

\((?:[^()]|\((?:[^()])*\))*\)

Each extra level of nesting is one more nested copy. The result is genuinely regular because the depth is finite. JavaScript has no recursion and no balancing groups, so this unrolling is the only regex option in the browser.
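The unrolling is mechanical, so it can be generated rather than typed. The sketch below builds the pattern for any fixed depth using only the standard library; the function name `balanced_to_depth` is an illustrative choice, and depth 2 produces a pattern equivalent to the one shown above.

```python
import re


def balanced_to_depth(depth: int) -> str:
    """Regex source for one (...) group nested at most `depth` levels deep."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    pattern = r"\([^()]*\)"  # depth 1: no parentheses inside
    for _ in range(depth - 1):
        # Wrap the previous level: plain characters or one shallower group.
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return pattern


two_levels = re.compile(balanced_to_depth(2))
three_levels = re.compile(balanced_to_depth(3))
```

The pattern grows by a fixed amount per level, so a small depth limit stays readable while a large one quickly does not.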

The pragmatic answer: count, do not match

Balance-checking is the textbook case where regex is the wrong tool and a one-pass counter is trivially correct, linear time, and engine-independent. And since the data showed parenthesised comments are statistically near zero in the wild, this check rarely earns its keep at all.

The simplest thing that could possibly work

function parensBalanced(s) {
  var depth = 0;
  for (var i = 0; i < s.length; i++) {
    var c = s[i];
    if (c === "\\") { i++; continue; }   // skip escaped char
    else if (c === "(") depth++;
    else if (c === ")" && --depth < 0) return false;
  }
  return depth === 0;
}


Run the test suite

These cases include the contentious ones from the original "futility" suite plus a few of our own. The runner applies both live patterns to every address, and the column matching the current speed is highlighted, so you can see exactly where realistic and ludicrous part ways.


Ludicrous speed

The realistic regex stops at DNS domains because that is what 99.999% of real addresses use. But RFC 5321 also permits address literals: a bracketed IPv4 address, or the full zoo of IPv6 forms. Going to ludicrous speed means actually matching all of them. Unlike the RFC 5322 Full Monty below, this one has no POSIX classes, so it runs natively in your browser. Flip the switch and every validator on the page upgrades to it.

The RFC 5321 compliance attempt: mailbox + @ + (DNS | IPv4 literal | IPv6 literal)

Try john@[192.168.1.1] or user@[IPv6:2001:db8::1] in the tester while on ludicrous: they pass here and fail on realistic. The cost of going ludicrous is a pattern that is dramatically larger, harder to read, and slower, in exchange for matching address forms that essentially never appear in real signups. It is the regex equivalent of strapping a bigger engine to a car you only ever drive to the shops.
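If you need address literals but not the giant pattern, one alternative is to hand the bracketed part to a parser. This swaps the technique: the sketch below uses Python's standard `ipaddress` module instead of a regex, and the function name `is_address_literal` is hypothetical. The module's notion of a valid address may differ from the RFC 5321 grammar at the edges, so treat it as an approximation.

```python
import ipaddress


def is_address_literal(domain: str) -> bool:
    """True for a bracketed IPv4 literal or an [IPv6:...] literal."""
    if not (domain.startswith("[") and domain.endswith("]")):
        return False
    inner = domain[1:-1]
    try:
        if inner[:5].lower() == "ipv6:":
            ipaddress.IPv6Address(inner[5:])  # raises ValueError if malformed
        else:
            ipaddress.IPv4Address(inner)      # raises ValueError if malformed
    except ValueError:
        return False
    return True


for address in ("john@[192.168.1.1]", "user@[IPv6:2001:db8::1]", "john@[999.1.1.1]"):
    domain = address.rpartition("@")[2]
    print(address, is_address_literal(domain))
```

The mailbox still goes through the realistic pattern; only the domain half changes hands.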

Want to go even faster? RFC 5322 takes you beyond ludicrous, straight to plaid. That one is in the Full Monty below, and it cannot even run in a browser.

The Full Monty: gone to plaid

Beyond ludicrous lies RFC 5322 itself. For completeness, here is the assembled pattern with named subgroups for the mailbox, single-dot rule, folded whitespace, the @ delimiter, DNS domains, and IPv4 and IPv6 address literals. It uses POSIX character classes and a folded-whitespace construct that the JavaScript regex engine does not support, so it is shown here as copyable Python.

Python (stdlib re has no POSIX classes, so [[:alnum:]] is spelled out as [A-Za-z0-9]; or open the original in Regex101)

import re

# Mailbox + @ delimiter + domain (DNS or IPv4 literal; the IPv6 forms
# live in the full version).
# Drop the ^ and $ anchors if you are searching inside a longer string
# rather than validating a whole string.
EMAIL = re.compile(r"""
  ^(?P<mailbox>(
      [a-zA-Z0-9+!\#$%&'*\-/=?_{}|~]
    | (?P<singleDot>(?<!\.)(?<!^)\.(?!\.))
   ){1,64})
  \s?(?P<atSign>(?<!-)(?<!\.)@(?!@))
  (?P<domain>
      (?P<dns>[A-Za-z0-9]([A-Za-z0-9\-]{0,63}\.){1,24}[A-Za-z0-9\-]{1,63}[A-Za-z0-9])
    | (?P<IPv4>\[((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\])
  )$
""", re.VERBOSE)

Open the complete pattern and test data on Regex101. The full version also enumerates every IPv6 and IPv6-mapped-IPv4 form (full, compressed, and the four edge forms each), named so each subgroup can be picked apart.

No regex engine survives the jump.

Frequently asked questions

Can you validate an email address with regex?

Yes, for practical purposes. RFC 5321 restricts the domain to ASCII letters, digits, and hyphens, which is straightforward to match. The mailbox has more latitude, but real-world data shows that matching ASCII alphanumeric characters, dots, and dashes covers better than 99.999% of consumer addresses. What regex cannot tell you is whether the address actually exists. For that, send a verification link.

Should I use RFC 5321 or RFC 5322 for email validation?

RFC 5321. It governs SMTP, which is how email is actually transmitted, and it restricts the domain to ASCII letters, digits, and hyphens. RFC 5322 describes generic message headers and permits comments and folded whitespace that never appear in deliverable addresses.

What email regex should I actually use?

For signup, contact forms, and log parsing, match ASCII alphanumeric characters, single dots, and dashes in the mailbox, an @, and a DNS-parsable domain. That is the practical default at the top of this page. Confirm existence with a fire-and-forget verification link rather than trusting the string alone.

Why not just match every RFC 5322 edge case?

Because those cases almost never occur. In 5.28 million addresses, exactly one used Unicode and only 27 used a plus sign. Spending your time on quoted strings and parenthesised comments is optimising for inputs that do not exist. Ask what the simplest thing that could possibly work is, then build that.

Can regex match balanced parentheses?

Not with a formally regular expression: balanced parentheses form the Dyck language, which is context-free, not regular, and the pumping lemma proves no finite automaton can count nesting depth. Real engines that go beyond regular languages can: PCRE and Perl via recursion ((?&name)), and .NET via balancing groups. JavaScript has neither, so in the browser you must either unroll to a fixed depth or, better, count with a one-pass linear scan. See balancing parentheses above.

Is a plus sign valid in an email address?

Yes. A plus sign is legal in the mailbox under RFC 5321, and Gmail and others use it for sub-addressing (you+tag@gmail.com). It is rare in real data (about 0.00051% of addresses), so the strict realistic pattern omits it; if you support plus-addressing, add + to the mailbox character class, or switch the tester to ludicrous speed to see it accepted.

Are email addresses case sensitive?

The domain is case insensitive. The mailbox is technically case sensitive per RFC 5321, but in practice nearly every provider treats it case insensitively. Accept both upper and lower case when validating; when storing and comparing, lowercase the domain, and pragmatically the whole address.
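As a small illustration of that storage rule, the sketch below lowercases the domain always and the mailbox by default. It assumes the address has already passed validation and therefore contains an @; the function and parameter names are hypothetical.

```python
def normalize_for_storage(address: str, fold_mailbox: bool = True) -> str:
    """Lowercase the domain; lowercase the mailbox too unless told otherwise."""
    mailbox, _, domain = address.rpartition("@")
    if fold_mailbox:
        mailbox = mailbox.lower()
    return f"{mailbox}@{domain.lower()}"


print(normalize_for_storage("Jane.Doe@Example.COM"))
print(normalize_for_storage("Jane.Doe@Example.COM", fold_mailbox=False))
```

Keeping `fold_mailbox` as a switch leaves room for the rare provider that really does distinguish case in the mailbox.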

What is the best email validation regex in JavaScript?

Use the realistic pattern and call .test(). It matches ASCII alphanumeric, single dots, and dashes, which covers better than 99.999% of real addresses. Copy it (and the Python, Java, Go, PHP, C#, and Ruby versions) from the email regex in every language above.

How do I redact emails from large log files quickly?

Compile a single regex and apply it across the file in one pass rather than iterating character by character. Try it in the redaction playground above. A compiled pattern can process hundreds of millions of lines on a laptop in minutes.
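A minimal sketch of that single-pass approach for files too large to load at once: compile the pattern once and stream line by line. The pattern is the realistic one with assumed boundary lookarounds added, and the file paths in the usage note are placeholders.

```python
import re
from typing import Iterable, Iterator

# Compiled once, reused for every line.
EMAIL_IN_TEXT = re.compile(
    r"(?<![A-Za-z0-9.-])"
    r"[A-Za-z0-9](?:[A-Za-z0-9-]|\.(?!\.)){0,62}[A-Za-z0-9]"
    r"@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}"
    r"(?![A-Za-z0-9-])"
)


def redact_lines(lines: Iterable[str], placeholder: str = "[redacted]") -> Iterator[str]:
    """Yield each line with address-shaped tokens replaced; memory use stays flat."""
    for line in lines:
        yield EMAIL_IN_TEXT.sub(placeholder, line)
```

With hypothetical paths, usage would look like `with open("access.log") as src, open("access.redacted.log", "w") as dst: dst.writelines(redact_lines(src))`, which never holds more than one line in memory.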
