Quick Answer: NFC vs NFD
Use NFC for storage, APIs, HTML, search indexing, and most production text handling. Use NFD only when you intentionally need decomposed characters for analysis, accent stripping, or compatibility with systems that store decomposed filenames. If you are unsure, choose NFC.
What Unicode Normalization Is
Unicode normalization converts equivalent strings into a consistent form so visually identical text behaves consistently in code, search, storage, and comparison. Without normalization, two strings that look the same on screen can fail equality checks because they are stored as different code-point sequences.
The most common comparison is NFC vs NFD. NFC keeps characters in composed form where possible. NFD decomposes them into base characters plus combining marks. Both are valid Unicode. They are not interchangeable unless you normalize them first.
Composed vs Decomposed: Visual Code Point Comparison
The character "e with acute accent" can be stored two ways in Unicode. Both render identically on screen, but they are different byte sequences.
| Form | Display | Code Points | Bytes (UTF-8) |
|---|---|---|---|
| NFC (composed) | é | U+00E9 (1 code point) | C3 A9 (2 bytes) |
| NFD (decomposed) | é | U+0065 U+0301 (2 code points) | 65 CC 81 (3 bytes) |
More complex characters have larger differences. The Korean syllable "ga" in NFC is a single code point (U+AC00), while in NFD it decomposes to two code points (U+1100 U+1161). Characters with multiple combining marks, like "o with diaeresis and macron," can decompose to three or more code points.
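The Hangul decomposition above can be verified directly with Python's standard `unicodedata` module, which is a convenient way to inspect what each normalization form produces:

```python
import unicodedata

# U+AC00 is the composed Hangul syllable "ga" (가)
ga = "\uAC00"
nfd = unicodedata.normalize("NFD", ga)

print(len(ga))   # 1 code point in NFC
print(len(nfd))  # 2 code points in NFD
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+1100', 'U+1161']
```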
NFC vs NFD Side by Side
| Property | NFC | NFD |
|---|---|---|
| How é is stored | Single composed code point | Base letter + combining accent |
| Best default for web apps | Yes | No |
| Best for storage and indexing | Yes | Usually no |
| Best for decomposition work | No | Yes |
| String equality without normalization | Not equal to NFD even when the text looks identical | Not equal to NFC even when the text looks identical |
Comparison summary: NFC is the default form for storage, indexing, APIs, and most web application text. NFD is mainly useful when you intentionally need decomposed characters for analysis or accent stripping.
NFKC and NFKD: Compatibility Normalization
Beyond NFC and NFD, Unicode defines two compatibility forms: NFKC and NFKD. These replace compatibility characters with their canonical equivalents in addition to composing or decomposing.
| Form | Composition | Compatibility Mapping | Use Case |
|---|---|---|---|
| NFC | Composed | No | General storage, APIs, web content |
| NFD | Decomposed | No | Accent stripping, character analysis |
| NFKC | Composed | Yes | Search indexing, username comparison, identifier matching |
| NFKD | Decomposed | Yes | Full decomposition for analysis |
Compatibility normalization folds characters that are semantically equivalent but visually distinct. Examples include:
- Fullwidth digits: １ (U+FF11) becomes 1 (U+0031)
- Ligatures: ﬁ (U+FB01, fi ligature) becomes fi (two characters)
- Superscripts: ² (U+00B2) becomes 2 (U+0032)
- Circled letters: Ⓐ (U+24B6) becomes A (U+0041)
Use NFKC when you need maximum folding for search or identity comparison. Do not use it for display or storage where the visual distinction matters, because the normalization is lossy and cannot be reversed.
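The compatibility foldings listed above are easy to confirm with `unicodedata.normalize`; each of the listed characters maps to its plain ASCII equivalent under NFKC:

```python
import unicodedata

# Fullwidth digit, fi ligature, superscript two, circled A
samples = ["\uFF11", "\uFB01", "\u00B2", "\u24B6"]
for s in samples:
    folded = unicodedata.normalize("NFKC", s)
    print(f"{s!r} -> {folded!r}")

# NFKC is lossy: once folded, the original form cannot be recovered
assert unicodedata.normalize("NFKC", "\uFB01") == "fi"
```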
When to Use NFC
- Database storage: normalize user input before insert and comparison
- Search indexing: avoid missed matches on accented text
- API payloads: keep signatures, checksums, and comparisons consistent
- Slug generation: normalize before transliteration in the URL Slug Generator
- General web content: use NFC as the safe default for production text
When to Use NFD
- Accent stripping: decompose first, then remove combining marks
- Character-level analysis: inspect combining marks directly
- Special compatibility cases: work with systems that already store decomposed text
Normalization in Databases
Most databases store Unicode text as raw bytes and do not normalize on write. This means your application is responsible for normalization.
| Database | Behavior | Collation Notes |
|---|---|---|
| PostgreSQL | Stores bytes as-is. ICU-based collations (available since v10) can perform normalization-aware comparison, but stored data is not normalized. | Use CREATE COLLATION with ICU for accent-insensitive searches. Normalize to NFC in your application before writes. |
| MySQL | Stores raw bytes with utf8mb4. Collations handle comparison but not storage normalization. | utf8mb4_unicode_ci treats NFC and NFD equivalents as equal in comparisons, but UNIQUE constraints operate on stored bytes. |
| SQLite | No built-in normalization. Binary comparison by default. | Load the ICU extension for Unicode-aware collation, or normalize in your application. |
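SQLite's byte-level default comparison makes the problem easy to demonstrate. The sketch below uses an in-memory database and a hypothetical `users` table to show how visually identical "duplicates" slip past a UNIQUE constraint unless the application normalizes first:

```python
import sqlite3
import unicodedata

# SQLite compares TEXT byte-by-byte by default, so a UNIQUE
# constraint does not catch NFC/NFD "duplicates".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT UNIQUE)")

composed = "caf\u00E9"     # NFC
decomposed = "cafe\u0301"  # NFD, renders identically

conn.execute("INSERT INTO users VALUES (?)", (composed,))
conn.execute("INSERT INTO users VALUES (?)", (decomposed,))  # no conflict
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 2 — two rows that look like the same name

# Normalizing at the application boundary prevents this:
assert unicodedata.normalize("NFC", decomposed) == composed
conn.close()
```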
Cross-Platform Normalization Differences
Operating systems handle filename normalization differently, which causes bugs in cross-platform applications and file-syncing tools.
| Platform | Filesystem Behavior | Impact |
|---|---|---|
| macOS (HFS+/APFS) | HFS+ converts filenames to a variant of NFD on disk. APFS preserves the original bytes but treats NFC- and NFD-equivalent names as the same file. | A file created as caf\u00e9.txt and one created as cafe\u0301.txt refer to the same file. On HFS+, directory listings return the decomposed form. |
| Windows (NTFS) | Preserves the original byte sequence. No normalization is applied. | NFC and NFD filenames can coexist as separate files. Most Windows applications produce NFC text, so this rarely causes issues on Windows alone. |
| Linux (ext4, XFS, Btrfs) | Filenames are opaque byte sequences. No normalization, no case folding. | Two filenames that look identical but differ in normalization form are treated as entirely separate files. This causes sync conflicts with macOS. |
If your application handles files across platforms, normalize filenames to NFC before storing paths in a database or performing lookups.
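One minimal way to follow that advice is to funnel every path through a single normalization helper before it touches the database or a lookup. The helper name below is illustrative, not part of any standard library:

```python
import unicodedata

def canonical_path_key(path: str) -> str:
    """Normalize a filename to NFC before storing or comparing it.

    macOS (HFS+) directory listings return decomposed names, while
    Linux returns whatever bytes were originally written. Normalizing
    both sides makes lookups consistent across platforms.
    """
    return unicodedata.normalize("NFC", path)

mac_listing = "cafe\u0301.txt"  # NFD, as a macOS listing might return it
db_record = "caf\u00E9.txt"     # NFC, as the application stored it

print(mac_listing == db_record)                      # False
print(canonical_path_key(mac_listing) == db_record)  # True
```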
W3C and WHATWG Recommendations
The W3C Character Model specification recommends NFC for all web content. WHATWG specifications for HTML and the URL standard also assume NFC. Specifically:
- HTML documents should use NFC for text content. The HTML spec does not enforce this, but mixed forms cause interoperability issues with search, indexing, and comparison.
- URLs use percent-encoding for non-ASCII characters. The URL standard expects the text to be NFC before percent-encoding. Encoding NFD text produces longer URLs with different byte sequences for the same visible characters.
- Form submissions send text as-is. If a user's browser submits NFD text (common on macOS), the server receives NFD unless it normalizes on ingestion.
When Not to Mix Forms
Do not keep mixed NFC and NFD strings in the same database, index, or comparison workflow. That creates subtle bugs: duplicate-looking values, failed lookups, incorrect length checks, and inconsistent slugs. Normalize at input boundaries and normalize again before equality-sensitive comparisons if necessary.
Common Bugs Caused by Mixed Forms
| Bug | What causes it | Fix |
|---|---|---|
| Database lookup fails for accented text | Stored values and query values use different forms | Normalize both sides to NFC |
| Regex behaves strangely on accented words | Combining marks are separate code points in NFD | Normalize first or use Unicode-aware regex |
| Character counts look wrong | NFD increases visible-character length at the code-point level | Normalize to NFC before counting, then inspect with the Text Analysis Tool |
| Slug output changes between systems | Different normalization forms reach the slug step | Normalize to NFC before slug creation |
Bug summary: Mixed normalization forms most often break lookup consistency, regex behavior on accented text, character counts, and slug output across systems.
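The regex failure mode from the table is worth seeing concretely. A pattern written in NFC silently fails against NFD input because the combining mark is a separate code point:

```python
import re
import unicodedata

word = "cafe\u0301"  # NFD: "café" with a combining acute accent

# The pattern contains the composed é (U+00E9), the text the
# decomposed sequence, so an exact match fails.
print(re.fullmatch("caf\u00E9", word))  # None

# Normalizing the input first makes the match succeed.
normalized = unicodedata.normalize("NFC", word)
print(re.fullmatch("caf\u00E9", normalized) is not None)  # True
```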
Code Examples
JavaScript
```javascript
const composed = "\u00E9";
const decomposed = "e\u0301";
console.log(composed === decomposed); // false
console.log(decomposed.normalize("NFC") === composed); // true

function normalizeForStorage(value) {
  return value.normalize("NFC").trim();
}

// NFKC for search indexing (folds ligatures, fullwidth, etc.)
function normalizeForSearch(value) {
  return value.normalize("NFKC").toLowerCase().trim();
}
```
Python
```python
import unicodedata

composed = "\u00E9"
decomposed = "e\u0301"
print(composed == decomposed)  # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

def normalize_for_storage(value):
    return unicodedata.normalize("NFC", value).strip()

# NFKC for search indexing
def normalize_for_search(value):
    return unicodedata.normalize("NFKC", value).casefold().strip()

# Accent stripping via NFD
def strip_accents(value):
    nfd = unicodedata.normalize("NFD", value)
    return ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')
```
FAQ
Should I store text as NFC?
Yes, in almost all web, API, database, and search cases. NFC is the safe default for production text handling.
Does JavaScript normalize strings automatically?
No. You must call .normalize("NFC") or another normalization form explicitly.
When is NFD useful?
NFD is useful when you intentionally need decomposed characters, such as accent stripping or combining-mark inspection.
Can this affect slugs and search?
Yes. Mixed normalization forms can produce inconsistent slugs and reduce matching consistency in search and indexes.
What is the difference between NFC and NFKC?
NFC composes characters but preserves compatibility distinctions (ligatures, fullwidth forms, superscripts). NFKC applies compatibility decomposition first, folding those distinctions away. Use NFC for storage, NFKC for search indexing or identifier comparison.
Why do files from macOS cause normalization issues on Linux?
macOS stores filenames in a variant of NFD, while Linux treats filenames as raw byte sequences. A file created on macOS with an accented name arrives on Linux in NFD form. If your Linux code expects NFC, the filename will not match. Normalize filenames to NFC when reading from cross-platform sources.
Related Tools
- Text Analysis Tool to inspect text length and spot suspicious character-count behavior
- URL Slug Generator to create clean slugs after normalization
- URL Encoder / Decoder to inspect encoded Unicode in URLs and query strings
- Remove Line Breaks to clean copied text before further normalization work
- Regex Tester Online to test regex patterns on normalized text
Related Guides
- utf8proc NFC Normalization in C for implementing normalization in C projects
- Hidden Unicode Characters Guide for detecting and removing invisible characters
- Regex Debugging Guide for troubleshooting patterns that interact with Unicode text