Quick Answer: NFC vs NFD
Use NFC for storage, APIs, HTML, search indexing, and most production text handling. Use NFD only when you intentionally need decomposed characters for analysis, accent stripping, or compatibility with systems that store decomposed filenames. If you are unsure, choose NFC.
What Unicode Normalization Is
Unicode normalization converts equivalent strings into a consistent form so visually identical text behaves consistently in code, search, storage, and comparison. Without normalization, two strings that look the same on screen can fail equality checks because they are stored as different code-point sequences.
The most common comparison is NFC vs NFD. NFC keeps characters in composed form where possible. NFD decomposes them into base characters plus combining marks. Both are valid Unicode. They are not interchangeable unless you normalize them first.
Composed vs Decomposed: Visual Code Point Comparison
The character "e with acute accent" can be stored two ways in Unicode. Both render identically on screen, but they are different byte sequences.
| Form | Display | Code Points | Bytes (UTF-8) |
|---|---|---|---|
| NFC (composed) | é | U+00E9 (1 code point) | C3 A9 (2 bytes) |
| NFD (decomposed) | é | U+0065 U+0301 (2 code points) | 65 CC 81 (3 bytes) |
More complex characters have larger differences. The Korean syllable "ga" in NFC is a single code point (U+AC00), while in NFD it decomposes to two code points (U+1100 U+1161). Characters with multiple combining marks, like "o with diaeresis and macron," can decompose to three or more code points.
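The Hangul decomposition above can be verified directly with Python's standard `unicodedata` module, which is a convenient way to inspect what each normalization form produces:

```python
import unicodedata

# U+AC00 is the composed Hangul syllable "ga" (가)
ga = "\uAC00"
nfd = unicodedata.normalize("NFD", ga)

print(len(ga))   # 1 code point in NFC
print(len(nfd))  # 2 code points in NFD
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+1100', 'U+1161']
```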
NFC vs NFD Side by Side
| Property | NFC | NFD |
|---|---|---|
| How é is stored | Single composed code point | Base letter + combining accent |
| Best default for web apps | Yes | No |
| Best for storage and indexing | Yes | Usually no |
| Best for decomposition work | No | Yes |
| String equality without normalization | Not equal to NFD even when the text looks identical | Not equal to NFC even when the text looks identical |
Comparison summary: NFC is the default form for storage, indexing, APIs, and most web application text. NFD is mainly useful when you intentionally need decomposed characters for analysis or accent stripping.
NFKC and NFKD: Compatibility Normalization
Beyond NFC and NFD, Unicode defines two compatibility forms: NFKC and NFKD. These replace compatibility characters with their canonical equivalents in addition to composing or decomposing.
| Form | Composition | Compatibility Mapping | Use Case |
|---|---|---|---|
| NFC | Composed | No | General storage, APIs, web content |
| NFD | Decomposed | No | Accent stripping, character analysis |
| NFKC | Composed | Yes | Search indexing, username comparison, identifier matching |
| NFKD | Decomposed | Yes | Full decomposition for analysis |
Compatibility normalization folds characters that are semantically equivalent but visually distinct. Examples include:
- Fullwidth digits: １ (U+FF11) becomes 1 (U+0031)
- Ligatures: ﬁ (U+FB01, fi ligature) becomes fi (two characters)
- Superscripts: ² (U+00B2) becomes 2 (U+0032)
- Circled letters: Ⓐ (U+24B6) becomes A (U+0041)
Use NFKC when you need maximum folding for search or identity comparison. Do not use it for display or storage where the visual distinction matters, because the normalization is lossy and cannot be reversed.
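The compatibility foldings listed above are easy to confirm with `unicodedata.normalize`; each of the listed characters maps to its plain ASCII equivalent under NFKC:

```python
import unicodedata

# Fullwidth digit, fi ligature, superscript two, circled A
samples = ["\uFF11", "\uFB01", "\u00B2", "\u24B6"]
for s in samples:
    folded = unicodedata.normalize("NFKC", s)
    print(f"{s!r} -> {folded!r}")

# NFKC is lossy: once folded, the original form cannot be recovered
assert unicodedata.normalize("NFKC", "\uFB01") == "fi"
```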
When to Use NFC
- Database storage: normalize user input before insert and comparison
- Search indexing: avoid missed matches on accented text
- API payloads: keep signatures, checksums, and comparisons consistent
- Slug generation: normalize before transliteration in the URL Slug Generator
- General web content: use NFC as the safe default for production text
When to Use NFD
- Accent stripping: decompose first, then remove combining marks
- Character-level analysis: inspect combining marks directly
- Special compatibility cases: work with systems that already store decomposed text
Normalization in Databases
Most databases store Unicode text as raw bytes and do not normalize on write. This means your application is responsible for normalization.
| Database | Behavior | Collation Notes |
|---|---|---|
| PostgreSQL | Stores bytes as-is. ICU-based collations (available since v10) can perform normalization-aware comparison, but stored data is not normalized. | Use CREATE COLLATION with ICU for accent-insensitive searches. Normalize to NFC in your application before writes. |
| MySQL | Stores raw bytes with utf8mb4. Collations handle comparison but not storage normalization. | utf8mb4_unicode_ci treats NFC and NFD equivalents as equal in comparisons, but UNIQUE constraints operate on stored bytes. |
| SQLite | No built-in normalization. Binary comparison by default. | Load the ICU extension for Unicode-aware collation, or normalize in your application. |
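SQLite's byte-level default comparison makes the problem easy to demonstrate. The sketch below uses an in-memory database and a hypothetical `users` table to show how visually identical "duplicates" slip past a UNIQUE constraint unless the application normalizes first:

```python
import sqlite3
import unicodedata

# SQLite compares TEXT byte-by-byte by default, so a UNIQUE
# constraint does not catch NFC/NFD "duplicates".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT UNIQUE)")

composed = "caf\u00E9"     # NFC
decomposed = "cafe\u0301"  # NFD, renders identically

conn.execute("INSERT INTO users VALUES (?)", (composed,))
conn.execute("INSERT INTO users VALUES (?)", (decomposed,))  # no conflict
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 2 — two rows that look like the same name

# Normalizing at the application boundary prevents this:
assert unicodedata.normalize("NFC", decomposed) == composed
conn.close()
```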
Cross-Platform Normalization Differences
Operating systems handle filename normalization differently, which causes bugs in cross-platform applications and file-syncing tools.
| Platform | Filesystem Behavior | Impact |
|---|---|---|
| macOS (HFS+/APFS) | HFS+ converts filenames to a variant of NFD on disk. APFS preserves the original bytes but treats NFC- and NFD-equivalent names as the same file. | A file created as caf\u00e9.txt and one created as cafe\u0301.txt refer to the same file. On HFS+, directory listings return the decomposed form. |
| Windows (NTFS) | Preserves the original byte sequence. No normalization is applied. | NFC and NFD filenames can coexist as separate files. Most Windows applications produce NFC text, so this rarely causes issues on Windows alone. |
| Linux (ext4, XFS, Btrfs) | Filenames are opaque byte sequences. No normalization, no case folding. | Two filenames that look identical but differ in normalization form are treated as entirely separate files. This causes sync conflicts with macOS. |
If your application handles files across platforms, normalize filenames to NFC before storing paths in a database or performing lookups.
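One minimal way to follow that advice is to funnel every path through a single normalization helper before it touches the database or a lookup. The helper name below is illustrative, not part of any standard library:

```python
import unicodedata

def canonical_path_key(path: str) -> str:
    """Normalize a filename to NFC before storing or comparing it.

    macOS (HFS+) directory listings return decomposed names, while
    Linux returns whatever bytes were originally written. Normalizing
    both sides makes lookups consistent across platforms.
    """
    return unicodedata.normalize("NFC", path)

mac_listing = "cafe\u0301.txt"  # NFD, as a macOS listing might return it
db_record = "caf\u00E9.txt"     # NFC, as the application stored it

print(mac_listing == db_record)                      # False
print(canonical_path_key(mac_listing) == db_record)  # True
```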
W3C and WHATWG Recommendations
The W3C Character Model specification recommends NFC for all web content. WHATWG specifications for HTML and the URL standard also assume NFC. Specifically:
- HTML documents should use NFC for text content. The HTML spec does not enforce this, but mixed forms cause interoperability issues with search, indexing, and comparison.
- URLs use percent-encoding for non-ASCII characters. The URL standard expects the text to be NFC before percent-encoding. Encoding NFD text produces longer URLs with different byte sequences for the same visible characters.
- Form submissions send text as-is. If a user's browser submits NFD text (common on macOS), the server receives NFD unless it normalizes on ingestion.
When Not to Mix Forms
Do not keep mixed NFC and NFD strings in the same database, index, or comparison workflow. That creates subtle bugs: duplicate-looking values, failed lookups, incorrect length checks, and inconsistent slugs. Normalize at input boundaries and normalize again before equality-sensitive comparisons if necessary.
Common Bugs Caused by Mixed Forms
| Bug | What causes it | Fix |
|---|---|---|
| Database lookup fails for accented text | Stored values and query values use different forms | Normalize both sides to NFC |
| Regex behaves strangely on accented words | Combining marks are separate code points in NFD | Normalize first or use Unicode-aware regex |
| Character counts look wrong | NFD increases visible-character length at the code-point level | Normalize to NFC before counting, then inspect with the Text Analysis Tool |
| Slug output changes between systems | Different normalization forms reach the slug step | Normalize to NFC before slug creation |
Bug summary: Mixed normalization forms most often break lookup consistency, regex behavior on accented text, character counts, and slug output across systems.
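The regex failure mode from the table is worth seeing concretely. A pattern written in NFC silently fails against NFD input because the combining mark is a separate code point:

```python
import re
import unicodedata

word = "cafe\u0301"  # NFD: "café" with a combining acute accent

# The pattern contains the composed é (U+00E9), the text the
# decomposed sequence, so an exact match fails.
print(re.fullmatch("caf\u00E9", word))  # None

# Normalizing the input first makes the match succeed.
normalized = unicodedata.normalize("NFC", word)
print(re.fullmatch("caf\u00E9", normalized) is not None)  # True
```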
Code Examples
JavaScript
```javascript
const composed = "\u00E9";
const decomposed = "e\u0301";
console.log(composed === decomposed); // false
console.log(decomposed.normalize("NFC") === composed); // true

function normalizeForStorage(value) {
  return value.normalize("NFC").trim();
}

// NFKC for search indexing (folds ligatures, fullwidth, etc.)
function normalizeForSearch(value) {
  return value.normalize("NFKC").toLowerCase().trim();
}
```
Python
```python
import unicodedata

composed = "\u00E9"
decomposed = "e\u0301"
print(composed == decomposed)  # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

def normalize_for_storage(value):
    return unicodedata.normalize("NFC", value).strip()

# NFKC for search indexing
def normalize_for_search(value):
    return unicodedata.normalize("NFKC", value).casefold().strip()

# Accent stripping via NFD
def strip_accents(value):
    nfd = unicodedata.normalize("NFD", value)
    return ''.join(c for c in nfd if unicodedata.category(c) != 'Mn')
```
FAQ
Should I store text as NFC?
Yes, in almost all web, API, database, and search cases. NFC is the safe default for production text handling.
Does JavaScript normalize strings automatically?
No. You must call .normalize("NFC") or another normalization form explicitly.
When is NFD useful?
NFD is useful when you intentionally need decomposed characters, such as accent stripping or combining-mark inspection.
Can this affect slugs and search?
Yes. Mixed normalization forms can produce inconsistent slugs and reduce matching consistency in search and indexes.
What is the difference between NFC and NFKC?
NFC composes characters but preserves compatibility distinctions (ligatures, fullwidth forms, superscripts). NFKC applies compatibility decomposition first, folding those distinctions away. Use NFC for storage, NFKC for search indexing or identifier comparison.
Why do files from macOS cause normalization issues on Linux?
macOS stores filenames in a variant of NFD, while Linux treats filenames as raw byte sequences. A file created on macOS with an accented name arrives on Linux in NFD form. If your Linux code expects NFC, the filename will not match. Normalize filenames to NFC when reading from cross-platform sources.
Related Tools
- Text Analysis Tool to inspect text length and spot suspicious character-count behavior
- URL Slug Generator to create clean slugs after normalization
- URL Encoder / Decoder to inspect encoded Unicode in URLs and query strings
- Remove Line Breaks to clean copied text before further normalization work
- Regex Tester Online to test regex patterns on normalized text
Related Guides
- utf8proc NFC Normalization in C for implementing normalization in C projects
- Hidden Unicode Characters Guide for detecting and removing invisible characters
- Regex Debugging Guide for troubleshooting patterns that interact with Unicode text