Quick Answer
If your C project handles UTF-8 text and you need stable comparisons, storage, or slug generation, normalize to NFC with utf8proc before you compare, store, or transform the string. If you are unsure which form to use, choose NFC.
What utf8proc Is
utf8proc is a Unicode processing library for UTF-8 text. It is commonly used in systems code to normalize strings, inspect code points, strip compatibility differences, and handle Unicode safely without converting your pipeline into ad hoc byte hacks.
For most web and application workloads, the practical job is simple: take incoming UTF-8 text, normalize it to NFC, then continue with comparison, indexing, storage, or slug generation.
ICU vs utf8proc Comparison
The two main options for Unicode normalization in C/C++ are ICU and utf8proc. They differ significantly in scope, size, and use case.
| Feature | ICU | utf8proc |
|---|---|---|
| Library size | ~30 MB (data files included) | ~300 KB (single .c and .h file) |
| Normalization (NFC, NFD, NFKC, NFKD) | Yes | Yes |
| Collation and locale-aware sorting | Yes | No |
| Bidirectional text | Yes | No |
| Date/number formatting | Yes | No |
| Regular expressions | Yes | No |
| License | ICU License (permissive) | MIT |
| Dependencies | Multiple data files, build system | None (self-contained) |
| Best for | Full i18n applications | Focused normalization, embedded systems, CLI tools |
Choose utf8proc when you need normalization without the weight of a full internationalization stack. Choose ICU when you also need collation, locale-aware formatting, or bidirectional text support.
Why NFC Matters in C Projects
- String comparison: visibly identical text can compare unequal if forms differ
- Database consistency: mixed forms create duplicate-looking values
- Search indexing: accented queries can miss matches if input is inconsistent
- Slug generation: transliteration gets less predictable when normalization is skipped
When to Normalize
Normalize at the point where text enters your system, not only when you notice a bug later. Good boundaries include API ingestion, file parsing, CLI input, database writes, and search indexing pipelines. The earlier you normalize, the fewer mixed-form bugs you carry downstream.
Real C Code Example with utf8proc
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <utf8proc.h>
char *normalize_nfc(const char *input) {
utf8proc_uint8_t *result = NULL;
utf8proc_ssize_t len;
len = utf8proc_map(
(const utf8proc_uint8_t *)input,
0, /* 0 means null-terminated input */
&result,
UTF8PROC_NULLTERM | UTF8PROC_STABLE |
UTF8PROC_COMPOSE /* NFC = compose */
);
if (len < 0) {
fprintf(stderr, "normalization failed: %s\n",
utf8proc_errmsg(len));
return NULL;
}
return (char *)result; /* caller must free() */
}
int main(void) {
const char *decomposed = "e\xCC\x81"; /* e + combining acute = NFD */
const char *composed = "\xC3\xA9"; /* precomposed e-acute = NFC */
char *normalized = normalize_nfc(decomposed);
if (normalized) {
printf("match: %s\n",
strcmp(normalized, composed) == 0 ? "yes" : "no");
free(normalized);
}
return 0;
}
/* Compile: gcc -o norm norm.c -lutf8proc */
The key flag is UTF8PROC_COMPOSE, which produces NFC output. For NFD, use UTF8PROC_DECOMPOSE instead. For NFKC, combine UTF8PROC_COMPOSE | UTF8PROC_COMPAT.
Platform-Specific Normalization Behavior
Different operating systems handle Unicode normalization differently at the filesystem level. This matters when your C program reads filenames or processes paths.
| Platform | Filesystem Normalization | Implications |
|---|---|---|
| macOS (HFS+/APFS) | NFD variant | HFS+ decomposes filenames on disk using a variant of NFD, and directory listings return the NFD form. APFS preserves the bytes as written but compares names normalization-insensitively. Either way, files created as cafe\u0301 and caf\u00e9 refer to the same file. |
| Windows (NTFS) | Preserves original form (usually NFC) | Windows does not normalize filenames. Two files with NFC and NFD names can coexist as separate entries. |
| Linux (ext4, XFS) | Byte-preserving (no normalization) | Filenames are raw byte sequences. NFC and NFD filenames are distinct files. This causes cross-platform bugs when syncing with macOS. |
If your C program processes filenames across platforms, normalize them to NFC before comparison or storage. Otherwise, the same file can appear as two different entries when synced between macOS and Linux.
Database Storage Implications
Database engines vary in their Unicode handling. Normalizing in your application layer with utf8proc gives you consistent behavior regardless of the database backend.
| Database | Default Behavior | Recommendation |
|---|---|---|
| PostgreSQL | Stores bytes as-is. Comparison depends on collation. ICU collations can normalize, but the default C collation does byte comparison. | Normalize to NFC in your application before INSERT. |
| MySQL | The utf8mb4 charset stores raw bytes. Collations like utf8mb4_unicode_ci handle case-insensitive comparison but do not normalize stored data. | Normalize to NFC before storage. Do not rely on collation to fix normalization differences. |
| SQLite | Stores raw bytes. No built-in normalization. Comparison is binary unless you load an ICU extension. | Normalize to NFC in your application before any write or comparison. |
Migration Strategy: Adding Normalization to Existing Projects
If you have an existing C project with no normalization, adding utf8proc takes four steps:
- Add the dependency: utf8proc is available as a system package on most Linux distributions (libutf8proc-dev), via Homebrew on macOS (brew install utf8proc), or as a vendored single-file include.
- Identify input boundaries: find every point where external text enters your system (file reads, API responses, CLI arguments, database reads). These are your normalization points.
- Add normalization calls: wrap incoming text with your normalize_nfc() function at each boundary. Return the normalized copy and free the original if needed.
- Migrate stored data: if your database already contains mixed-form text, run a one-time migration that reads each text value, normalizes it to NFC, and writes it back. For PostgreSQL, you can do this with a custom function or a script that reads rows and updates them.
What to Watch Out For
| Pitfall | Why it happens | Fix |
|---|---|---|
| Comparing raw UTF-8 bytes directly | Equivalent visible text can have different code-point sequences | Normalize both strings to NFC first |
| Normalizing some inputs but not others | Mixed-form data spreads across the system | Normalize at every input boundary |
| Generating slugs before normalization | Accented characters may decompose differently | Normalize first, then generate the slug |
| Confusing NFC with NFKC | Compatibility normalization can change meaning more aggressively | Use NFC unless you specifically need compatibility folding |
| Forgetting to free the result | utf8proc_map allocates a new buffer | Always free() the returned pointer when done |
When Not to Use NFC
Do not assume NFC is the answer when you specifically need decomposition for analysis. Accent stripping and combining-mark inspection often start with NFD or NFKD. For general storage and comparison, though, NFC remains the safer default.
FAQ
Should I normalize all UTF-8 text to NFC in my C project?
For storage, comparison, APIs, and search-oriented text handling, yes. It is the safest default in most application pipelines.
Does utf8proc solve hidden character problems too?
It helps with normalization, but hidden characters such as zero-width spaces or BOM markers may still need explicit inspection and cleanup. See the hidden Unicode characters guide for detection and stripping patterns.
Can normalization affect slugs?
Yes. Skipping normalization can produce inconsistent slugs for accented titles that look identical to users.
What is the difference between NFC and NFKC in utf8proc?
NFC composes characters to their precomposed form. NFKC does the same but also replaces compatibility characters (like ligatures, superscripts, and fullwidth forms) with their standard equivalents. Use NFKC when you want maximum folding, such as for search indexing. Use NFC when you want to preserve visual distinctions.
How large is utf8proc compared to ICU?
utf8proc is roughly 300 KB including its Unicode data tables. ICU can be 30 MB or more. If you only need normalization and basic Unicode properties, utf8proc is the lighter choice by two orders of magnitude.
Does utf8proc support the latest Unicode version?
utf8proc tracks Unicode releases, but there can be a lag. Call utf8proc_unicode_version() at runtime, or check the release notes for your build, to confirm which Unicode version your copy supports. For most normalization work, version differences rarely affect results.
Related Tools
- URL Slug Generator to create stable slugs after normalization
- Text Analysis Tool to inspect suspicious character-count behavior
- URL Encoder / Decoder to inspect encoded Unicode in URLs
- Regex Tester Online to test patterns on normalized text
Related Guides
- NFC vs NFD: Unicode Normalization Explained for the conceptual overview
- Hidden Unicode Characters Guide for detecting and stripping invisible characters
- Regex Debugging Guide for troubleshooting patterns that interact with Unicode