FORMATFORGE // KNOWLEDGE_BASE

How to Normalize UTF-8 to NFC with utf8proc

Updated: April 2026

Quick Answer

If your C project handles UTF-8 text and you need stable comparisons, storage, or slug generation, normalize to NFC with utf8proc before you compare, store, or transform the string. If you are unsure which form to use, choose NFC.

What utf8proc Is

utf8proc is a Unicode processing library for UTF-8 text. It is commonly used in systems code to normalize strings, inspect code points, fold compatibility differences, and handle Unicode safely without resorting to ad hoc byte manipulation.

For most web and application workloads, the practical job is simple: take incoming UTF-8 text, normalize it to NFC, then continue with comparison, indexing, storage, or slug generation.

ICU vs utf8proc Comparison

The two main options for Unicode normalization in C/C++ are ICU and utf8proc. They differ significantly in scope, size, and use case.

Feature | ICU | utf8proc
Library size | ~30 MB (data files included) | ~300 KB (single .c and .h file)
Normalization (NFC, NFD, NFKC, NFKD) | Yes | Yes
Collation and locale-aware sorting | Yes | No
Bidirectional text | Yes | No
Date/number formatting | Yes | No
Regular expressions | Yes | No
License | ICU License (permissive) | MIT
Dependencies | Multiple data files, build system | None (self-contained)
Best for | Full i18n applications | Focused normalization, embedded systems, CLI tools

Choose utf8proc when you need normalization without the weight of a full internationalization stack. Choose ICU when you also need collation, locale-aware formatting, or bidirectional text support.

Why NFC Matters in C Projects
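
As a small illustration of the problem, the sketch below compares the two UTF-8 spellings of "é" byte by byte, which is what strcmp-based code effectively does. The helper name bytes_equal is hypothetical and only wraps strcmp; it needs no library.

```c
#include <string.h>

/* Byte-wise equality: the test that strcmp-based code performs.
   It knows nothing about Unicode canonical equivalence. */
int bytes_equal(const char *a, const char *b) {
    return strcmp(a, b) == 0;
}

/* Two spellings that render identically:
     "e\xCC\x81"  is 'e' + U+0301 combining acute (NFD, 3 bytes)
     "\xC3\xA9"   is U+00E9 precomposed e-acute   (NFC, 2 bytes)
   bytes_equal() reports them as different until both are normalized
   to the same form. */
```

Without normalization, the same user-visible word can therefore miss a lookup, duplicate a key, or produce two different slugs.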

When to Normalize

Normalize at the point where text enters your system, not only when you notice a bug later. Good boundaries include API ingestion, file parsing, CLI input, database writes, and search indexing pipelines. The earlier you normalize, the fewer mixed-form bugs you carry downstream.

Real C Code Example with utf8proc

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <utf8proc.h>

char *normalize_nfc(const char *input) {
    utf8proc_uint8_t *result = NULL;
    utf8proc_ssize_t len;

    len = utf8proc_map(
        (const utf8proc_uint8_t *)input,
        0,  /* length is ignored because UTF8PROC_NULLTERM is set */
        &result,
        UTF8PROC_NULLTERM | UTF8PROC_STABLE |
        UTF8PROC_COMPOSE  /* NFC = compose */
    );

    if (len < 0) {
        fprintf(stderr, "normalization failed: %s\n",
                utf8proc_errmsg(len));
        return NULL;
    }

    return (char *)result;  /* caller must free() */
}

int main(void) {
    const char *decomposed = "e\xCC\x81";  /* e + combining acute = NFD */
    const char *composed   = "\xC3\xA9";   /* precomposed e-acute = NFC */

    char *normalized = normalize_nfc(decomposed);
    if (normalized) {
        printf("match: %s\n",
               strcmp(normalized, composed) == 0 ? "yes" : "no");
        free(normalized);
    }
    return 0;
}
/* Compile: gcc -o norm norm.c -lutf8proc */

The key flag is UTF8PROC_COMPOSE, which produces NFC output. For NFD, use UTF8PROC_DECOMPOSE instead. For NFKC, combine UTF8PROC_COMPOSE | UTF8PROC_COMPAT.

Platform-Specific Normalization Behavior

Different operating systems handle Unicode normalization differently at the filesystem level. This matters when your C program reads filenames or processes paths.

Platform | Filesystem Normalization | Implications
macOS (HFS+/APFS) | HFS+ stores a variant of NFD. APFS keeps the form it was given but treats canonically equivalent names as the same file. | On both filesystems, cafe\u0301 and caf\u00e9 refer to the same file. On HFS+, listing the directory returns the decomposed form, so a name read back may not byte-match the name you wrote.
Windows (NTFS) | Preserves original form (usually NFC) | Windows does not normalize filenames. Two files with NFC and NFD names can coexist as separate entries.
Linux (ext4, XFS) | Byte-preserving (no normalization) | Filenames are raw byte sequences. NFC and NFD filenames are distinct files. This causes cross-platform bugs when syncing with macOS.

If your C program processes filenames across platforms, normalize them to NFC before comparison or storage. Otherwise, the same file can appear as two different entries when synced between macOS and Linux.

Database Storage Implications

Database engines vary in their Unicode handling. Normalizing in your application layer with utf8proc gives you consistent behavior regardless of the database backend.

Database | Default Behavior | Recommendation
PostgreSQL | Stores bytes as-is. Comparison depends on collation. The C collation compares bytes, and deterministic collations treat strings as equal only when their bytes match. Nondeterministic ICU collations can treat different normal forms as equal. | Normalize to NFC in your application before INSERT.
MySQL | The utf8mb4 charset stores raw bytes. Collations like utf8mb4_unicode_ci handle case-insensitive comparison but do not normalize stored data. | Normalize to NFC before storage. Do not rely on collation to fix normalization differences.
SQLite | Stores raw bytes. No built-in normalization. Comparison is binary unless you load an ICU extension. | Normalize to NFC in your application before any write or comparison.

Migration Strategy: Adding Normalization to Existing Projects

If you have an existing C project with no normalization, adding utf8proc takes four steps:

  1. Add the dependency: utf8proc is available as a system package on most Linux distributions (libutf8proc-dev), via Homebrew on macOS (brew install utf8proc), or as a vendored single-file include.
  2. Identify input boundaries: find every point where external text enters your system (file reads, API responses, CLI arguments, database reads). These are your normalization points.
  3. Add normalization calls: wrap incoming text with your normalize_nfc() function at each boundary. Return the normalized copy and free the original if needed.
  4. Migrate stored data: if your database already contains mixed-form text, run a one-time migration that reads each text value, normalizes it to NFC, and writes it back. For PostgreSQL, you can do this with a custom function or a script that reads rows and updates them.

What to Watch Out For

Pitfall | Why it happens | Fix
Comparing raw UTF-8 bytes directly | Equivalent visible text can have different code-point sequences | Normalize both strings to NFC first
Normalizing some inputs but not others | Mixed-form data spreads across the system | Normalize at every input boundary
Generating slugs before normalization | Accented characters may decompose differently | Normalize first, then generate the slug
Confusing NFC with NFKC | Compatibility normalization can change meaning more aggressively | Use NFC unless you specifically need compatibility folding
Forgetting to free the result | utf8proc_map allocates a new buffer | Always free() the returned pointer when done

When Not to Use NFC

Do not assume NFC is the answer when you specifically need decomposition for analysis. Accent stripping and combining-mark inspection often start with NFD or NFKD. For general storage and comparison, though, NFC remains the safer default.

FAQ

Should I normalize all UTF-8 text to NFC in my C project?

For storage, comparison, APIs, and search-oriented text handling, yes. It is the safest default in most application pipelines.

Does utf8proc solve hidden character problems too?

It helps with normalization, but hidden characters such as zero-width spaces or BOM markers may still need explicit inspection and cleanup. See the hidden Unicode characters guide for detection and stripping patterns.
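
A minimal cleanup pass might look like the sketch below. It is plain C with no library calls, the name strip_hidden is hypothetical, and it assumes you only want to remove U+200B (zero-width space) and U+FEFF (BOM). Other invisible characters, such as the zero-width joiner, carry meaning in some scripts and are deliberately left alone.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: removes every U+200B and U+FEFF from a UTF-8
   string in place and returns the new length in bytes. */
size_t strip_hidden(char *s) {
    static const char ZWSP[] = "\xE2\x80\x8B";  /* U+200B */
    static const char BOM[]  = "\xEF\xBB\xBF";  /* U+FEFF */
    char *src = s;
    char *dst = s;

    while (*src != '\0') {
        /* strncmp stops at the terminator, so this never reads past it. */
        if (strncmp(src, ZWSP, 3) == 0 || strncmp(src, BOM, 3) == 0) {
            src += 3;
            continue;
        }
        *dst++ = *src++;
    }
    *dst = '\0';
    return (size_t)(dst - s);
}
```

Running a pass like this before normalization keeps the two concerns separate: cleanup decides which characters survive, and NFC decides how the survivors are encoded.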

Can normalization affect slugs?

Yes. Skipping normalization can produce inconsistent slugs for accented titles that look identical to users.

What is the difference between NFC and NFKC in utf8proc?

NFC composes characters to their precomposed form. NFKC does the same but also replaces compatibility characters (like ligatures, superscripts, and fullwidth forms) with their standard equivalents. Use NFKC when you want compatibility variants folded together, such as for search indexing. Use NFC when you want to preserve visual distinctions.

How large is utf8proc compared to ICU?

utf8proc is roughly 300 KB including its Unicode data tables. ICU can be 30 MB or more. If you only need normalization and basic Unicode properties, utf8proc is the lighter choice by two orders of magnitude.

Does utf8proc support the latest Unicode version?

utf8proc tracks Unicode releases, but there can be a lag. Call utf8proc_unicode_version() at runtime to confirm which Unicode version your build supports. For most normalization work, version differences rarely affect results.
