Quick Answer
Hidden Unicode characters such as zero-width spaces, BOM markers, and combining marks can break matching, inflate counts, corrupt slugs, and make copied text behave strangely. If text looks normal but compares wrong or counts wrong, inspect it for hidden characters first.
What Counts as a Hidden Character
- Zero-width space: invisible separator that can split tokens
- Zero-width joiner or non-joiner: invisible characters that change shaping or add hidden length
- Byte order mark (BOM): hidden marker that can appear at the start of text or files
- Combining marks: accents stored separately from base characters in decomposed text
- Soft hyphen: invisible break hint that some renderers display and others ignore
- Word joiner: invisible no-break marker that prevents line breaks without adding visible width
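The practical impact of the characters above is easy to demonstrate: a string containing a zero-width space renders exactly like its plain counterpart, yet fails equality and length checks. A minimal Python sketch:

```python
# Two visually identical strings: one contains a zero-width space (U+200B).
plain = "hello world"
sneaky = "hello\u200b world"

# They render the same but are different strings with different lengths.
print(plain == sneaky)           # False
print(len(plain), len(sneaky))   # 11 12
```

This is the root cause behind most of the symptoms discussed below: the stored text and the visible text are not the same string.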
Common Hidden Characters Reference
| Character | Code Point | Name | Typical Source |
|---|---|---|---|
| (invisible) | U+200B | Zero-Width Space | Web editors, CMS platforms, copy-paste from browsers |
| (invisible) | U+FEFF | Byte Order Mark (BOM) | Windows Notepad, UTF-8 files saved with BOM |
| (invisible) | U+200C | Zero-Width Non-Joiner (ZWNJ) | Persian/Arabic text, web content |
| (invisible) | U+200D | Zero-Width Joiner (ZWJ) | Emoji sequences, Indic scripts, web editors |
| (invisible) | U+00AD | Soft Hyphen | Word processors, HTML entities, CMS auto-hyphenation |
| (invisible) | U+2060 | Word Joiner | Typesetting software, rich text editors |
| (invisible) | U+200E | Left-to-Right Mark | Bidirectional text, mixed-language documents |
| (invisible) | U+200F | Right-to-Left Mark | Arabic/Hebrew text processing |
Signs You Have Hidden Unicode Problems
| Symptom | Possible cause |
|---|---|
| Word count looks too high or too low | Invisible separators or zero-width characters |
| String equality fails for identical-looking text | Mixed normalization forms or hidden marks |
| Slug output changes between systems | Hidden characters or mixed Unicode forms in the title |
| Regex or search misses terms unexpectedly | Invisible characters interrupt matching |
| JSON or XML parsing fails on clean-looking input | BOM at file start or zero-width chars inside keys |
| Cursor jumps oddly when arrowing through text | Zero-width characters occupy cursor positions without rendering |
How to Detect Them
- Paste the text into the Text Analysis Tool and compare visible length with character counts.
- If counts seem wrong, normalize the text and inspect suspicious boundaries.
- Clean copied formatting and line artifacts with Remove Line Breaks if the text came from PDFs, emails, or rich editors.
- Review Unicode normalization if the issue involves accents or visually identical characters.
Detection with a Hex Dump
The fastest way to confirm hidden characters is a hex dump. On Linux or macOS, pipe your text through xxd or od and look for byte sequences that correspond to zero-width code points. For example, U+200B encodes as E2 80 8B in UTF-8, and U+FEFF encodes as EF BB BF.
```shell
# Show hex bytes of a file
xxd suspicious.txt | head -20

# Search for zero-width space (E2 80 8B) in a file
# (-P requires GNU grep; BSD/macOS grep does not support it)
grep -P '\xe2\x80\x8b' suspicious.txt
```
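If a hex dump tool is not at hand, Python can confirm the same UTF-8 byte sequences directly:

```python
# Verify the UTF-8 encodings quoted above.
print("\u200b".encode("utf-8").hex(" "))  # e2 80 8b  (zero-width space)
print("\ufeff".encode("utf-8").hex(" "))  # ef bb bf  (BOM)
```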
Programmatic Detection in JavaScript
```javascript
// Detect common hidden Unicode characters
function findHiddenChars(str) {
  const hidden = /[\u200B\u200C\u200D\u200E\u200F\uFEFF\u00AD\u2060\u2028\u2029]/g;
  const matches = [];
  let match;
  while ((match = hidden.exec(str)) !== null) {
    matches.push({
      char: match[0],
      codePoint: 'U+' + match[0].charCodeAt(0).toString(16).toUpperCase().padStart(4, '0'),
      position: match.index
    });
  }
  return matches;
}

// Example usage
const text = "hello\u200Bworld";
console.log(findHiddenChars(text));
// [{char: '\u200B', codePoint: 'U+200B', position: 5}]
```
Programmatic Detection in Python
```python
import re

HIDDEN_PATTERN = re.compile(
    '[\u200b\u200c\u200d\u200e\u200f\ufeff\u00ad\u2060\u2028\u2029]'
)

def find_hidden_chars(text):
    results = []
    for match in HIDDEN_PATTERN.finditer(text):
        results.append({
            'codepoint': f'U+{ord(match.group()):04X}',
            'position': match.start(),
        })
    return results

# Example
text = "hello\u200bworld"
print(find_hidden_chars(text))
# [{'codepoint': 'U+200B', 'position': 5}]
```
Automated Stripping Patterns
Once you have confirmed which hidden characters are present, you can strip them. The safest approach is to target specific code points rather than broad Unicode categories, since some zero-width characters are intentional in certain scripts.
JavaScript
```javascript
// Remove common zero-width and invisible characters
function stripHiddenChars(str) {
  return str.replace(/[\u200B\u200C\u200D\uFEFF\u00AD\u2060]/g, '');
}

// Remove BOM from start of string only
function stripBOM(str) {
  return str.replace(/^\uFEFF/, '');
}
```
Python
```python
import re

def strip_hidden_chars(text):
    return re.sub('[\u200b\u200c\u200d\ufeff\u00ad\u2060]', '', text)

def strip_bom(text):
    return text.lstrip('\ufeff')
```
Homoglyph Attacks and Security Implications
Hidden and look-alike characters create real security risks. Homoglyphs are visually identical characters from different Unicode blocks. An attacker can register a domain using Cyrillic a (U+0430) instead of Latin a (U+0061), creating a domain that looks legitimate but points elsewhere. The same technique works in usernames, email addresses, and file names.
Zero-width characters add another attack surface. A username containing a zero-width space looks identical to one without, but they are distinct strings in a database. This can enable impersonation or bypass uniqueness constraints. Always normalize and strip invisible characters from identity-sensitive fields before storage and comparison.
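A short Python illustration of the homoglyph problem (the domain is a made-up example); the standard library's unicodedata module reveals the difference the eye cannot see:

```python
import unicodedata

latin = "apple.com"           # all Latin letters
spoofed = "\u0430pple.com"    # first letter is Cyrillic a (U+0430)

print(latin == spoofed)               # False, despite looking identical
print(unicodedata.name(latin[0]))     # LATIN SMALL LETTER A
print(unicodedata.name(spoofed[0]))   # CYRILLIC SMALL LETTER A
```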
Real-World Scenarios Where Hidden Characters Cause Bugs
- Copy from Microsoft Word: Word inserts soft hyphens, non-breaking spaces (U+00A0), and sometimes zero-width characters. Pasting into code editors or CMS fields carries these along invisibly.
- Web scraping: HTML source may contain `&shy;`, `&zwj;`, or BOM markers. If your scraper does not strip these, downstream text processing breaks.
- API responses: JSON payloads from third-party APIs sometimes include a BOM at the start of the response body. JSON parsers that do not tolerate a leading BOM will fail with a parse error on otherwise valid JSON.
- CSV imports: Files saved as "UTF-8 with BOM" in Excel add three invisible bytes at the start. The first column header gains an invisible prefix, causing header-based lookups to fail.
- Git diffs and CI: Hidden characters in source code can pass code review undetected. They may cause string comparisons to fail only on certain platforms or produce inconsistent test results.
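The CSV scenario is easy to reproduce in Python; the `utf-8-sig` codec exists specifically to strip a leading BOM (the sample data below is invented):

```python
import csv
import io

# Simulate a CSV exported as "UTF-8 with BOM" from Excel.
raw = b"\xef\xbb\xbfname,email\nAda,ada@example.com\n"

# Decoding as plain UTF-8 leaves the BOM glued to the first header.
bad_header = raw.decode("utf-8").splitlines()[0].split(",")[0]
print(repr(bad_header))  # '\ufeffname'

# The utf-8-sig codec strips a leading BOM automatically.
rows = list(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))
print(rows[0]["name"])  # Ada
```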
When to Normalize and When to Strip
Normalize when the problem is inconsistent Unicode representation, such as NFC vs NFD. Strip characters only when they are truly unwanted in your workflow, such as zero-width spaces in copied content or BOM markers in imported text.
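The difference between the two fixes can be shown with unicodedata.normalize: normalization makes equivalent representations compare equal, while stripping removes characters outright.

```python
import unicodedata

nfc = "caf\u00e9"     # e-acute as one composed code point (NFC)
nfd = "cafe\u0301"    # e followed by a combining acute accent (NFD)

print(nfc == nfd)  # False: same rendering, different code points
print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalizing
```

Note that neither normalization form removes zero-width characters; those still require explicit stripping.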
Common Mistakes
- Assuming visible text equals stored text
- Generating slugs before cleaning the title
- Comparing raw strings without normalization
- Blaming regex first when the input contains invisible characters
- Stripping all zero-width characters globally without considering scripts that require them (Persian, Indic languages)
FAQ
Can hidden characters affect SEO and slugs?
Yes. They can create unstable slug output, odd token boundaries, and inconsistent matching. A zero-width space in a page title can produce a slug with an invisible segment that breaks URL resolution on some servers.
Is a BOM always bad?
No, but it is often unwanted in application text, config payloads, and copy-paste workflows where hidden leading characters can break parsing. Some Windows tools expect a BOM in UTF-8 files, but most modern software does not require it.
Are combining accents hidden characters?
They are not invisible in the same way as zero-width characters, but they can still create hidden structural differences in text when stored separately.
How do zero-width characters get into my text?
The most common sources are copy-paste from web pages, word processors, CMS editors, and PDF extraction. Some rich text editors insert zero-width spaces to manage cursor positioning or word breaking internally.
Can I safely strip all hidden characters?
Not always. Zero-width joiners and non-joiners are meaningful in Persian, Arabic, and Indic scripts where they control ligature behavior. Stripping them in multilingual text can break correct rendering. Target only the specific characters that are problematic in your context.
Why does my JSON parsing fail on valid-looking data?
A common cause is a BOM (U+FEFF) at the start of the response body. JSON parsers expect the first character to be { or [, and the invisible BOM byte sequence causes a parse error. Strip the BOM before parsing.
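This failure mode is straightforward to reproduce with Python's json module:

```python
import json

body = '\ufeff{"status": "ok"}'  # response body with a leading BOM

try:
    json.loads(body)
except json.JSONDecodeError:
    print("parse failed on the BOM")

# Stripping the BOM first makes the payload parse normally.
print(json.loads(body.lstrip("\ufeff"))["status"])  # ok
```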
Do hidden characters affect password or username validation?
Yes. A username with a zero-width space can pass visual inspection but fail exact-match lookups, or worse, allow two accounts that look identical. Always normalize and strip invisible characters from authentication-related fields.
Related Tools
- Text Analysis Tool to inspect suspicious character counts
- URL Slug Generator to create stable slugs after cleanup
- URL Encoder / Decoder to inspect encoded characters in URLs
- Regex Tester Online to test patterns for detecting hidden characters
Related Guides
- NFC vs NFD: Unicode Normalization Explained for understanding composed vs decomposed forms
- utf8proc NFC Normalization in C for normalizing text in C projects
- Regex Debugging Guide for troubleshooting patterns that fail on hidden characters