FORMATFORGE // KNOWLEDGE_BASE

How to Detect Hidden Unicode Characters in Text

Runs locally in your browser | Updated: April 2026 | No data upload required

Quick Answer

Hidden Unicode characters such as zero-width spaces, byte order marks (BOMs), and combining marks can break string matching, inflate character and word counts, corrupt slugs, and make copied text behave strangely. If text looks normal but compares or counts incorrectly, inspect it for hidden characters first.

What Counts as a Hidden Character

Common Hidden Characters Reference

Character    Code Point  Name                          Typical Source
(invisible)  U+200B      Zero-Width Space              Web editors, CMS platforms, copy-paste from browsers
(invisible)  U+FEFF      Byte Order Mark (BOM)         Windows Notepad, UTF-8 files saved with BOM
(invisible)  U+200C      Zero-Width Non-Joiner (ZWNJ)  Persian/Arabic text, web content
(invisible)  U+200D      Zero-Width Joiner (ZWJ)       Emoji sequences, Indic scripts, web editors
(invisible)  U+00AD      Soft Hyphen                   Word processors, HTML entities, CMS auto-hyphenation
(invisible)  U+2060      Word Joiner                   Typesetting software, rich text editors
(invisible)  U+200E      Left-to-Right Mark            Bidirectional text, mixed-language documents
(invisible)  U+200F      Right-to-Left Mark            Arabic/Hebrew text processing

Signs You Have Hidden Unicode Problems

Symptom                                           Possible cause
Word count looks too high or too low              Invisible separators or zero-width characters
String equality fails for identical-looking text  Mixed normalization forms or hidden marks
Slug output changes between systems               Hidden characters or mixed Unicode forms in the title
Regex or search misses terms unexpectedly         Invisible characters interrupt matching
JSON or XML parsing fails on clean-looking input  BOM at file start or zero-width chars inside keys
Cursor jumps oddly when arrowing through text     Zero-width characters occupy cursor positions without rendering

How to Detect Them

  1. Paste the text into the Text Analysis Tool and compare visible length with character counts.
  2. If counts seem wrong, normalize the text and inspect suspicious boundaries.
  3. Clean copied formatting and line artifacts with Remove Line Breaks if the text came from PDFs, emails, or rich editors.
  4. Review Unicode normalization if the issue involves accents or visually identical characters.
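
The first two steps can also be approximated in code. The sketch below is in Python; the function name inspect_text is hypothetical, and treating every character in the Unicode "Cf" (format) category as hidden is an assumption. That category covers all eight characters in the reference table above, but it is broader than that list.

```python
import unicodedata

def inspect_text(text):
    """Report code-point count, visible count, and each format (Cf) character."""
    hidden = [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"  # assumption: Cf means "hidden"
    ]
    return {
        "code_points": len(text),
        "visible": len(text) - len(hidden),
        "hidden": hidden,
    }

report = inspect_text("hello\u200bworld")
print(report["code_points"], report["visible"])  # 11 10
print(report["hidden"])  # [(5, 'U+200B', 'ZERO WIDTH SPACE')]
```

A gap between the two counts is the signal to look closer; the positions in the hidden list show where to inspect.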

Detection with a Hex Dump

The fastest way to confirm hidden characters is a hex dump. On Linux or macOS, pipe your text through xxd or od and look for byte sequences that correspond to zero-width code points. For example, U+200B encodes as E2 80 8B in UTF-8, and U+FEFF encodes as EF BB BF.

# Show hex bytes of a file
xxd suspicious.txt | head -20

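# Search for zero-width space (bytes E2 80 8B) in a file.
# Requires GNU grep (-P is not available in macOS's BSD grep). LC_ALL=C makes
# the \x escapes match raw bytes instead of code points.
LC_ALL=C grep -aP '\xe2\x80\x8b' suspicious.txt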
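
If you are unsure which byte sequence to look for, the UTF-8 encoding of any code point can be computed rather than memorized. This is a small Python sketch (3.8 or later, for the separator argument to bytes.hex); the helper name utf8_hex is hypothetical, and the code points listed are simply the ones discussed in this guide.

```python
def utf8_hex(ch):
    """Return the UTF-8 encoding of one character as spaced uppercase hex."""
    return ch.encode("utf-8").hex(" ").upper()

# Print the byte pattern to search for in a hex dump.
for cp in (0x200B, 0x200C, 0x200D, 0xFEFF, 0x00AD, 0x2060):
    print(f"U+{cp:04X} -> {utf8_hex(chr(cp))}")
# U+200B -> E2 80 8B
# U+200C -> E2 80 8C
# U+200D -> E2 80 8D
# U+FEFF -> EF BB BF
# U+00AD -> C2 AD
# U+2060 -> E2 81 A0
```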

Programmatic Detection in JavaScript

// Detect common hidden Unicode characters
function findHiddenChars(str) {
  const hidden = /[\u200B\u200C\u200D\u200E\u200F\uFEFF\u00AD\u2060\u2028\u2029]/g;
  const matches = [];
  let match;
  while ((match = hidden.exec(str)) !== null) {
    matches.push({
      char: match[0],
      codePoint: 'U+' + match[0].charCodeAt(0).toString(16).toUpperCase().padStart(4, '0'),
      position: match.index
    });
  }
  return matches;
}

// Example usage
const text = "hello\u200Bworld";
console.log(findHiddenChars(text));
// [{char: '\u200B', codePoint: 'U+200B', position: 5}]
// (the char value renders as an empty-looking string in most consoles)

Programmatic Detection in Python

import re

HIDDEN_PATTERN = re.compile(
    '[\u200b\u200c\u200d\u200e\u200f\ufeff\u00ad\u2060\u2028\u2029]'
)

def find_hidden_chars(text):
    results = []
    for match in HIDDEN_PATTERN.finditer(text):
        results.append({
            'codepoint': f'U+{ord(match.group()):04X}',
            'position': match.start(),
        })
    return results

# Example
text = "hello\u200bworld"
print(find_hidden_chars(text))
# [{'codepoint': 'U+200B', 'position': 5}]

Automated Stripping Patterns

Once you have confirmed which hidden characters are present, you can strip them. The safest approach is to target specific code points rather than broad Unicode categories, since some zero-width characters are intentional in certain scripts.

JavaScript

// Remove common zero-width and invisible characters
function stripHiddenChars(str) {
  return str.replace(/[\u200B\u200C\u200D\uFEFF\u00AD\u2060]/g, '');
}

// Remove BOM from start of string only
function stripBOM(str) {
  return str.replace(/^\uFEFF/, '');
}

Python

import re

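def strip_hidden_chars(text):
    # Remove common zero-width and invisible characters anywhere in the text
    return re.sub('[\u200b\u200c\u200d\ufeff\u00ad\u2060]', '', text)

def strip_bom(text):
    # Remove a single BOM from the start of the string only
    # (lstrip would also remove any repeated leading BOMs)
    return text[1:] if text.startswith('\ufeff') else text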

Homoglyph Attacks and Security Implications

Hidden and look-alike characters create real security risks. Homoglyphs are visually identical characters from different Unicode blocks. An attacker can register a domain using Cyrillic a (U+0430) instead of Latin a (U+0061), creating a domain that looks legitimate but points elsewhere. The same technique works in usernames, email addresses, and file names.

Zero-width characters add another attack surface. A username containing a zero-width space looks identical to one without, but they are distinct strings in a database. This can enable impersonation or bypass uniqueness constraints. Always normalize and strip invisible characters from identity-sensitive fields before storage and comparison.
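
One way to apply this to identity fields is to store a comparison key alongside the display name. The Python sketch below is illustrative only: the names identity_key and scripts_used are hypothetical, the choice of NFKC plus casefold is an assumption rather than a universal rule, and the script check relies on the first word of each character's Unicode name, which is a crude heuristic. Note that normalization does not map Cyrillic letters to Latin ones, so homoglyphs need a separate check.

```python
import unicodedata

# Invisible code points from the reference table, deleted before comparison.
INVISIBLE = dict.fromkeys(
    map(ord, "\u200b\u200c\u200d\u200e\u200f\ufeff\u00ad\u2060")
)

def identity_key(name):
    """Build a comparison key: strip invisibles, NFKC-normalize, casefold."""
    stripped = name.translate(INVISIBLE)
    return unicodedata.normalize("NFKC", stripped).casefold()

def scripts_used(name):
    """Crude script check: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in name
        if ch.isalpha()
    }

print(identity_key("ad\u200bmin") == identity_key("Admin"))  # True
print(sorted(scripts_used("p\u0430ypal")))  # ['CYRILLIC', 'LATIN']
```

Enforcing uniqueness on the key, and flagging names that mix scripts for review, addresses both attack surfaces described above without altering what the user sees.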

Real-World Scenarios Where Hidden Characters Cause Bugs

When to Normalize and When to Strip

Normalize when the problem is inconsistent Unicode representation, such as the same accented letter stored in NFC (precomposed) form in one place and NFD (decomposed) form in another. Strip characters only when they are truly unwanted in your workflow, such as zero-width spaces in copied content or a BOM at the start of imported text.
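
A short Python illustration of the normalization case: the two strings below render identically but differ in code points, so equality fails until both are normalized to the same form. The helper name same_text is hypothetical, and choosing NFC as the target form is an assumption; NFD would work equally well as long as both sides use it.

```python
import unicodedata

composed = "caf\u00e9"     # e-acute as a single code point (NFC form)
decomposed = "cafe\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT (NFD form)

print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(composed, decomposed))  # True
```

Nothing is removed here. Stripping the combining accent instead would change the word, which is why this case calls for normalization rather than deletion.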

Common Mistakes

FAQ

Can hidden characters affect SEO and slugs?

Yes. They can create unstable slug output, odd token boundaries, and inconsistent matching. A zero-width space in a page title can produce a slug with an invisible segment that breaks URL resolution on some servers.
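
To make that concrete, here is a minimal Python sketch of a slug function that strips hidden characters before building the slug. The name slugify and the ASCII-only slug rule are assumptions for illustration, not a description of any particular CMS.

```python
import re
import unicodedata

HIDDEN = re.compile("[\u200b\u200c\u200d\u200e\u200f\ufeff\u00ad\u2060]")

def slugify(title):
    """Strip hidden characters, normalize to NFC, then build a hyphenated slug."""
    cleaned = unicodedata.normalize("NFC", HIDDEN.sub("", title))
    # Collapse every run of characters outside a-z and 0-9 into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", cleaned.lower()).strip("-")

print(slugify("Pre\u200bview Notes"))  # preview-notes
print(slugify("Preview Notes"))        # preview-notes
```

Without the stripping step, the zero-width space inside "Preview" would be treated as a separator and the first title would become pre-view-notes, a different URL for a title that looks the same.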

Is a BOM always bad?

No, but it is often unwanted in application text, config payloads, and copy-paste workflows where hidden leading characters can break parsing. Some Windows tools expect a BOM in UTF-8 files, but most modern software does not require it.

Are combining accents hidden characters?

They are not invisible in the same way as zero-width characters, but they can still create hidden structural differences in text when stored separately.

How do zero-width characters get into my text?

The most common sources are copy-paste from web pages, word processors, CMS editors, and PDF extraction. Some rich text editors insert zero-width spaces to manage cursor positioning or word breaking internally.

Can I safely strip all hidden characters?

Not always. Zero-width joiners and non-joiners are meaningful in Persian, Arabic, and Indic scripts where they control ligature behavior. Stripping them in multilingual text can break correct rendering. Target only the specific characters that are problematic in your context.
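
A sketch of that targeted approach in Python: joiners are kept by default and removed only on request. The function name strip_hidden, the keep_joiners parameter, and the split between the two character sets are all assumptions; adjust the sets to your own content.

```python
# Assumed to be rarely intentional in plain application text.
ALWAYS_STRIP = "\u200b\ufeff\u00ad\u2060"
# Joiners that carry meaning in Persian, Arabic, Indic scripts, and emoji.
JOINERS = "\u200c\u200d"

def strip_hidden(text, keep_joiners=True):
    """Remove hidden characters, keeping ZWJ/ZWNJ unless told otherwise."""
    targets = ALWAYS_STRIP if keep_joiners else ALWAYS_STRIP + JOINERS
    return text.translate(dict.fromkeys(map(ord, targets)))

# A Persian word written with a ZWNJ (U+200C) after its second letter.
persian = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"
sample = "\u200b" + persian  # a stray zero-width space pasted in front

print(strip_hidden(sample) == persian)                       # True
print("\u200c" in strip_hidden(sample, keep_joiners=False))  # False
```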

Why does my JSON parsing fail on valid-looking data?

A common cause is a BOM (U+FEFF) at the start of the response body. A strict JSON parser expects the input to begin with a JSON value (most often { or [), optionally preceded by whitespace, and U+FEFF is not JSON whitespace, so a leading BOM causes a parse error. Strip the BOM before parsing.
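
In Python, for example, decoding the bytes with the utf-8-sig codec removes a leading BOM before the parser sees it. This is a sketch under the assumption that the payload is UTF-8; the helper name load_json_bytes is hypothetical.

```python
import json

raw = b'\xef\xbb\xbf{"ok": true}'  # UTF-8 BOM followed by a JSON object

def load_json_bytes(data):
    """Decode with utf-8-sig, which drops one leading BOM if present."""
    return json.loads(data.decode("utf-8-sig"))

try:
    json.loads(raw.decode("utf-8"))  # plain utf-8 keeps the BOM as U+FEFF
except json.JSONDecodeError as exc:
    print("parse failed:", exc)

print(load_json_bytes(raw))  # {'ok': True}
```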

Do hidden characters affect password or username validation?

Yes. A username with a zero-width space can pass visual inspection but fail exact-match lookups, or worse, allow two accounts that look identical. Always normalize and strip invisible characters from authentication-related fields.
