Quick Answer
Hidden Unicode characters such as zero-width spaces, BOM markers, and combining marks can break matching, inflate counts, corrupt slugs, and make copied text behave strangely. If text looks normal but compares wrong or counts wrong, inspect it for hidden characters first.
What Counts as a Hidden Character
- Zero-width space: invisible separator that can split tokens
- Zero-width joiner or non-joiner: invisible characters that change shaping or add hidden length
- Byte order mark (BOM): hidden marker that can appear at the start of text or files
- Combining marks: accents stored separately from base characters in decomposed text
- Soft hyphen: invisible break hint that some renderers display and others ignore
- Word joiner: invisible no-break marker that prevents line breaks without adding visible width
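The practical impact of the characters above is easy to demonstrate: a string containing a zero-width space renders exactly like its plain counterpart, yet fails equality and length checks. A minimal Python sketch:

```python
# Two visually identical strings: one contains a zero-width space (U+200B).
plain = "hello world"
sneaky = "hello\u200b world"

# They render the same but are different strings with different lengths.
print(plain == sneaky)           # False
print(len(plain), len(sneaky))   # 11 12
```

This is the root cause behind most of the symptoms discussed below: the stored text and the visible text are not the same string.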
Common Hidden Characters Reference
| Character | Code Point | Name | Typical Source |
|---|---|---|---|
| (invisible) | U+200B | Zero-Width Space | Web editors, CMS platforms, copy-paste from browsers |
| (invisible) | U+FEFF | Byte Order Mark (BOM) | Windows Notepad, UTF-8 files saved with BOM |
| (invisible) | U+200C | Zero-Width Non-Joiner (ZWNJ) | Persian/Arabic text, web content |
| (invisible) | U+200D | Zero-Width Joiner (ZWJ) | Emoji sequences, Indic scripts, web editors |
| (invisible) | U+00AD | Soft Hyphen | Word processors, HTML entities, CMS auto-hyphenation |
| (invisible) | U+2060 | Word Joiner | Typesetting software, rich text editors |
| (invisible) | U+200E | Left-to-Right Mark | Bidirectional text, mixed-language documents |
| (invisible) | U+200F | Right-to-Left Mark | Arabic/Hebrew text processing |
Signs You Have Hidden Unicode Problems
| Symptom | Possible cause |
|---|---|
| Word count looks too high or too low | Invisible separators or zero-width characters |
| String equality fails for identical-looking text | Mixed normalization forms or hidden marks |
| Slug output changes between systems | Hidden characters or mixed Unicode forms in the title |
| Regex or search misses terms unexpectedly | Invisible characters interrupt matching |
| JSON or XML parsing fails on clean-looking input | BOM at file start or zero-width chars inside keys |
| Cursor jumps oddly when arrowing through text | Zero-width characters occupy cursor positions without rendering |
How to Detect Them
- Paste the text into the Text Analysis Tool and compare visible length with character counts.
- If counts seem wrong, normalize the text and inspect suspicious boundaries.
- Clean copied formatting and line artifacts with Remove Line Breaks if the text came from PDFs, emails, or rich editors.
- Review Unicode normalization if the issue involves accents or visually identical characters.
Detection with a Hex Dump
The fastest way to confirm hidden characters is a hex dump. On Linux or macOS, pipe your text through xxd or od and look for byte sequences that correspond to zero-width code points. For example, U+200B encodes as E2 80 8B in UTF-8, and U+FEFF encodes as EF BB BF.
```shell
# Show hex bytes of a file
xxd suspicious.txt | head -20

# Search for zero-width space (E2 80 8B) in a file
# (-P requires GNU grep; BSD/macOS grep does not support it)
grep -P '\xe2\x80\x8b' suspicious.txt
```
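If a hex dump tool is not at hand, Python can confirm the same UTF-8 byte sequences directly:

```python
# Verify the UTF-8 encodings quoted above.
print("\u200b".encode("utf-8").hex(" "))  # e2 80 8b  (zero-width space)
print("\ufeff".encode("utf-8").hex(" "))  # ef bb bf  (BOM)
```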
Programmatic Detection in JavaScript
```javascript
// Detect common hidden Unicode characters
function findHiddenChars(str) {
  const hidden = /[\u200B\u200C\u200D\u200E\u200F\uFEFF\u00AD\u2060\u2028\u2029]/g;
  const matches = [];
  let match;
  while ((match = hidden.exec(str)) !== null) {
    matches.push({
      char: match[0],
      codePoint: 'U+' + match[0].charCodeAt(0).toString(16).toUpperCase().padStart(4, '0'),
      position: match.index
    });
  }
  return matches;
}

// Example usage
const text = "hello\u200Bworld";
console.log(findHiddenChars(text));
// [{char: '\u200B', codePoint: 'U+200B', position: 5}]
```
Programmatic Detection in Python
```python
import re

HIDDEN_PATTERN = re.compile(
    '[\u200b\u200c\u200d\u200e\u200f\ufeff\u00ad\u2060\u2028\u2029]'
)

def find_hidden_chars(text):
    results = []
    for match in HIDDEN_PATTERN.finditer(text):
        results.append({
            'codepoint': f'U+{ord(match.group()):04X}',
            'position': match.start(),
        })
    return results

# Example
text = "hello\u200bworld"
print(find_hidden_chars(text))
# [{'codepoint': 'U+200B', 'position': 5}]
```
Automated Stripping Patterns
Once you have confirmed which hidden characters are present, you can strip them. The safest approach is to target specific code points rather than broad Unicode categories, since some zero-width characters are intentional in certain scripts.
JavaScript
```javascript
// Remove common zero-width and invisible characters
function stripHiddenChars(str) {
  return str.replace(/[\u200B\u200C\u200D\uFEFF\u00AD\u2060]/g, '');
}

// Remove BOM from start of string only
function stripBOM(str) {
  return str.replace(/^\uFEFF/, '');
}
```
Python
```python
import re

def strip_hidden_chars(text):
    return re.sub('[\u200b\u200c\u200d\ufeff\u00ad\u2060]', '', text)

def strip_bom(text):
    return text.lstrip('\ufeff')
```
Homoglyph Attacks and Security Implications
Hidden and look-alike characters create real security risks. Homoglyphs are visually identical characters from different Unicode blocks. An attacker can register a domain using Cyrillic a (U+0430) instead of Latin a (U+0061), creating a domain that looks legitimate but points elsewhere. The same technique works in usernames, email addresses, and file names.
Zero-width characters add another attack surface. A username containing a zero-width space looks identical to one without, but they are distinct strings in a database. This can enable impersonation or bypass uniqueness constraints. Always normalize and strip invisible characters from identity-sensitive fields before storage and comparison.
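A short Python illustration of the homoglyph problem (the domain is a made-up example); the standard library's unicodedata module reveals the difference the eye cannot see:

```python
import unicodedata

latin = "apple.com"           # all Latin letters
spoofed = "\u0430pple.com"    # first letter is Cyrillic a (U+0430)

print(latin == spoofed)               # False, despite looking identical
print(unicodedata.name(latin[0]))     # LATIN SMALL LETTER A
print(unicodedata.name(spoofed[0]))   # CYRILLIC SMALL LETTER A
```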
Real-World Scenarios Where Hidden Characters Cause Bugs
- Copy from Microsoft Word: Word inserts soft hyphens, non-breaking spaces (U+00A0), and sometimes zero-width characters. Pasting into code editors or CMS fields carries these along invisibly.
- Web scraping: HTML source may contain `&shy;`, `&zwj;`, or BOM markers. If your scraper does not strip these, downstream text processing breaks.
- API responses: JSON payloads from third-party APIs sometimes include a BOM at the start of the response body. JSON parsers that do not tolerate a leading BOM will fail with a parse error on otherwise valid JSON.
- CSV imports: Files saved as "UTF-8 with BOM" in Excel add three invisible bytes at the start. The first column header gains an invisible prefix, causing header-based lookups to fail.
- Git diffs and CI: Hidden characters in source code can pass code review undetected. They may cause string comparisons to fail only on certain platforms or produce inconsistent test results.
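The CSV scenario is easy to reproduce in Python; the `utf-8-sig` codec exists specifically to strip a leading BOM (the sample data below is invented):

```python
import csv
import io

# Simulate a CSV exported as "UTF-8 with BOM" from Excel.
raw = b"\xef\xbb\xbfname,email\nAda,ada@example.com\n"

# Decoding as plain UTF-8 leaves the BOM glued to the first header.
bad_header = raw.decode("utf-8").splitlines()[0].split(",")[0]
print(repr(bad_header))  # '\ufeffname'

# The utf-8-sig codec strips a leading BOM automatically.
rows = list(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))
print(rows[0]["name"])  # Ada
```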
When to Normalize and When to Strip
Normalize when the problem is inconsistent Unicode representation, such as NFC vs NFD. Strip characters only when they are truly unwanted in your workflow, such as zero-width spaces in copied content or BOM markers in imported text.
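The difference between the two fixes can be shown with unicodedata.normalize: normalization makes equivalent representations compare equal, while stripping removes characters outright.

```python
import unicodedata

nfc = "caf\u00e9"     # e-acute as one composed code point (NFC)
nfd = "cafe\u0301"    # e followed by a combining acute accent (NFD)

print(nfc == nfd)  # False: same rendering, different code points
print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalizing
```

Note that neither normalization form removes zero-width characters; those still require explicit stripping.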
Common Mistakes
- Assuming visible text equals stored text
- Generating slugs before cleaning the title
- Comparing raw strings without normalization
- Blaming regex first when the input contains invisible characters
- Stripping all zero-width characters globally without considering scripts that require them (Persian, Indic languages)
FAQ
Can hidden characters affect SEO and slugs?
Yes. They can create unstable slug output, odd token boundaries, and inconsistent matching. A zero-width space in a page title can produce a slug with an invisible segment that breaks URL resolution on some servers.
Is a BOM always bad?
No, but it is often unwanted in application text, config payloads, and copy-paste workflows where hidden leading characters can break parsing. Some Windows tools expect a BOM in UTF-8 files, but most modern software does not require it.
Are combining accents hidden characters?
They are not invisible in the same way as zero-width characters, but they can still create hidden structural differences in text when stored separately.
How do zero-width characters get into my text?
The most common sources are copy-paste from web pages, word processors, CMS editors, and PDF extraction. Some rich text editors insert zero-width spaces to manage cursor positioning or word breaking internally.
Can I safely strip all hidden characters?
Not always. Zero-width joiners and non-joiners are meaningful in Persian, Arabic, and Indic scripts where they control ligature behavior. Stripping them in multilingual text can break correct rendering. Target only the specific characters that are problematic in your context.
Why does my JSON parsing fail on valid-looking data?
A common cause is a BOM (U+FEFF) at the start of the response body. JSON parsers expect the first character to be { or [, and the invisible BOM byte sequence causes a parse error. Strip the BOM before parsing.
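This failure mode is straightforward to reproduce with Python's json module:

```python
import json

body = '\ufeff{"status": "ok"}'  # response body with a leading BOM

try:
    json.loads(body)
except json.JSONDecodeError:
    print("parse failed on the BOM")

# Stripping the BOM first makes the payload parse normally.
print(json.loads(body.lstrip("\ufeff"))["status"])  # ok
```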
Do hidden characters affect password or username validation?
Yes. A username with a zero-width space can pass visual inspection but fail exact-match lookups, or worse, allow two accounts that look identical. Always normalize and strip invisible characters from authentication-related fields.
Related Tools
- Text Analysis Tool to inspect suspicious character counts
- URL Slug Generator to create stable slugs after cleanup
- URL Encoder / Decoder to inspect encoded characters in URLs
- Regex Tester Online to test patterns for detecting hidden characters
Related Guides
- NFC vs NFD: Unicode Normalization Explained for understanding composed vs decomposed forms
- utf8proc NFC Normalization in C for normalizing text in C projects
- Regex Debugging Guide for troubleshooting patterns that fail on hidden characters