Agent skills
homoglyph-detector

Agent skill

homoglyph-detector

Byte-level Unicode homoglyph detection for identifying invisible character substitutions in code

Stars 514

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/a5c-ai/babysitter/tree/main/library/specializations/security-compliance/skills/homoglyph-detector

SKILL.md

Homoglyph Detector

Byte-level forensic analysis of code changes to detect Unicode homoglyph substitutions — characters that look identical to ASCII in every editor and diff tool but have different codepoints, silently breaking string comparisons, dictionary lookups, and identifier resolution.

Purpose

Homoglyph attacks (related to CVE-2021-42574 "Trojan Source") are the highest-stealth trojan technique. A Cyrillic р (U+0440) looks identical to a Latin p (U+0070) in every font, editor, and diff viewer. The only way to detect it is byte-level analysis via hexdump.

This skill pipes git diffs through hexdump -C and scans for multi-byte UTF-8 sequences where single-byte ASCII is expected, particularly in string literals used as dictionary keys, variable names, and identifiers.

Capabilities

Confusable Character Detection

Scans for these high-risk Unicode confusables:

Latin	Cyrillic	Greek	UTF-8 Bytes
a (61)	а (D0 B0)	α (CE B1)	1 vs 2 bytes
c (63)	с (D1 81)	—	1 vs 2 bytes
e (65)	е (D0 B5)	ε (CE B5)	1 vs 2 bytes
o (6F)	о (D0 BE)	ο (CE BF)	1 vs 2 bytes
p (70)	р (D1 80)	ρ (CF 81)	1 vs 2 bytes
x (78)	х (D1 85)	χ (CF 87)	1 vs 2 bytes
y (79)	у (D1 83)	—	1 vs 2 bytes

Zero-Width Character Detection

U+200B — Zero-width space
U+200C — Zero-width non-joiner
U+200D — Zero-width joiner
U+FEFF — Byte order mark (in non-BOM position)

Bidi Control Character Detection (Trojan Source)

U+200F — Right-to-left mark
U+200E — Left-to-right mark
U+202A — Left-to-right embedding
U+202B — Right-to-left embedding
U+202C — Pop directional formatting
U+2066 — Left-to-right isolate
U+2067 — Right-to-left isolate

Context-Aware Analysis

Focuses on string literals (dictionary keys, config values)
Focuses on identifiers (variable names, function names, class names)
Ignores legitimate Unicode in comments, docstrings, and i18n strings
Compares byte patterns between removed (-) and added (+) diff lines

Input Schema

json

{
  "type": "object",
  "required": ["projectRoot", "changedFiles"],
  "properties": {
    "projectRoot": {
      "type": "string",
      "description": "Absolute path to the git repository"
    },
    "changedFiles": {
      "type": "array",
      "items": { "type": "string" },
      "description": "List of changed file paths to scan"
    },
    "scanMode": {
      "type": "string",
      "enum": ["uncommitted", "commit-range", "branch-diff"],
      "default": "uncommitted"
    },
    "baseRef": { "type": "string" },
    "headRef": { "type": "string" }
  }
}

Output Schema

json

{
  "type": "object",
  "required": ["filesScanned", "homoglyphsFound", "verdict"],
  "properties": {
    "filesScanned": { "type": "number" },
    "homoglyphsFound": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "file": { "type": "string" },
          "line": { "type": "number" },
          "byteOffset": { "type": "string" },
          "context": { "type": "string" },
          "expectedAscii": { "type": "string" },
          "actualBytes": { "type": "string" },
          "unicodeCodepoint": { "type": "string" },
          "scriptName": { "type": "string" },
          "impact": { "type": "string" }
        }
      }
    },
    "bidiControlChars": { "type": "array" },
    "verdict": {
      "type": "string",
      "enum": ["CLEAN", "HOMOGLYPH_DETECTED"]
    }
  }
}

Detection Method

bash

# Step 1: Pipe git diff through hexdump
git diff <file> | hexdump -C

# Step 2: In added (+) lines, look for multi-byte sequences
# where the removed (-) line had single-byte ASCII
#
# Example — Latin 'p' vs Cyrillic 'р':
# Removed: 22 70 70 67 22   |  "ppg"  |   ← 70 = Latin 'p'
# Added:   22 d1 80 70 67   |  "..pg" |   ← d1 80 = Cyrillic 'р'
#
# The d1 80 bytes where 70 should be = HOMOGLYPH DETECTED

Usage Example

javascript

skill: {
  name: 'homoglyph-detector',
  context: {
    projectRoot: '/path/to/project',
    changedFiles: ['backend/app/prediction/temporal.py'],
    scanMode: 'uncommitted'
  }
}

Real-World Example

From adversarial drill #6:

Attack: Dictionary key "ppg" changed to "рpg" (Cyrillic р + Latin pg)
Camouflage: 4 lines of harmless round() wrappers added as decoy
Impact: All dict.get("ppg") lookups return default 0, disabling trend detection
Detection: hexdump -C revealed bytes d1 80 where 70 was expected

Process Files

nation-state-trojan-detection.js — Phase 2: Homoglyph Detection (parallel with semantic analysis)

Maintainer

a5c-ai Core maintainer

Source details

Full Name: a5c-ai/babysitter
Branch: main
Path in repo: library/specializations/security-compliance/skills/homoglyph-detector
License: MIT License
Topics: claude-code agent-skills claude-code-skills ai-agents claude-skills vibe-coding agentic-workflow agentic-ai ai-automation agent-orchestration babysitter trustworthy-ai

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

a5c-ai/babysitter

gsd-tools

Central utility skill for GSD operations. Provides config parsing, slug generation, timestamps, path operations, and orchestrates calls to other specialized skills. Acts as the unified entry point that the original gsd-tools.cjs provided via its lib/ modules (commands, config, core, init).

514 31

Explore

a5c-ai/babysitter

model-profile-resolution

Resolve model profile (quality/balanced/budget) at orchestration start and map agents to specific models. Enables cost/quality tradeoffs by selecting appropriate AI models for each agent role.

514 31

Explore

a5c-ai/babysitter

verification-suite

Plan structure validation, phase completeness checks, reference integrity verification, and artifact existence confirmation. Provides the structured verification layer ensuring GSD artifacts are well-formed and complete.

514 31

Explore

a5c-ai/babysitter

state-management

STATE.md reading, writing, and field-level updates. Provides cross-session state persistence via .planning/STATE.md with structured fields for current task, completed phases, blockers, decisions, and quick tasks.

514 31

Explore

a5c-ai/babysitter

git-integration

Git commit patterns, formats, and conventions for GSD methodology. Provides atomic commits per task, structured commit messages, planning file commits, branch management, and milestone tag operations.

514 31

Explore

a5c-ai/babysitter

frontmatter-parsing

YAML frontmatter parsing and manipulation for .planning/ documents. Provides read, write, update, query, and validation operations on frontmatter blocks in GSD markdown artifacts.

514 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Homoglyph Detector

Purpose

Capabilities

Confusable Character Detection

Zero-Width Character Detection

Bidi Control Character Detection (Trojan Source)

Context-Aware Analysis

Input Schema

Output Schema

Detection Method

Usage Example

Real-World Example

Process Files

Recommended Agent Skills

gsd-tools

model-profile-resolution

verification-suite

state-management

git-integration

frontmatter-parsing