Agent skill

visitor-pattern-usage

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/development/visitor-pattern-usage

SKILL.md

Visitor Pattern Usage for html-to-markdown

Overview

The visitor pattern in html-to-markdown provides extensible hooks into the HTML-to-Markdown conversion pipeline. Custom visitors can inspect, modify, or replace the default conversion behavior for any of the 60+ HTML element types.

Architecture Philosophy

Key Principles:

  • Flexibility over performance: Give users full control over conversion
  • Zero-cost when unused: No overhead if visitor feature disabled
  • Comprehensive coverage: All HTML element types have dedicated visitor methods
  • Pre/post hooks: Both entry and exit points exposed for elements

Feature Gates

The visitor pattern is conditionally compiled:

rust
#[cfg(feature = "visitor")]
pub mod visitor;

#[cfg(feature = "async-visitor")]
pub use visitor_helpers::AsyncVisitorHandle;

In Cargo.toml:

toml
[features]
default = ["metadata"]
visitor = []
async-visitor = ["visitor", "dep:async-trait"]

Core Traits and Types

NodeType Enumeration

Located in /crates/html-to-markdown/src/visitor.rs, categorizes all HTML elements:

rust
pub enum NodeType {
    // Text content
    Text,

    // Block elements
    Heading,
    Paragraph,
    Div,
    Blockquote,
    Pre,
    Hr,

    // Lists
    List,               // ul, ol
    ListItem,           // li
    DefinitionList,     // dl
    DefinitionTerm,     // dt
    DefinitionDescription, // dd

    // Tables
    Table,
    TableRow,
    TableCell,
    TableHeader,
    TableBody,
    TableHead,
    TableFoot,

    // Inline formatting
    Link,
    Image,
    Strong,
    Em,
    Code,
    Strikethrough,
    Mark,
    Sub,
    Sup,
    LineBreak,
    Ruby,

    // Semantic HTML5
    Article,
    Section,
    Nav,
    Aside,
    Header,
    Footer,
    Main,

    // Media
    Audio,
    Video,
    Picture,
    Iframe,
    Svg,

    // Forms
    Input,
    Select,
    Button,
    Textarea,
    Fieldset,

    // Other
    Form,
    Label,
    Span,
    Generic(String),  // Unknown tags
}

VisitResult Enumeration

Specifies what action the conversion should take:

rust
pub enum VisitResult {
    /// Use default conversion for this element
    Default,

    /// Skip this element entirely (no output)
    Skip,

    /// Custom markdown for this element
    Custom(String),

    /// Process children normally, wrap with custom before/after
    Custom(String),  // Could also support Wrap variant

    /// Replace element content with custom markdown
    Replace(String),
}

NodeContext Structure

Provides context about the current node being visited:

rust
pub struct NodeContext {
    pub node_type: NodeType,
    pub tag_name: Option<String>,        // Actual HTML tag if element
    pub attributes: BTreeMap<String, String>,  // All HTML attributes
    pub parent_node_type: Option<NodeType>,    // Parent element type
    pub depth: usize,                    // Nesting depth
    pub position_in_parent: usize,       // Sibling index
}

HtmlVisitor Trait

The main visitor trait with methods for each element type:

rust
pub trait HtmlVisitor {
    // Generic element fallback
    fn visit_element(
        &mut self,
        ctx: &NodeContext,
        tag: &str,
        attributes: &BTreeMap<String, String>,
    ) -> VisitResult;

    // Text content
    fn visit_text(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;

    // Headings
    fn visit_heading(
        &mut self,
        ctx: &NodeContext,
        level: u8,  // 1-6
        text: &str,
    ) -> VisitResult;

    fn visit_paragraph(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;

    // Links and images
    fn visit_link(
        &mut self,
        ctx: &NodeContext,
        href: &str,
        text: &str,
        title: Option<&str>,
    ) -> VisitResult;

    fn visit_image(
        &mut self,
        ctx: &NodeContext,
        src: &str,
        alt: &str,
        title: Option<&str>,
    ) -> VisitResult;

    // Formatting
    fn visit_strong(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;
    fn visit_em(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;
    fn visit_code(&mut self, ctx: &NodeContext, code: &str) -> VisitResult;
    fn visit_code_block(
        &mut self,
        ctx: &NodeContext,
        code: &str,
        language: Option<&str>,
    ) -> VisitResult;
    fn visit_strikethrough(&mut self, ctx: &NodeContext, text: &str) -> VisitResult;

    // Lists
    fn visit_list(
        &mut self,
        ctx: &NodeContext,
        ordered: bool,
        items: &[String],
    ) -> VisitResult;

    fn visit_list_item(
        &mut self,
        ctx: &NodeContext,
        content: &str,
        index: usize,
    ) -> VisitResult;

    // Tables
    fn visit_table(
        &mut self,
        ctx: &NodeContext,
        rows: &[Vec<String>],
    ) -> VisitResult;

    fn visit_table_cell(
        &mut self,
        ctx: &NodeContext,
        content: &str,
        is_header: bool,
    ) -> VisitResult;

    // ... and 40+ more visitor methods
}

Basic Example: Custom Link Converter

Convert all external links to plain text with URLs in parentheses:

rust
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};

#[derive(Debug)]
struct PlainTextLinkVisitor;

impl HtmlVisitor for PlainTextLinkVisitor {
    fn visit_link(
        &mut self,
        _ctx: &NodeContext,
        href: &str,
        text: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        // Convert all links to plain text with URL
        VisitResult::Custom(format!("{} ({})", text, href))
    }

    // ... implement other visitor methods as Default
}

// Usage
let html = r#"<p>Visit <a href="https://example.com">our site</a></p>"#;
let mut visitor = PlainTextLinkVisitor;
let markdown = convert_with_visitor(html, None, Some(&mut visitor))?;
// Output: Visit our site (https://example.com)

Advanced Example: Custom Code Block Highlighter

Highlight code blocks with language-specific syntax:

rust
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};

#[derive(Debug)]
struct HighlightingVisitor;

impl HtmlVisitor for HighlightingVisitor {
    fn visit_code_block(
        &mut self,
        _ctx: &NodeContext,
        code: &str,
        language: Option<&str>,
    ) -> VisitResult {
        match language {
            Some("python") => {
                // Custom Python highlighting
                VisitResult::Custom(format!(
                    "```python\n<!-- HIGHLIGHTED -->\n{}\n```",
                    code
                ))
            }
            Some("rust") => {
                // Custom Rust highlighting
                VisitResult::Custom(format!(
                    "```rust\n<!-- WITH SYNTAX HIGHLIGHTING -->\n{}\n```",
                    code
                ))
            }
            _ => VisitResult::Default,  // Use default for other languages
        }
    }

    fn visit_link(
        &mut self,
        _ctx: &NodeContext,
        href: &str,
        text: &str,
        title: Option<&str>,
    ) -> VisitResult {
        // Links in documentation: add reference-style syntax
        VisitResult::Custom(format!("[{}][{}]", text, href))
    }

    fn visit_heading(
        &mut self,
        _ctx: &NodeContext,
        level: u8,
        text: &str,
    ) -> VisitResult {
        // Add anchor links to all headings
        let id = text.to_lowercase().replace(' ', '-');
        VisitResult::Custom(format!(
            "{} {{#{}}}\n",
            "#".repeat(level as usize),
            id
        ))
    }
}

// Usage
let html = r#"
<h1>Documentation</h1>
<p>See <a href="https://docs.rs">our docs</a></p>
<pre><code class="language-rust">fn main() {}</code></pre>
"#;
let mut visitor = HighlightingVisitor;
let markdown = convert_with_visitor(html, None, Some(&mut visitor))?;

Filtering by Node Type

Visit only specific element types:

rust
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult, NodeType};

#[derive(Debug)]
struct ImageOnlyVisitor {
    image_count: usize,
}

impl HtmlVisitor for ImageOnlyVisitor {
    fn visit_image(
        &mut self,
        _ctx: &NodeContext,
        src: &str,
        alt: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        self.image_count += 1;
        println!("Image {}: {} ({})", self.image_count, alt, src);

        // Could extract images to separate directory
        VisitResult::Custom(format!("![{}]({})", alt, src))
    }

    fn visit_text(&mut self, _ctx: &NodeContext, _text: &str) -> VisitResult {
        VisitResult::Skip  // Skip all text, only output images
    }
}

// Usage
let mut visitor = ImageOnlyVisitor { image_count: 0 };
let markdown = convert_with_visitor(html, None, Some(&mut visitor))?;
println!("Found {} images", visitor.image_count);

Context-Aware Transformations

Use parent context and depth to transform based on structure:

rust
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult, NodeType};

#[derive(Debug)]
struct DepthTrackingVisitor {
    current_depth: usize,
}

impl HtmlVisitor for DepthTrackingVisitor {
    fn visit_paragraph(
        &mut self,
        ctx: &NodeContext,
        text: &str,
    ) -> VisitResult {
        // Different formatting based on depth
        match ctx.depth {
            0 => VisitResult::Custom(format!("**{}**\n", text)),  // Bold at top level
            1 => VisitResult::Custom(format!("*{}*\n", text)),    // Italic nested once
            _ => VisitResult::Default,                            // Normal elsewhere
        }
    }

    fn visit_link(
        &mut self,
        ctx: &NodeContext,
        href: &str,
        text: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        // Links in blockquotes get footnote style
        if let Some(NodeType::Blockquote) = ctx.parent_node_type {
            VisitResult::Custom(format!("{}[^{}]", text, href))
        } else {
            VisitResult::Default
        }
    }
}

// Usage
let mut visitor = DepthTrackingVisitor { current_depth: 0 };
let markdown = convert_with_visitor(html, None, Some(&mut visitor))?;

Attribute-Based Routing

Route handling based on HTML attributes:

rust
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};

#[derive(Debug)]
struct AttributeRoutingVisitor;

impl HtmlVisitor for AttributeRoutingVisitor {
    fn visit_link(
        &mut self,
        ctx: &NodeContext,
        href: &str,
        text: &str,
        title: Option<&str>,
    ) -> VisitResult {
        // Custom handling for data attributes
        if let Some(tracking_id) = ctx.attributes.get("data-tracking-id") {
            return VisitResult::Custom(format!(
                "[{}]({} \"{}\")",
                text,
                href,
                tracking_id
            ));
        }

        // Skip links marked with data-skip="true"
        if ctx.attributes.get("data-skip").map_or(false, |v| v == "true") {
            return VisitResult::Skip;
        }

        VisitResult::Default
    }

    fn visit_paragraph(
        &mut self,
        ctx: &NodeContext,
        text: &str,
    ) -> VisitResult {
        // Blockquote paragraphs differently
        if ctx.attributes.get("data-featured") == Some(&"true".to_string()) {
            VisitResult::Custom(format!("> {}\n", text))
        } else {
            VisitResult::Default
        }
    }
}

// Usage
let html = r#"
<a href="/page" data-tracking-id="click-001">Track me</a>
<a href="/skip" data-skip="true">Skip me</a>
<p data-featured="true">Important paragraph</p>
"#;
let mut visitor = AttributeRoutingVisitor;
let markdown = convert_with_visitor(html, None, Some(&mut visitor))?;

Async Visitor Support

For languages with native async/await (Python, TypeScript, Elixir):

rust
#[cfg(feature = "async-visitor")]
pub async fn convert_with_async_visitor(
    html: &str,
    options: Option<ConversionOptions>,
    visitor: Option<AsyncVisitorHandle>,
) -> Result<String> { ... }

Python Async Example (PyO3)

python
import asyncio
import html_to_markdown

class AsyncSyntaxHighlighter:
    async def visit_code_block(self, ctx, code, language):
        # Call async syntax highlighting service
        highlighted = await highlight_service.highlight(code, language)
        return f"```{language}\n{highlighted}\n```"

    async def visit_link(self, ctx, href, text, title):
        # Check external link status asynchronously
        is_valid = await check_link_validity(href)
        if is_valid:
            return f"[{text}]({href})"
        else:
            return f"~~[{text}]({href})~~ (broken)"

# Usage
markdown = await html_to_markdown.convert_with_async_visitor(
    html,
    None,
    AsyncSyntaxHighlighter()
)

TypeScript Async Example (NAPI-RS)

typescript
import { convertWithAsyncVisitor } from 'html-to-markdown';

class AsyncContentProcessor {
    async visitLink(ctx, href, text, title) {
        // Fetch metadata for link
        const metadata = await fetch(href).then(r => r.json());
        return `[${text}](${href} "${metadata.title}")`;
    }

    async visitImage(ctx, src, alt, title) {
        // Optimize image
        const optimized = await imageOptimizer.optimize(src);
        return `![${alt}](${optimized})`;
    }
}

const markdown = await convertWithAsyncVisitor(html, undefined, new AsyncContentProcessor());

State Management in Visitors

Maintain state across multiple visits:

rust
use html_to_markdown_rs::visitor::{HtmlVisitor, NodeContext, VisitResult};
use std::collections::HashSet;

#[derive(Debug)]
struct LinkCollectorVisitor {
    external_links: HashSet<String>,
    email_links: HashSet<String>,
    internal_links: HashSet<String>,
}

impl HtmlVisitor for LinkCollectorVisitor {
    fn visit_link(
        &mut self,
        _ctx: &NodeContext,
        href: &str,
        _text: &str,
        _title: Option<&str>,
    ) -> VisitResult {
        if href.starts_with("mailto:") {
            self.email_links.insert(href.to_string());
        } else if href.starts_with("http") {
            self.external_links.insert(href.to_string());
        } else {
            self.internal_links.insert(href.to_string());
        }

        VisitResult::Default  // Keep default link formatting
    }
}

// Usage
let mut visitor = LinkCollectorVisitor {
    external_links: HashSet::new(),
    email_links: HashSet::new(),
    internal_links: HashSet::new(),
};

let markdown = convert_with_visitor(html, None, Some(&mut visitor))?;

println!("External: {:?}", visitor.external_links);
println!("Email: {:?}", visitor.email_links);
println!("Internal: {:?}", visitor.internal_links);

Performance Considerations

Impact on Conversion Speed

  • Visitor trait calls add function dispatch overhead
  • For large documents (10K+ elements), overhead ~5-10%
  • Use feature gate to eliminate entirely if not needed

Optimization Strategies

  1. Fast path for most elements:

    rust
    fn visit_text(&mut self, _ctx: &NodeContext, _text: &str) -> VisitResult {
        VisitResult::Default  // Quick return for most text nodes
    }
    
  2. Only override when needed:

    rust
    // Only override link handling
    // All other methods inherit Default implementation
    
  3. Avoid allocations in hot path:

    rust
    // Bad: allocate string for every node
    VisitResult::Custom(format!(">{}<", text))
    
    // Better: pre-allocate or use Cow
    let mut result = String::with_capacity(text.len() + 2);
    result.push('>');
    result.push_str(text);
    result.push('<');
    VisitResult::Custom(result)
    

Integration with Conversion Options

Visitors work alongside ConversionOptions:

rust
use html_to_markdown_rs::{ConversionOptions, HeadingStyle};

let options = ConversionOptions {
    heading_style: HeadingStyle::AtxClosed,  // User preference
    wrap: true,
    wrap_width: 80,
    ..Default::default()
};

// Visitor can override specific behaviors
let mut visitor = CustomVisitor;
let markdown = convert_with_visitor(html, Some(options), Some(&mut visitor))?;

Priority: Visitor always takes precedence. If visitor returns Custom or Skip, conversion options are bypassed for that element.

Error Handling in Visitors

The visitor pattern doesn't support errors directly. Return Default or Skip instead:

rust
impl HtmlVisitor for SafeVisitor {
    fn visit_link(
        &mut self,
        _ctx: &NodeContext,
        href: &str,
        text: &str,
        title: Option<&str>,
    ) -> VisitResult {
        // Can't return error, so validate and fallback
        if href.is_empty() {
            return VisitResult::Custom(text.to_string());  // Fallback to text
        }

        VisitResult::Default
    }
}

Testing Visitors

Located in binding test suites (Python, TypeScript, Ruby, PHP):

bash
# Test visitor feature
task rust:test  # Includes visitor tests

# Binding-specific visitor tests
task python:test  # tests/test_visitor.py
task typescript:test  # packages/typescript/tests/visitor.spec.ts
task ruby:test  # packages/ruby/spec/visitor_spec.rb

Implementation Location

Core Files:

  • /crates/html-to-markdown/src/visitor.rs - Trait definitions and NodeType enum
  • /crates/html-to-markdown/src/visitor_helpers.rs - VisitorHandle and async support
  • /crates/html-to-markdown/src/converter.rs - Integration with conversion pipeline

Binding Examples:

  • /crates/html-to-markdown-py/src/lib.rs - PyO3 visitor wrapping
  • /crates/html-to-markdown-node/src/lib.rs - NAPI-RS visitor support
  • /packages/ruby/lib/visitor.rb - Ruby visitor interface
  • /packages/php/src/Visitor.php - PHP visitor base class

API Pattern

rust
// Simple visitor (sync)
pub fn convert_with_visitor(
    html: &str,
    options: Option<ConversionOptions>,
    visitor: Option<visitor::VisitorHandle>,
) -> Result<String>

// Async visitor (for languages with native async)
#[cfg(feature = "async-visitor")]
pub async fn convert_with_async_visitor(
    html: &str,
    options: Option<ConversionOptions>,
    visitor: Option<AsyncVisitorHandle>,
) -> Result<String>

// Combined with metadata (future enhancement)
// pub fn convert_with_metadata_and_visitor(...) -> Result<(String, ExtendedMetadata)>

Quick Reference: Common Visitor Patterns

Use Case Implementation
Skip certain elements Return VisitResult::Skip
Modify element output Return VisitResult::Custom(new_markdown)
Track state Use &mut self fields to accumulate data
Conditional routing Use ctx fields (parent, depth, attributes)
Preserve default Return VisitResult::Default
Context-aware Match on ctx.parent_node_type, ctx.depth
Attribute-based Read from ctx.attributes map
Stateless transformation Implement stateless visitor struct

Didn't find tool you were looking for?

Be as detailed as possible for better results