Agent skill

crawl4ai

Use when crawling web pages, extracting markdown content, or scraping website data with intelligent chunking and skeleton planning. Use when the user provides a URL or link to fetch or crawl.

Install this agent skill in your project

npx add-skill https://github.com/tao3k/xiuxian-artisan-workshop/tree/main/skills/crawl4ai

Metadata

Additional technical details for this skill

author: xiuxian-artisan-workshop
source: https://github.com/unclecode/crawl4ai
intents: [ "Crawl a web page", "Extract markdown content", "Scrape website data", "Open or fetch a URL or link", "Research a URL or link", "Help me research a web page", "Perform deep crawl", "Get document skeleton/TOC", "Extract specific sections from web page" ]
version: 0.2.1
routing keywords: [ "crawl", "web", "scrape", "extract", "fetch", "url", "link", "research", "research url", "analyze page", "html", "markdown", "content", "scraper", "crawler", "web scraping", "web crawl", "page content", "web extraction" ]

SKILL.md

crawl4ai

High-performance web crawler with intelligent chunking. Crawls web pages and extracts content as markdown using LLM-based skeleton planning.

Commands

`crawl_url` (alias: `webCrawl`)

Crawl a web page with native workflow execution and LLM-based intelligent chunking.

Parameters:

Parameter	Type	Default	Description
`url`	str	-	Target URL to crawl (required)
`action`	str	"smart"	Action mode: "smart", "skeleton", "crawl"
`fit_markdown`	bool	true	Clean and simplify markdown output
`max_depth`	int	0	Maximum crawling depth (0=single page)
`return_skeleton`	bool	false	Also return document skeleton (TOC)
`chunk_indices`	list[int]	-	List of section indices to extract
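
The parameter table above can be mirrored in a small request builder. This is a minimal sketch, not part of the skill itself: the `CrawlRequest` class and its `to_payload` method are hypothetical names introduced here for illustration, and the validation rules (non-empty `url`, known `action`, non-negative `max_depth`) are assumptions inferred from the table rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional

# Action names taken from the parameter table above.
VALID_ACTIONS = {"smart", "skeleton", "crawl"}


@dataclass
class CrawlRequest:
    """Hypothetical container mirroring the documented crawl_url parameters."""

    url: str
    action: str = "smart"
    fit_markdown: bool = True
    max_depth: int = 0
    return_skeleton: bool = False
    chunk_indices: Optional[list[int]] = None

    def to_payload(self) -> dict:
        """Validate the fields and return only the non-default arguments."""
        if not self.url:
            raise ValueError("url is required")
        if self.action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        if self.max_depth < 0:
            # Assumption: negative depths are meaningless for this tool.
            raise ValueError("max_depth must be >= 0")

        payload: dict = {"url": self.url}
        if self.action != "smart":
            payload["action"] = self.action
        if not self.fit_markdown:
            payload["fit_markdown"] = False
        if self.max_depth:
            payload["max_depth"] = self.max_depth
        if self.return_skeleton:
            payload["return_skeleton"] = True
        if self.chunk_indices is not None:
            payload["chunk_indices"] = list(self.chunk_indices)
        return payload
```

Omitting arguments that equal their defaults keeps the payload close to the short forms used in the examples later in this document, such as `{"url": "https://example.com", "action": "skeleton"}`.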

Action Modes:

Mode	Description	Use Case
`smart` (default)	LLM generates chunk plan, then extracts relevant sections	Large docs where you need specific info
`skeleton`	Extract lightweight TOC without full content	Quick overview, decide what to read
`crawl`	Return full markdown content	Small pages, complete content needed

Runtime Transport:

max_depth = 0: Uses HTTP strategy (no browser cold-start) for lower latency.
max_depth > 0: Uses browser deep-crawl strategy (BFS) for multi-page traversal.
file://... with max_depth = 0: Uses local fast-path (no crawl4ai runtime bootstrap) for deterministic fixture/local-note benchmarking.
Persistent worker mode reuses the HTTP crawler instance across requests to reduce repeated initialization cost.
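
The transport rules above can be summarized as a small decision function. This is an illustrative sketch only: `select_transport` and the returned labels are hypothetical and do not come from the skill's source. The source does not say what happens for a `file://` URL with `max_depth > 0`, so the sketch assumes the depth rule takes precedence.

```python
def select_transport(url: str, max_depth: int = 0) -> str:
    """Return a label for the transport the documented rules would pick.

    The labels are hypothetical names used only in this sketch.
    """
    if max_depth < 0:
        raise ValueError("max_depth must be >= 0")
    if max_depth > 0:
        # Multi-page traversal: browser deep-crawl strategy (BFS).
        # Assumption: this also applies to file:// URLs with max_depth > 0.
        return "browser-bfs"
    if url.startswith("file://"):
        # Local fast-path: no crawl4ai runtime bootstrap.
        return "local-fast-path"
    # Single page over the network: HTTP strategy, no browser cold-start.
    return "http"
```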

Examples:

# Smart crawl with LLM chunking (default)
tool: `crawl4ai.CrawlUrl` with `{"url": "https://example.com"}`

# Skeleton only - get TOC quickly
tool: `crawl4ai.CrawlUrl` with `{"url": "https://example.com", "action": "skeleton"}`

# Full content crawl
tool: `crawl4ai.CrawlUrl` with `{"url": "https://example.com", "action": "crawl"}`

# Extract specific sections
tool: `crawl4ai.CrawlUrl` with `{"url": "https://example.com", "chunk_indices": [0, 1, 2]}`

# Deep crawl (follow links up to depth 2)
tool: `crawl4ai.CrawlUrl` with `{"url": "https://example.com", "max_depth": 2}`

# Also return the document skeleton (TOC) alongside the content
tool: `crawl4ai.CrawlUrl` with `{"url": "https://example.com", "return_skeleton": true}`

Core Concepts

Topic	Description	Reference
Skeleton Planning	LLM sees the TOC (~500 tokens) instead of the full content (~10k+ tokens)	smart-chunking.md
Chunk Extraction	Token-aware section extraction	chunking.md
Deep Crawling	Multi-page crawling with BFS strategy	deep-crawl.md
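
To make the skeleton and chunk ideas concrete, the sketch below builds a TOC from markdown headings and then extracts sections by index. It is a simplified illustration under stated assumptions, not the skill's implementation: the function names are hypothetical, sections are split on ATX headings (`#` to `######`) only, text before the first heading is dropped, headings inside code fences are not special-cased, and there is no token counting, so it is not token-aware.

```python
import re

# Matches ATX headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_sections(markdown: str) -> list[dict]:
    """Split markdown into sections, one per heading."""
    sections: list[dict] = []
    current = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            current = {
                "level": len(match.group(1)),
                "title": match.group(2).strip(),
                "lines": [],
            }
            sections.append(current)
        elif current is not None:
            current["lines"].append(line)
    return sections


def skeleton(sections: list[dict]) -> list[dict]:
    """Return a lightweight TOC: index, heading level, and title only."""
    return [
        {"index": i, "level": s["level"], "title": s["title"]}
        for i, s in enumerate(sections)
    ]


def extract_chunks(sections: list[dict], chunk_indices: list[int]) -> str:
    """Return the full text of the selected sections, in the order given."""
    parts = []
    for i in chunk_indices:
        s = sections[i]
        body = "\n".join(s["lines"]).strip()
        parts.append(f"{'#' * s['level']} {s['title']}\n{body}".strip())
    return "\n\n".join(parts)
```

The point of the split is that a planner only needs the output of `skeleton` to decide which indices matter, and the much larger section bodies are read only for those indices.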

Best Practices

Use skeleton mode first for large documents to understand structure
Use chunk_indices to extract specific sections instead of full content
Set max_depth > 0 with care: each extra level can multiply the number of pages crawled, so keep the depth low to prevent runaway crawling
Keep fit_markdown=true for cleaner output, false for raw content
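
The caution about `max_depth` is easier to see with a depth-limited breadth-first traversal over a toy link graph. This is a sketch under assumptions: `bfs_crawl` is a hypothetical helper that walks an in-memory dictionary rather than fetching pages, and `max_pages` is an extra safety cap added for illustration, not a documented parameter of this skill.

```python
from collections import deque
from typing import Optional


def bfs_crawl(
    start: str,
    links: dict[str, list[str]],
    max_depth: int,
    max_pages: Optional[int] = None,
) -> list[str]:
    """Visit pages breadth-first, following links at most max_depth levels."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            # Do not follow links from pages at the depth limit.
            continue
        for nxt in links.get(page, []):
            if nxt in seen:
                continue
            if max_pages is not None and len(visited) >= max_pages:
                # Hypothetical cap: stop once enough pages are collected.
                return visited
            seen.add(nxt)
            visited.append(nxt)
            queue.append((nxt, depth + 1))
    return visited


# Toy site: each key links to the pages in its list.
SITE = {"a": ["b", "c"], "b": ["d", "e"], "c": ["f"], "d": ["g"]}
```

On this toy graph, `max_depth=0` visits only the start page, `max_depth=1` visits three pages, and `max_depth=2` visits six, which shows how quickly the page count can grow with each extra level.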