harvest-deep-crawl

Multi-page deep crawling for documentation sites, wikis, and knowledge bases

Install this agent skill in your project:

npx add-skill https://github.com/vibeeval/vibecosystem/tree/main/skills/harvest-deep-crawl

Harvest Deep Crawl

Crawl multi-page websites following internal links to a specified depth. Ideal for building complete knowledge bases from documentation sites, wikis, and reference materials.

Usage

/crawl <url> --depth <N>

Examples

```bash
# Crawl docs site 3 levels deep
/crawl https://docs.example.com --depth 3

# Crawl a specific section
/crawl https://docs.example.com/api --depth 2

# Crawl with page limit
/crawl https://wiki.example.com --depth 5 --max-pages 50
```

Parameters

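| Param | Default | Description |
|-------|---------|-------------|
| `--depth` | 2 | Maximum link-following depth |
| `--max-pages` | 100 | Maximum number of pages to crawl |
| `--same-domain` | true | Stay on the same domain as the start URL |
| `--include` | `*` | URL pattern to include |
| `--exclude` | none | URL pattern to exclude |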
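The filtering parameters can be illustrated with a short sketch. The source does not say how `--include` and `--exclude` patterns are matched, so this example assumes shell-style globs applied to the full URL, and assumes "same domain" means an identical host. The function name `should_crawl` is hypothetical and not part of the skill.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def should_crawl(url, root_url, include="*", exclude=None, same_domain=True):
    """Return True if `url` passes the (assumed) parameter filters.

    Assumptions: patterns are shell-style globs matched against the whole
    URL, and "same domain" means the host part is identical to the root's.
    """
    if same_domain and urlparse(url).netloc != urlparse(root_url).netloc:
        return False
    if not fnmatchcase(url, include):
        return False
    if exclude is not None and fnmatchcase(url, exclude):
        return False
    return True


if __name__ == "__main__":
    root = "https://docs.example.com"
    for candidate in (
        "https://docs.example.com/api/auth",
        "https://docs.example.com/blog/post",
        "https://other.example.com/api/auth",
    ):
        print(candidate, should_crawl(candidate, root, include="*/api/*"))
```

If the real skill uses regular expressions or path-prefix matching instead of globs, only the two `fnmatchcase` calls would need to change.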

How It Works

1. Start at the root URL and extract all internal links.
2. Follow links breadth-first (BFS) up to the specified depth.
3. Extract the content of each page.
4. Deduplicate pages whose content overlaps by more than 90%.
5. Build a table of contents from the page hierarchy.
6. Merge the pages into a coherent knowledge base.
7. Save the result to .claude/cache/agents/harvest/crawl-{domain}-{timestamp}/
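The traversal and deduplication steps above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: `get_links` is a hypothetical callable standing in for link extraction, and because the source does not define how "content overlap" is measured, the sketch assumes Jaccard similarity over each page's set of lowercase words.

```python
from collections import deque


def bfs_crawl(root, get_links, max_depth=2, max_pages=100):
    """Visit pages breadth-first and return (url, depth) pairs in visit order.

    `get_links` is a hypothetical callable: url -> list of internal URLs.
    """
    seen = {root}
    queue = deque([(root, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


def overlap(text_a, text_b):
    """Jaccard similarity of the two texts' word sets (an assumed measure)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty pages count as identical
    return len(words_a & words_b) / len(words_a | words_b)


def deduplicate(pages, threshold=0.9):
    """Keep the first page of each near-duplicate group.

    `pages` is a list of (url, text); a page is dropped when its overlap
    with an already kept page is strictly greater than `threshold`.
    """
    kept = []
    for url, text in pages:
        if all(overlap(text, kept_text) <= threshold for _, kept_text in kept):
            kept.append((url, text))
    return kept


if __name__ == "__main__":
    # A tiny in-memory link graph stands in for a real site.
    graph = {"/": ["/a", "/b"], "/a": ["/a1", "/"], "/b": ["/b1"], "/a1": ["/a2"]}
    print(bfs_crawl("/", lambda u: graph.get(u, []), max_depth=2))
```

The `seen` set is what keeps the crawl from revisiting pages that link back to each other, and checking `max_depth` before expanding links is what bounds the traversal.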

Output Structure

crawl-{domain}-{timestamp}/
  index.md          # Table of contents + summary
  page-001.md       # First page content
  page-002.md       # Second page content
  ...
  metadata.json     # Crawl stats, URLs, timings
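A sketch of how this layout might be written to disk is shown below. The file names follow the structure above; everything else is an assumption for illustration, including the function name `write_crawl_output`, the format of the table of contents, the timestamp format, and the fields stored in `metadata.json` (the source only says it holds crawl stats, URLs, and timings).

```python
import json
from pathlib import Path


def write_crawl_output(base_dir, domain, timestamp, pages):
    """Write pages as page-NNN.md plus index.md and metadata.json.

    `pages` is a list of (url, markdown_text). Returns the output directory.
    The metadata fields used here are illustrative assumptions.
    """
    out_dir = Path(base_dir) / f"crawl-{domain}-{timestamp}"
    out_dir.mkdir(parents=True, exist_ok=True)

    toc = ["# Table of contents", ""]
    for number, (url, text) in enumerate(pages, start=1):
        name = f"page-{number:03d}.md"  # zero-padded: page-001.md, page-002.md
        (out_dir / name).write_text(text, encoding="utf-8")
        toc.append(f"- [{url}]({name})")
    (out_dir / "index.md").write_text("\n".join(toc) + "\n", encoding="utf-8")

    metadata = {"page_count": len(pages), "urls": [url for url, _ in pages]}
    (out_dir / "metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
    return out_dir
```

Calling `write_crawl_output(".claude/cache/agents/harvest", "docs.example.com", "20240101-000000", pages)` would create one directory per crawl, so repeated crawls of the same domain do not overwrite each other.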

Crawl Engine

Primary: crawl4ai, running in Docker and listening on port 11235

```bash
curl -s http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.example.com"],
    "max_depth": 3,
    "same_domain": true,
    "word_count_threshold": 50
  }'
```
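The same request can be issued from Python using only the standard library. The endpoint and payload fields below are copied from the curl example and have not been checked against the crawl4ai API; the function names `build_crawl_request` and `run_crawl` are hypothetical, and the response format is not assumed, so the body is returned as raw text.

```python
import json
import urllib.request

# Endpoint taken from the curl example; assumes the container is running locally.
CRAWL_ENDPOINT = "http://localhost:11235/crawl"


def build_crawl_request(url, max_depth=3, same_domain=True, word_count_threshold=50):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "urls": [url],
        "max_depth": max_depth,
        "same_domain": same_domain,
        "word_count_threshold": word_count_threshold,
    }
    return urllib.request.Request(
        CRAWL_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_crawl(url, timeout=60, **options):
    """Send the request and return the raw response body as text."""
    request = build_crawl_request(url, **options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Separating request construction from sending makes the payload easy to inspect or log before any network call; `run_crawl("https://docs.example.com", max_depth=3)` would only succeed with the crawl4ai container up.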