Firecrawl & Jina Web Scraping

Firecrawl vs WebFetch

Prefer firecrawl scrape URL --only-main-content over the WebFetch tool—it produces cleaner markdown, handles JavaScript-heavy pages, and avoids content truncation (>80% benchmark coverage). WebFetch is acceptable as a fallback when Firecrawl is unavailable.

bash

# Preferred approach:
firecrawl scrape https://docs.example.com/api --only-main-content

Token-Efficient Scraping

Inspired by Anthropic's dynamic filtering: always filter before reasoning. In Anthropic's benchmarks, dynamic filtering reduced input tokens by ~24% and improved accuracy by ~11%.

The Principle: Search → Filter → Scrape → Filter → Reason

DO:

Search (titles/URLs only) → Evaluate relevance → Scrape top hits → Filter by section → Reason

DON'T:

Search → Scrape everything → Reason over all of it
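
The first Filter stage (evaluating relevance from titles and URLs before scraping anything) can be sketched in Python as below. This is an illustration only: the `results` structure (dicts with `title` and `url` keys) and the `pick_top_hits` helper are hypothetical and not part of the Firecrawl CLI, and the scoring is a plain keyword count rather than anything Firecrawl or Anthropic prescribes.

```python
def pick_top_hits(results, keywords, limit=5):
    """Rank search hits by keyword matches in title and URL; keep the best few."""
    lowered = [k.lower() for k in keywords]

    def score(hit):
        haystack = f"{hit.get('title', '')} {hit.get('url', '')}".lower()
        return sum(haystack.count(k) for k in lowered)

    scored = [(score(hit), hit) for hit in results]
    # Drop hits with no keyword match, then sort best-first (stable for ties).
    relevant = [pair for pair in scored if pair[0] > 0]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [hit["url"] for _, hit in relevant[:limit]]


# Hypothetical search output: titles and URLs only, no page bodies yet.
results = [
    {"title": "API Authentication Guide", "url": "https://docs.example.com/api/auth"},
    {"title": "Company Blog", "url": "https://example.com/blog"},
    {"title": "Pricing", "url": "https://example.com/pricing"},
    {"title": "REST API Reference", "url": "https://docs.example.com/api"},
]
urls_to_scrape = pick_top_hits(results, ["api", "auth"], limit=3)
print(urls_to_scrape)
```

Only the URLs returned here would be passed on to the scrape step, so irrelevant pages never reach the context window.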

Step-by-Step Efficient Workflow

bash

# Step 1: Search — get titles/URLs only (cheap)
firecrawl search "query" --limit 20

# Step 2: Evaluate results, pick 3-5 best URLs

# Step 3: Scrape only those, filter to relevant sections
firecrawl scrape URL1 --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py \
  --sections "API,Authentication" --max-chars 5000

Post-Processing with filter_web_results.py

Pipe any Firecrawl or Exa output through this script to reduce context before reasoning:

bash

# Extract only matching sections from scraped page
firecrawl scrape URL --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "Pricing,Plans"

# Keep only paragraphs with keywords
firecrawl search "query" --scrape --pretty | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --keywords "pricing,cost" --max-chars 5000

# Extract specific JSON fields from API output
python3 ~/.claude/skills/exa-search/scripts/exa_search.py "query" --json | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --fields "title,url,text" --max-chars 3000

# Combine filters with stats
firecrawl scrape URL --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "API" --keywords "endpoint" --compact --stats

Full path: python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py

Flags: --sections, --keywords, --max-chars, --max-lines, --fields (JSON), --strip-links, --strip-images, --compact, --stats
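
To make the flags more concrete, here is a rough Python sketch of the kind of filtering that --sections, --keywords, and --max-chars describe. It is not the source of filter_web_results.py: the function names and the matching rules (case-insensitive substring match on Markdown headings, paragraphs separated by blank lines) are assumptions made for illustration.

```python
import re


def filter_sections(markdown, sections):
    """Keep only Markdown sections whose heading contains one of the names."""
    wanted = [s.lower() for s in sections]
    kept, keep = [], False
    for line in markdown.splitlines():
        heading = re.match(r"^#{1,6}\s+(.*)", line)
        if heading:
            # A new heading starts a new section; decide whether to keep it.
            keep = any(w in heading.group(1).lower() for w in wanted)
        if keep:
            kept.append(line)
    return "\n".join(kept)


def filter_keywords(text, keywords):
    """Keep only blank-line-separated paragraphs that mention a keyword."""
    wanted = [k.lower() for k in keywords]
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(p for p in paragraphs if any(w in p.lower() for w in wanted))


def truncate(text, max_chars):
    """Cap the output length, mirroring the idea behind --max-chars."""
    return text[:max_chars]


page = """# Product

Welcome to the product page.

## Pricing

The Pro plan costs $20 per month.

Contact sales for volume discounts.

## Changelog

Fixed a bug in the pricing page.
"""

pricing = filter_sections(page, ["Pricing"])
print(truncate(filter_keywords(pricing, ["cost"]), 5000))
```

Note that this sketch matches headings flatly, so a subsection under a matching heading is dropped unless its own heading also matches. The real script may behave differently.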

Other Token-Saving Patterns

Use --only-main-content to strip navigation and footer boilerplate, reducing token consumption. Omit only when nav/footer content is specifically needed.
Use firecrawl map URL --search "topic" first to find relevant subpages before scraping
Use --format links first to get URL list, evaluate, then scrape selectively
Use --max-chars with exa_contents.py to cap extraction length
Use --formats summary (Python API script) over full text when you need the gist, not raw content

Claude API Native Tools (for API Agent Builders)

Anthropic's API now offers built-in dynamic filtering tools:

web_search_20260209 / web_fetch_20260209
Header: anthropic-beta: code-execution-web-tools-2026-02-09

Both tools filter results dynamically via code execution. Use them when building Claude API agents directly. Use Firecrawl/Exa when you need autonomous agents, batch scraping, structured extraction, or domain-specific crawling, or when you are not on the Claude API.

Available Tools

1. Official Firecrawl CLI (`firecrawl`) — Primary

Setup: npm install -g firecrawl-cli && firecrawl login --api-key $FIRECRAWL_API_KEY

Command	Purpose	Quick Example
`scrape`	Single page → markdown	`firecrawl scrape URL --only-main-content`
`crawl`	Entire site with progress	`firecrawl crawl URL --wait --progress --limit 50`
`map`	Discover all URLs on a site	`firecrawl map URL --search "API"`
`search`	Web search (+ optional scrape)	`firecrawl search "query" --limit 10`

Full CLI reference: references/cli-reference.md

2. Auto-Save Alias (`fc-save`) — Shell Alias

Requires shell alias setup (not bundled with this skill).

bash

fc-save URL
# → Saves to ~/Desktop/Screencaps & Chats/Web-Scrapes/docs-example-com-api.md
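
Because the alias is not bundled, its exact behavior is not documented here. One plausible way to derive a filename like the one shown above is sketched below in Python, assuming the example URL was https://docs.example.com/api. The `scrape_filename` helper is hypothetical and is not how fc-save is known to work.

```python
import re
from urllib.parse import urlparse


def scrape_filename(url):
    """Turn a URL into a flat, filesystem-friendly Markdown filename."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return f"{slug}.md"


print(scrape_filename("https://docs.example.com/api"))  # docs-example-com-api.md
```

Query strings and fragments are ignored in this sketch, so two URLs that differ only in their query would map to the same filename.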

3. Python API Script (`firecrawl_api.py`) — Advanced Features

Command: python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py <command>

Requires: FIRECRAWL_API_KEY env var, pip install firecrawl-py requests

Command	Purpose	Quick Example
`search`	Web search with scraping	`firecrawl_api.py search "query" -n 10`
`scrape`	Single URL with page actions	`firecrawl_api.py scrape URL --formats markdown summary`
`batch-scrape`	Multiple URLs concurrently	`firecrawl_api.py batch-scrape URL1 URL2 URL3`
`crawl`	Website crawling	`firecrawl_api.py crawl URL --limit 20`
`map`	URL discovery	`firecrawl_api.py map URL --search "query"`
`extract`	LLM-powered structured extraction	`firecrawl_api.py extract URL --prompt "Find pricing"`
`agent`	Autonomous extraction (no URLs needed)	`firecrawl_api.py agent "Find YC W24 AI startups"`
`parallel-agent`	Bulk agent queries (v2.8.0+)	`firecrawl_api.py parallel-agent "Q1" "Q2" "Q3"`

Agent models: spark-1-fast (10 credits, simple), spark-1-mini (default), spark-1-pro (thorough)

Full Python API reference: references/python-api-reference.md

4. DeepWiki — GitHub Repo Documentation

bash

~/.claude/skills/firecrawl/scripts/deepwiki.sh <owner/repo> [section] [options]

AI-generated wiki for any public GitHub repo. No API key required.

bash

# Overview
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat

# Browse sections
~/.claude/skills/firecrawl/scripts/deepwiki.sh langchain-ai/langchain --toc

# Specific section
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat 4.1-gpt-transformer-implementation

# Full dump for RAG
~/.claude/skills/firecrawl/scripts/deepwiki.sh openai/openai-python --all --save

5. Jina Reader (`jina`) — Fallback

Use when Firecrawl fails, or for Twitter/X URLs (Firecrawl blocks Twitter; Jina works).

bash

jina https://x.com/username/status/123456
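
The routing rule above (Jina for Twitter/X, Firecrawl otherwise) can be expressed as a small Python helper. This is a sketch under stated assumptions: `scrape_command` and `JINA_HOSTS` are hypothetical names, and the host list covers only the two domains this guide mentions.

```python
from urllib.parse import urlparse

# Hosts that this guide routes to Jina Reader instead of Firecrawl.
JINA_HOSTS = {"x.com", "twitter.com"}


def scrape_command(url):
    """Return the CLI command (as an argument list) to try first for a URL."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in JINA_HOSTS:
        return ["jina", url]
    return ["firecrawl", "scrape", url, "--only-main-content"]


print(scrape_command("https://x.com/username/status/123456"))
print(scrape_command("https://docs.example.com/api"))
```

Returning an argument list rather than a shell string means the result can be handed to `subprocess.run` without quoting the URL.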

Firecrawl vs Exa vs Native Claude Tools

Need	Best Tool	Why
Single page → markdown	`firecrawl scrape --only-main-content`	Cleanest output
Search + scrape in one shot	`firecrawl search --scrape`	Combined operation
Crawl entire site	`firecrawl crawl --wait --progress`	Link following + progress
Autonomous data finding	`firecrawl_api.py agent`	No URLs needed
Semantic/neural search	Exa `exa_search.py`	AI-powered relevance
Find research papers	Exa `--category "research paper"`	Academic index
Quick research answer	Exa `exa_research.py`	Citations + synthesis
Find similar pages	Exa `exa_similar.py`	Competitive analysis
Claude API agent building	Native `web_search_20260209`	Built-in dynamic filtering
Twitter/X content	`jina URL`	Only tool that works
GitHub repo docs	`deepwiki.sh owner/repo`	AI-generated wiki
Anti-bot / Cloudflare bypass	`scrapling` stealth fetch	Local Turnstile solver
Element-level extraction	`scrapling` + CSS selectors	Precision targeting, adaptive tracking
No API key scraping	`scrapling` HTTP fetch	100% local, no credentials
Site redesign resilience	`scrapling` adaptive mode	SQLite similarity matching

Common Workflows

Single Page Scraping

bash

firecrawl scrape https://example.com/page --only-main-content
# Or auto-save: fc-save URL
# Or to file: firecrawl scrape URL --only-main-content -o page.md

Documentation Crawling

bash

# Map first, then crawl relevant paths
firecrawl map https://docs.example.com --search "API"
firecrawl crawl https://docs.example.com --include-paths /api,/guides --wait --progress

Research Workflow

bash

firecrawl search "machine learning best practices 2026" --scrape --scrape-formats markdown

Agent-Powered Research (No URLs Needed)

bash

python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py agent \
  "Compare pricing tiers for Firecrawl, Apify, and ScrapingBee"

Troubleshooting

bash

# Check status and credits
firecrawl --status && firecrawl credit-usage

# Re-authenticate
firecrawl logout && firecrawl login --api-key $FIRECRAWL_API_KEY

# Check API key
echo $FIRECRAWL_API_KEY

Scrape fails: Try jina URL, or add --wait-for 3000 for JS-heavy sites
Async job stuck: Check with crawl-status/batch-status, cancel with crawl-cancel/batch-cancel
Disable telemetry: export FIRECRAWL_NO_TELEMETRY=1
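
The "scrape fails" advice above amounts to a fallback chain: plain scrape, then a scrape with --wait-for 3000, then Jina. A minimal Python sketch follows. It assumes each CLI exits non-zero or prints nothing on failure, which is not confirmed here; `scrape_with_fallback` and its `run` parameter are hypothetical names introduced for illustration.

```python
import subprocess


def scrape_with_fallback(url, run=subprocess.run):
    """Try Firecrawl, then Firecrawl with a wait, then Jina; return first output."""
    attempts = [
        ["firecrawl", "scrape", url, "--only-main-content"],
        ["firecrawl", "scrape", url, "--only-main-content", "--wait-for", "3000"],
        ["jina", url],
    ]
    for command in attempts:
        try:
            result = run(command, capture_output=True, text=True, timeout=120)
        except (OSError, subprocess.TimeoutExpired):
            continue  # tool missing or hung; move on to the next attempt
        # Assumption: success means exit code 0 and non-empty output.
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout
    raise RuntimeError(f"All scrape attempts failed for {url}")


# Usage (requires the CLIs to be installed and authenticated):
# markdown = scrape_with_fallback("https://example.com/page")
```

The `run` parameter exists so the chain can be exercised with a stub in tests, without network access or API credits.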

Reference Documentation

File	Contents
`references/cli-reference.md`	Full CLI parameter reference (scrape, crawl, map, search, fc-save, jina, deepwiki)
`references/python-api-reference.md`	Full Python API script reference (all commands, SDK examples)
`references/firecrawl-api.md`	Firecrawl Search API reference
`references/firecrawl-agent-api.md`	Agent API (spark models, parallel agents, webhooks)
`references/actions-reference.md`	Page actions for dynamic content (click, write, wait, scroll)
`references/branding-format.md`	Brand identity extraction (colors, fonts, UI)

Test Suite

bash

python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --quick    # Quick validation
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py            # Full suite
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --test scrape  # Specific test
