Agent skill

scrapling

Web scraping with Scrapling - HTTP fetching, stealth browser automation, Cloudflare bypass, and spider crawling via CLI and Python.

View SKILL.md on GitHub Repository

Stars 56,643

Forks 7,481

Install this agent skill to your Project

npx add-skill https://github.com/NousResearch/hermes-agent/tree/main/optional-skills/research/scrapling

Metadata

Additional technical details for this skill

hermes: { "tags": [ "Web Scraping", "Browser", "Cloudflare", "Stealth", "Crawling", "Spider" ], "homepage": "https://github.com/D4Vinci/Scrapling", "related_skills": [ "duckduckgo-search", "domain-intel" ] }

SKILL.md

Scrapling

Scrapling is a web scraping framework with anti-bot bypass, stealth browser automation, and a spider framework. It provides three fetching strategies (HTTP, dynamic JS, stealth/Cloudflare) and a full CLI.

This skill is for educational and research purposes only. Users must comply with local/international data scraping laws and respect website Terms of Service.

When to Use

Scraping static HTML pages (faster than browser tools)
Scraping JS-rendered pages that need a real browser
Bypassing Cloudflare Turnstile or bot detection
Crawling multiple pages with a spider
When the built-in web_extract tool does not return the data you need

Installation

bash

pip install "scrapling[all]"
scrapling install

Minimal install (HTTP only, no browser):

bash

pip install scrapling

With browser automation only:

bash

pip install "scrapling[fetchers]"
scrapling install

Quick Reference

Approach	Class	Use When
HTTP	`Fetcher` / `FetcherSession`	Static pages, APIs, fast bulk requests
Dynamic	`DynamicFetcher` / `DynamicSession`	JS-rendered content, SPAs
Stealth	`StealthyFetcher` / `StealthySession`	Cloudflare, anti-bot protected sites
Spider	`Spider`	Multi-page crawling with link following

CLI Usage

Extract Static Page

bash

scrapling extract get 'https://example.com' output.md

With CSS selector and browser impersonation:

bash

scrapling extract get 'https://example.com' output.md \
  --css-selector '.content' \
  --impersonate 'chrome'

Extract JS-Rendered Page

bash

scrapling extract fetch 'https://example.com' output.md \
  --css-selector '.dynamic-content' \
  --disable-resources \
  --network-idle

Extract Cloudflare-Protected Page

bash

scrapling extract stealthy-fetch 'https://protected-site.com' output.html \
  --solve-cloudflare \
  --block-webrtc \
  --hide-canvas

POST Request

bash

scrapling extract post 'https://example.com/api' output.json \
  --json '{"query": "search term"}'

Output Formats

The output format is determined by the file extension:

.html -- raw HTML
.md -- converted to Markdown
.txt -- plain text
.json / .jsonl -- JSON

Python: HTTP Scraping

Single Request

python

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
for q in quotes:
    print(q)

Session (Persistent Cookies)

python

from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate='chrome') as session:
    page = session.get('https://example.com/', stealthy_headers=True)
    links = page.css('a::attr(href)').getall()
    for link in links[:5]:
        sub = session.get(link)
        print(sub.css('h1::text').get())

POST / PUT / DELETE

python

page = Fetcher.post('https://api.example.com/data', json={"key": "value"})
page = Fetcher.put('https://api.example.com/item/1', data={"name": "updated"})
page = Fetcher.delete('https://api.example.com/item/1')

With Proxy

python

page = Fetcher.get('https://example.com', proxy='http://user:pass@proxy:8080')

Python: Dynamic Pages (JS-Rendered)

For pages that require JavaScript execution (SPAs, lazy-loaded content):

python

from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch('https://example.com', headless=True)
data = page.css('.js-loaded-content::text').getall()

Wait for Specific Element

python

page = DynamicFetcher.fetch(
    'https://example.com',
    wait_selector=('.results', 'visible'),
    network_idle=True,
)

Disable Resources for Speed

Blocks fonts, images, media, stylesheets (~25% faster):

python

from scrapling.fetchers import DynamicSession

with DynamicSession(headless=True, disable_resources=True, network_idle=True) as session:
    page = session.fetch('https://example.com')
    items = page.css('.item::text').getall()

Custom Page Automation

python

from playwright.sync_api import Page
from scrapling.fetchers import DynamicFetcher

def scroll_and_click(page: Page):
    page.mouse.wheel(0, 3000)
    page.wait_for_timeout(1000)
    page.click('button.load-more')
    page.wait_for_selector('.extra-results')

page = DynamicFetcher.fetch('https://example.com', page_action=scroll_and_click)
results = page.css('.extra-results .item::text').getall()

Python: Stealth Mode (Anti-Bot Bypass)

For Cloudflare-protected or heavily fingerprinted sites:

python

from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    'https://protected-site.com',
    headless=True,
    solve_cloudflare=True,
    block_webrtc=True,
    hide_canvas=True,
)
content = page.css('.protected-content::text').getall()

Stealth Session

python

from scrapling.fetchers import StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:
    page1 = session.fetch('https://protected-site.com/page1')
    page2 = session.fetch('https://protected-site.com/page2')

Element Selection

All fetchers return a Selector object with these methods:

CSS Selectors

python

page.css('h1::text').get()              # First h1 text
page.css('a::attr(href)').getall()      # All link hrefs
page.css('.quote .text::text').getall() # Nested selection

XPath

python

page.xpath('//div[@class="content"]/text()').getall()
page.xpath('//a/@href').getall()

Find Methods

python

page.find_all('div', class_='quote')       # By tag + attribute
page.find_by_text('Read more', tag='a')    # By text content
page.find_by_regex(r'\$\d+\.\d{2}')       # By regex pattern

Similar Elements

Find elements with similar structure (useful for product listings, etc.):

python

first_product = page.css('.product')[0]
all_similar = first_product.find_similar()

Navigation

python

el = page.css('.target')[0]
el.parent                # Parent element
el.children              # Child elements
el.next_sibling          # Next sibling
el.prev_sibling          # Previous sibling

Python: Spider Framework

For multi-page crawling with link following:

python

from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10
    download_delay = 1

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
                "tags": quote.css('.tag::text').getall(),
            }

        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")

Multi-Session Spider

Route requests to different fetcher types:

python

from scrapling.fetchers import FetcherSession, AsyncStealthySession

class SmartSpider(Spider):
    name = "smart"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)

Pause/Resume Crawling

python

spider = QuotesSpider(crawldir="./crawl_checkpoint")
spider.start()  # Ctrl+C to pause, re-run to resume from checkpoint

Pitfalls

Browser install required: run scrapling install after pip install -- without it, DynamicFetcher and StealthyFetcher will fail
Timeouts: DynamicFetcher/StealthyFetcher timeout is in milliseconds (default 30000), Fetcher timeout is in seconds
Cloudflare bypass: solve_cloudflare=True adds 5-15 seconds to fetch time -- only enable when needed
Resource usage: StealthyFetcher runs a real browser -- limit concurrent usage
Legal: always check robots.txt and website ToS before scraping. This library is for educational and research purposes
Python version: requires Python 3.10+

Maintainer

NousResearch Core maintainer

Source details

Full Name: NousResearch/hermes-agent
Branch: main
Path in repo: optional-skills/research/scrapling
License: MIT License
Topics: ai claude-code anthropic claude ai-agents clawdbot llm openclaw ai-agent codex chatgpt moltbot openai hermes hermes-agent nous-research

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

NousResearch/hermes-agent

agentmail

Give the agent its own dedicated email inbox via AgentMail. Send, receive, and manage email autonomously using agent-owned email addresses (e.g. hermes-agent@agentmail.to).

56,643 7,481

Explore

NousResearch/hermes-agent

base

Query Base (Ethereum L2) blockchain data with USD pricing — wallet balances, token info, transaction details, gas analysis, contract inspection, whale detection, and live network stats. Uses Base RPC + CoinGecko. No API key required.

56,643 7,481

Explore

NousResearch/hermes-agent

solana

Query Solana blockchain data with USD pricing — wallet balances, token portfolios with values, transaction details, NFTs, whale detection, and live network stats. Uses Solana RPC + CoinGecko. No API key required.

56,643 7,481

Explore

NousResearch/hermes-agent

one-three-one-rule

Structured decision-making framework for technical proposals and trade-off analysis. When the user faces a choice between multiple approaches (architecture decisions, tool selection, refactoring strategies, migration paths), this skill produces a 1-3-1 format: one clear problem statement, three distinct options with pros/cons, and one concrete recommendation with definition of done and implementation plan. Use when the user asks for a "1-3-1", says "give me options", or needs help choosing between competing approaches.

56,643 7,481

Explore

NousResearch/hermes-agent

fastmcp

Build, test, inspect, install, and deploy MCP servers with FastMCP in Python. Use when creating a new MCP server, wrapping an API or database as MCP tools, exposing resources or prompts, or preparing a FastMCP server for Claude Code, Cursor, or HTTP deployment.

56,643 7,481

Explore

NousResearch/hermes-agent

qdrant-vector-search

High-performance vector similarity search engine for RAG and semantic search. Use when building production RAG systems requiring fast nearest neighbor search, hybrid search with filtering, or scalable vector storage with Rust-powered performance.

56,643 7,481

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

Metadata

SKILL.md

Scrapling

When to Use

Installation

Quick Reference

CLI Usage

Extract Static Page

Extract JS-Rendered Page

Extract Cloudflare-Protected Page

POST Request

Output Formats

Python: HTTP Scraping

Single Request

Session (Persistent Cookies)

POST / PUT / DELETE

With Proxy

Python: Dynamic Pages (JS-Rendered)

Wait for Specific Element

Disable Resources for Speed

Custom Page Automation

Python: Stealth Mode (Anti-Bot Bypass)

Stealth Session

Element Selection

CSS Selectors

XPath

Find Methods

Similar Elements

Navigation

Python: Spider Framework

Multi-Session Spider

Pause/Resume Crawling

Pitfalls

Recommended Agent Skills

agentmail

base

solana

one-three-one-rule

fastmcp

qdrant-vector-search