Agent skill
ocr-and-documents
Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill.
Install this agent skill to your Project
npx add-skill https://github.com/NousResearch/hermes-agent/tree/main/skills/productivity/ocr-and-documents
Metadata
Additional technical details for this skill
- hermes
-
{ "tags": [ "PDF", "Documents", "Research", "Arxiv", "Text-Extraction", "OCR" ], "related_skills": [ "powerpoint" ] }
SKILL.md
PDF & Document Extraction
For DOCX: use python-docx (parses actual document structure, far better than OCR).
For PPTX: see the powerpoint skill (uses python-pptx with full slide/notes support).
This skill covers PDFs and scanned documents.
Step 1: Remote URL Available?
If the document has a URL, always try web_extract first:
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.
Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
Step 2: Choose Local Extractor
| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |
Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
If the user needs marker capabilities but the system lacks ~5GB free disk:
"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."
pymupdf (lightweight)
pip install pymupdf pymupdf4llm
Via helper script:
python scripts/extract_pymupdf.py document.pdf # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown # Markdown
python scripts/extract_pymupdf.py document.pdf --tables # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4 # Specific pages
Inline:
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
print(page.get_text())
"
marker-pdf (high-quality OCR)
# Check disk space first
python scripts/extract_marker.py --check
pip install marker-pdf
Via helper script:
python scripts/extract_marker.py document.pdf # Markdown
python scripts/extract_marker.py document.pdf --json # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/ # Save images
python scripts/extract_marker.py scanned.pdf # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm # LLM-boosted accuracy
CLI (installed with marker-pdf):
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4 # Batch
Arxiv Papers
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])
# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
Split, Merge & Search
pymupdf handles these natively — use execute_code or inline Python:
# Split: extract pages 1-5 to a new PDF
import pymupdf
doc = pymupdf.open("report.pdf")
new = pymupdf.open()
for i in range(5):
new.insert_pdf(doc, from_page=i, to_page=i)
new.save("pages_1-5.pdf")
# Merge multiple PDFs
import pymupdf
result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
result.insert_pdf(pymupdf.open(path))
result.save("merged.pdf")
# Search for text across all pages
import pymupdf
doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
results = page.search_for("revenue")
if results:
print(f"Page {i+1}: {len(results)} match(es)")
print(page.get_text("text"))
No extra dependencies needed — pymupdf covers split, merge, search, and text extraction in one package.
Notes
web_extractis always first choice for URLs- pymupdf is the safe default — instant, no models, works everywhere
- marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed
- Both helper scripts accept
--helpfor full usage - marker-pdf downloads ~2.5GB of models to
~/.cache/huggingface/on first use - For Word docs:
pip install python-docx(better than OCR — parses actual structure) - For PowerPoint: see the
powerpointskill (uses python-pptx)
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
agentmail
Give the agent its own dedicated email inbox via AgentMail. Send, receive, and manage email autonomously using agent-owned email addresses (e.g. hermes-agent@agentmail.to).
base
Query Base (Ethereum L2) blockchain data with USD pricing — wallet balances, token info, transaction details, gas analysis, contract inspection, whale detection, and live network stats. Uses Base RPC + CoinGecko. No API key required.
solana
Query Solana blockchain data with USD pricing — wallet balances, token portfolios with values, transaction details, NFTs, whale detection, and live network stats. Uses Solana RPC + CoinGecko. No API key required.
one-three-one-rule
Structured decision-making framework for technical proposals and trade-off analysis. When the user faces a choice between multiple approaches (architecture decisions, tool selection, refactoring strategies, migration paths), this skill produces a 1-3-1 format: one clear problem statement, three distinct options with pros/cons, and one concrete recommendation with definition of done and implementation plan. Use when the user asks for a "1-3-1", says "give me options", or needs help choosing between competing approaches.
fastmcp
Build, test, inspect, install, and deploy MCP servers with FastMCP in Python. Use when creating a new MCP server, wrapping an API or database as MCP tools, exposing resources or prompts, or preparing a FastMCP server for Claude Code, Cursor, or HTTP deployment.
qdrant-vector-search
High-performance vector similarity search engine for RAG and semantic search. Use when building production RAG systems requiring fast nearest neighbor search, hybrid search with filtering, or scalable vector storage with Rust-powered performance.
Didn't find tool you were looking for?