Agent skill

data-wrangler

Transform and export data using DuckDB SQL. Read CSV/Parquet/JSON/Excel/databases, apply SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and optionally write results to files. Use when the user wants to: (1) Clean, filter, or transform data, (2) Join multiple data sources, (3) Convert between formats (CSV→Parquet, etc.), (4) Create partitioned datasets, (5) Sample large datasets, (6) Export query results. Prefer this over in-context reasoning for datasets with thousands of rows or complex transformations.

Stars 2
Forks 0

Install this agent skill to your Project

npx add-skill https://github.com/richard-gyiko/data-wrangler-plugin/tree/main/skills/data-wrangler

SKILL.md

Data Wrangler

Transform and export data using DuckDB SQL.

Contents

  • Usage - Command syntax and Windows escaping
  • Explore Mode - Quick data profiling
  • Query Mode - Return results to Claude
  • Write Mode - Export to files
  • Request/Response Format - JSON structure
  • Source Types - File, database, and cloud sources
  • Transformations - SQL patterns reference
  • Secrets - Secure credential handling

Usage

IMPORTANT - Windows Shell Escaping:

  1. Always cd to the skill directory first
  2. Use double quotes for echo with escaped inner quotes (\")
  3. Use forward slashes in file paths
bash
cd "<skill_directory>" && echo "{\"query\": \"SELECT * FROM 'D:/path/to/file.csv'\"}" | uv run scripts/query_duckdb.py

Explore Mode

Get schema, statistics, and sample in one call. Use before writing queries to understand data structure.

json
{"mode": "explore", "path": "D:/data/sales.csv"}

Response:

json
{
  "file": "D:/data/sales.csv",
  "format": "csv",
  "row_count": 15234,
  "columns": [
    {"name": "order_id", "type": "BIGINT", "null_count": 0, "null_percent": 0.0},
    {"name": "customer", "type": "VARCHAR", "null_count": 45, "null_percent": 0.3}
  ],
  "sample": "| order_id | customer | ... |\\n|----------|----------|-----|\\n| 1001     | Alice    | ... |"
}

Options:

  • sample_rows: Number of sample rows (default: 10, max: 100)
  • sources: For database tables (same as query mode)

Query Mode

Return results directly to Claude for analysis.

Direct File Queries

json
{"query": "SELECT * FROM 'data.csv' LIMIT 10"}

Multi-Source Joins

json
{
  "query": "SELECT s.*, p.category FROM sales s JOIN products p ON s.product_id = p.id",
  "sources": [
    {"type": "file", "alias": "sales", "path": "/data/sales.parquet"},
    {"type": "file", "alias": "products", "path": "/data/products.csv"}
  ]
}

Write Mode

Export query results to files. Add an output object to write instead of returning data.

Basic Write

json
{
  "query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
  "output": {
    "path": "D:/output/filtered.parquet",
    "format": "parquet"
  }
}

Write with Options

json
{
  "query": "SELECT *, YEAR(date) as year, MONTH(date) as month FROM 'events.csv'",
  "output": {
    "path": "D:/output/events/",
    "format": "parquet",
    "options": {
      "compression": "zstd",
      "partition_by": ["year", "month"],
      "overwrite": true
    }
  }
}

Output Formats

Format Options
parquet compression (zstd/snappy/gzip/lz4), partition_by, row_group_size
csv header (default: true), delimiter, compression, partition_by
json array (true=JSON array, false=newline-delimited)

Write Response

Response includes verification info - no need for follow-up queries:

json
{
  "success": true,
  "output_path": "D:/output/events/",
  "format": "parquet",
  "rows_written": 15234,
  "files_created": ["D:/output/events/year=2023/data_0.parquet", "..."],
  "total_size_bytes": 5678901,
  "duration_ms": 1234
}

Overwrite Protection

By default, existing files are not overwritten. Set options.overwrite: true to allow.

Request/Response Format

Request

json
{
  "query": "SQL statement",
  "sources": [...],
  "output": {"path": "...", "format": "..."},
  "options": {"max_rows": 200, "format": "markdown"},
  "secrets_file": "path/to/secrets.yaml"
}

Query Mode Options

  • max_rows: Maximum rows to return (default: 200)
  • max_bytes: Maximum response size (default: 200000)
  • format: markdown (default), json, records, or csv

Query Mode Response (markdown)

| column1 | column2 |
|---|---|
| value1 | value2 |

Query Mode Response (json)

json
{
  "schema": [{"name": "col1", "type": "INTEGER"}],
  "rows": [[1, "value"]],
  "truncated": false,
  "warnings": [],
  "error": null
}

Source Types

File (auto-detects CSV, Parquet, JSON, Excel)

json
{"type": "file", "alias": "data", "path": "/path/to/file.csv"}

Glob patterns: {"path": "/logs/**/*.parquet"}

Custom delimiter: {"path": "/data/file.csv", "delimiter": "|"}

PostgreSQL

json
{
  "type": "postgres", "alias": "users",
  "host": "host", "port": 5432, "database": "db",
  "user": "user", "password": "pass",
  "schema": "public", "table": "users"
}

MySQL

json
{
  "type": "mysql", "alias": "orders",
  "host": "host", "port": 3306, "database": "db",
  "user": "user", "password": "pass", "table": "orders"
}

SQLite

json
{"type": "sqlite", "alias": "data", "path": "/path/to/db.sqlite", "table": "tablename"}

S3

json
{
  "type": "s3", "alias": "logs",
  "url": "s3://bucket/path/*.parquet",
  "aws_region": "us-east-1",
  "aws_access_key_id": "...", "aws_secret_access_key": "..."
}

Transformations

See TRANSFORMS.md for advanced patterns including:

  • PIVOT/UNPIVOT - Reshape data between wide and long formats
  • Sampling - Random subsets with USING SAMPLE n ROWS or SAMPLE 10%
  • Dynamic columns - EXCLUDE, REPLACE, COLUMNS('pattern')
  • Window functions - Running totals, rankings, moving averages
  • Date/time operations - Extraction, arithmetic, formatting

Quick Examples

sql
-- PIVOT: Convert rows to columns
PIVOT sales ON quarter USING SUM(revenue) GROUP BY region

-- UNPIVOT: Convert columns to rows
UNPIVOT data ON q1, q2, q3, q4 INTO NAME quarter VALUE amount

-- Sampling: Random 10% with reproducible seed
SELECT * FROM large_table USING SAMPLE 10% REPEATABLE(42)

-- Dynamic columns: Exclude sensitive, transform email
SELECT * EXCLUDE (ssn) REPLACE (LOWER(email) AS email) FROM users

Workflow

  1. Inspect schema: DESCRIBE SELECT * FROM 'file.csv'
  2. Preview data: SELECT * FROM 'file.csv' LIMIT 5
  3. Transform: Apply filters, joins, aggregations
  4. Export (optional): Add output to write results

Error Handling

  • If error is non-null: Check column names, verify paths
  • If truncated is true: Use more aggregation or filters
  • If write fails with "exists": Set options.overwrite: true

Secrets

Store credentials securely in YAML. See SECRETS.md for complete documentation.

json
{
  "query": "SELECT * FROM customers LIMIT 10",
  "secrets_file": "D:/path/to/secrets.yaml",
  "sources": [{
    "type": "postgres", "alias": "customers",
    "secret": "my_postgres", "table": "customers"
  }]
}

Supported: PostgreSQL, MySQL, S3, GCS, Azure, R2, HTTP, HuggingFace, Iceberg, DuckLake.

Expand your agent's capabilities with these related and highly-rated skills.

davila7/claude-code-templates

verl-rl-training

Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.

23,776 2,298
Explore
davila7/claude-code-templates

openrlhf-training

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

23,776 2,298
Explore
davila7/claude-code-templates

gguf-quantization

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

23,776 2,298
Explore
davila7/claude-code-templates

Claude Code Guide

Master guide for using Claude Code effectively. Includes configuration templates, prompting strategies "Thinking" keywords, debugging techniques, and best practices for interacting with the agent.

23,776 2,298
Explore
davila7/claude-code-templates

qdrant-vector-search

High-performance vector similarity search engine for RAG and semantic search. Use when building production RAG systems requiring fast nearest neighbor search, hybrid search with filtering, or scalable vector storage with Rust-powered performance.

23,776 2,298
Explore
davila7/claude-code-templates

behavioral-modes

AI operational modes (brainstorm, implement, debug, review, teach, ship, orchestrate). Use to adapt behavior based on task type.

23,776 2,298
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results