Agent skill

data-wrangler

Transform and export data using DuckDB SQL. Read CSV/Parquet/JSON/Excel/databases, apply SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and optionally write results to files. Use when the user wants to: (1) Clean, filter, or transform data, (2) Join multiple data sources, (3) Convert between formats (CSV→Parquet, etc.), (4) Create partitioned datasets, (5) Sample large datasets, (6) Export query results. Prefer this over in-context reasoning for datasets with thousands of rows or complex transformations.

View SKILL.md on GitHub Repository

Stars 2

Forks 0

Install this agent skill to your Project

npx add-skill https://github.com/richard-gyiko/data-wrangler-plugin/tree/main/skills/data-wrangler

SKILL.md

Data Wrangler

Transform and export data using DuckDB SQL.

Usage - Command syntax and Windows escaping
Explore Mode - Quick data profiling
Query Mode - Return results to Claude
Write Mode - Export to files
Request/Response Format - JSON structure
Source Types - File, database, and cloud sources
Transformations - SQL patterns reference
Secrets - Secure credential handling

Usage

IMPORTANT - Windows Shell Escaping:

Always cd to the skill directory first
Use double quotes for echo with escaped inner quotes (\")
Use forward slashes in file paths

bash

cd "<skill_directory>" && echo "{\"query\": \"SELECT * FROM 'D:/path/to/file.csv'\"}" | uv run scripts/query_duckdb.py

Explore Mode

Get schema, statistics, and sample in one call. Use before writing queries to understand data structure.

json

{"mode": "explore", "path": "D:/data/sales.csv"}

Response:

json

{
  "file": "D:/data/sales.csv",
  "format": "csv",
  "row_count": 15234,
  "columns": [
    {"name": "order_id", "type": "BIGINT", "null_count": 0, "null_percent": 0.0},
    {"name": "customer", "type": "VARCHAR", "null_count": 45, "null_percent": 0.3}
  ],
  "sample": "| order_id | customer | ... |\\n|----------|----------|-----|\\n| 1001     | Alice    | ... |"
}

Options:

sample_rows: Number of sample rows (default: 10, max: 100)
sources: For database tables (same as query mode)

Query Mode

Return results directly to Claude for analysis.

Direct File Queries

json

{"query": "SELECT * FROM 'data.csv' LIMIT 10"}

Multi-Source Joins

json

{
  "query": "SELECT s.*, p.category FROM sales s JOIN products p ON s.product_id = p.id",
  "sources": [
    {"type": "file", "alias": "sales", "path": "/data/sales.parquet"},
    {"type": "file", "alias": "products", "path": "/data/products.csv"}
  ]
}

Write Mode

Export query results to files. Add an output object to write instead of returning data.

Basic Write

json

{
  "query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
  "output": {
    "path": "D:/output/filtered.parquet",
    "format": "parquet"
  }
}

Write with Options

json

{
  "query": "SELECT *, YEAR(date) as year, MONTH(date) as month FROM 'events.csv'",
  "output": {
    "path": "D:/output/events/",
    "format": "parquet",
    "options": {
      "compression": "zstd",
      "partition_by": ["year", "month"],
      "overwrite": true
    }
  }
}

Output Formats

Format	Options
`parquet`	`compression` (zstd/snappy/gzip/lz4), `partition_by`, `row_group_size`
`csv`	`header` (default: true), `delimiter`, `compression`, `partition_by`
`json`	`array` (true=JSON array, false=newline-delimited)

Write Response

Response includes verification info - no need for follow-up queries:

json

{
  "success": true,
  "output_path": "D:/output/events/",
  "format": "parquet",
  "rows_written": 15234,
  "files_created": ["D:/output/events/year=2023/data_0.parquet", "..."],
  "total_size_bytes": 5678901,
  "duration_ms": 1234
}

Overwrite Protection

By default, existing files are not overwritten. Set options.overwrite: true to allow.

Request/Response Format

Request

json

{
  "query": "SQL statement",
  "sources": [...],
  "output": {"path": "...", "format": "..."},
  "options": {"max_rows": 200, "format": "markdown"},
  "secrets_file": "path/to/secrets.yaml"
}

Query Mode Options

max_rows: Maximum rows to return (default: 200)
max_bytes: Maximum response size (default: 200000)
format: markdown (default), json, records, or csv

Query Mode Response (markdown)

| column1 | column2 |
|---|---|
| value1 | value2 |

Query Mode Response (json)

json

{
  "schema": [{"name": "col1", "type": "INTEGER"}],
  "rows": [[1, "value"]],
  "truncated": false,
  "warnings": [],
  "error": null
}

Source Types

File (auto-detects CSV, Parquet, JSON, Excel)

json

{"type": "file", "alias": "data", "path": "/path/to/file.csv"}

Glob patterns: {"path": "/logs/**/*.parquet"}

Custom delimiter: {"path": "/data/file.csv", "delimiter": "|"}

PostgreSQL

json

{
  "type": "postgres", "alias": "users",
  "host": "host", "port": 5432, "database": "db",
  "user": "user", "password": "pass",
  "schema": "public", "table": "users"
}

MySQL

json

{
  "type": "mysql", "alias": "orders",
  "host": "host", "port": 3306, "database": "db",
  "user": "user", "password": "pass", "table": "orders"
}

SQLite

json

{"type": "sqlite", "alias": "data", "path": "/path/to/db.sqlite", "table": "tablename"}

S3

json

{
  "type": "s3", "alias": "logs",
  "url": "s3://bucket/path/*.parquet",
  "aws_region": "us-east-1",
  "aws_access_key_id": "...", "aws_secret_access_key": "..."
}

Transformations

See TRANSFORMS.md for advanced patterns including:

PIVOT/UNPIVOT - Reshape data between wide and long formats
Sampling - Random subsets with USING SAMPLE n ROWS or SAMPLE 10%
Dynamic columns - EXCLUDE, REPLACE, COLUMNS('pattern')
Window functions - Running totals, rankings, moving averages
Date/time operations - Extraction, arithmetic, formatting

Quick Examples

sql

-- PIVOT: Convert rows to columns
PIVOT sales ON quarter USING SUM(revenue) GROUP BY region

-- UNPIVOT: Convert columns to rows
UNPIVOT data ON q1, q2, q3, q4 INTO NAME quarter VALUE amount

-- Sampling: Random 10% with reproducible seed
SELECT * FROM large_table USING SAMPLE 10% REPEATABLE(42)

-- Dynamic columns: Exclude sensitive, transform email
SELECT * EXCLUDE (ssn) REPLACE (LOWER(email) AS email) FROM users

Workflow

Inspect schema: DESCRIBE SELECT * FROM 'file.csv'
Preview data: SELECT * FROM 'file.csv' LIMIT 5
Transform: Apply filters, joins, aggregations
Export (optional): Add output to write results

Error Handling

If error is non-null: Check column names, verify paths
If truncated is true: Use more aggregation or filters
If write fails with "exists": Set options.overwrite: true

Secrets

Store credentials securely in YAML. See SECRETS.md for complete documentation.

json

{
  "query": "SELECT * FROM customers LIMIT 10",
  "secrets_file": "D:/path/to/secrets.yaml",
  "sources": [{
    "type": "postgres", "alias": "customers",
    "secret": "my_postgres", "table": "customers"
  }]
}

Supported: PostgreSQL, MySQL, S3, GCS, Azure, R2, HTTP, HuggingFace, Iceberg, DuckLake.

Maintainer

richard-gyiko Core maintainer

Source details

Full Name: richard-gyiko/data-wrangler-plugin
Branch: main
Path in repo: skills/data-wrangler
License: MIT License
Topics: claude-code python claude-code-plugin claude-code-skill data-analysis analytics sql csv duckdb parquet

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

davila7/claude-code-templates

verl-rl-training

Provides guidance for training LLMs with reinforcement learning using verl (Volcano Engine RL). Use when implementing RLHF, GRPO, PPO, or other RL algorithms for LLM post-training at scale with flexible infrastructure backends.

23,776 2,298

Explore

davila7/claude-code-templates

openrlhf-training

High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, DPO training of large models (7B-70B+). Built on Ray, vLLM, ZeRO-3. 2× faster than DeepSpeedChat with distributed architecture and GPU resource sharing.

23,776 2,298

Explore

davila7/claude-code-templates

gguf-quantization

GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use when deploying models on consumer hardware, Apple Silicon, or when needing flexible quantization from 2-8 bit without GPU requirements.

23,776 2,298

Explore

davila7/claude-code-templates

Claude Code Guide

Master guide for using Claude Code effectively. Includes configuration templates, prompting strategies "Thinking" keywords, debugging techniques, and best practices for interacting with the agent.

23,776 2,298

Explore

davila7/claude-code-templates

qdrant-vector-search

High-performance vector similarity search engine for RAG and semantic search. Use when building production RAG systems requiring fast nearest neighbor search, hybrid search with filtering, or scalable vector storage with Rust-powered performance.

23,776 2,298

Explore

davila7/claude-code-templates

behavioral-modes

AI operational modes (brainstorm, implement, debug, review, teach, ship, orchestrate). Use to adapt behavior based on task type.

23,776 2,298

Explore

Didn't find tool you were looking for?

Search AI Tools

data-wrangler

Install this agent skill to your Project

SKILL.md

Data Wrangler

Contents

Usage

Explore Mode

Query Mode

Direct File Queries

Multi-Source Joins

Write Mode

Basic Write

Write with Options

Output Formats

Write Response

Overwrite Protection

Request/Response Format

Request

Query Mode Options

Query Mode Response (markdown)

Query Mode Response (json)

Source Types

File (auto-detects CSV, Parquet, JSON, Excel)

PostgreSQL

MySQL

SQLite

S3

Transformations

Quick Examples

Workflow

Error Handling

Secrets

Recommended Agent Skills

verl-rl-training

openrlhf-training

gguf-quantization

Claude Code Guide

qdrant-vector-search

behavioral-modes