Agent skill

doc-to-vector-dataset-generator

Converts documents into clean, chunked datasets suitable for embeddings and vector search. Produces chunked JSONL files with metadata, deduplication logic, and quality checks. Use when preparing "training data", "vector datasets", "document processing", or "embedding data".

Install this agent skill to your Project

npx add-skill https://github.com/patricio0312rev/skills/tree/main/ai-engineering/doc-to-vector-dataset-generator

SKILL.md

Doc-to-Vector Dataset Generator

Transform documents into high-quality vector search datasets.

Pipeline Steps

Extract text from various formats (PDF, DOCX, HTML)
Clean text (remove noise, normalize)
Chunk strategically (semantic boundaries)
Add metadata (source, timestamps, classification)
Deduplicate (near-duplicate detection)
Quality check (length, content validation)
Export JSONL (one chunk per line)

Text Extraction

python

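# PDF extraction (requires the PyMuPDF package)
import pymupdf

def extract_pdf(filepath: str) -> str:
    # The context manager closes the document once extraction finishes
    with pymupdf.open(filepath) as doc:
        # join() avoids repeated string concatenation on large documents
        return "".join(page.get_text() for page in doc)

# Markdown extraction
def extract_markdown(filepath: str) -> str:
    # Read as UTF-8 explicitly so results do not depend on the platform default
    with open(filepath, encoding="utf-8") as f:
        return f.read()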
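
The pipeline steps list HTML as an input format, but no HTML extractor is shown. The following is a minimal sketch using only the standard library's `html.parser`; the names `html_to_text` and `extract_html`, and the choice of which tags count as paragraph breaks, are assumptions made for illustration and are not part of the original skill.

```python
from html.parser import HTMLParser
from typing import List

# Assumed set of block-level tags whose end marks a paragraph break
_BLOCK_TAGS = {"p", "div", "li", "tr", "blockquote", "pre",
               "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose contents are never visible text
_SKIP_TAGS = {"script", "style"}


class _TextExtractor(HTMLParser):
    """Collects visible text and inserts a blank line after block elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in _SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in _SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in _BLOCK_TAGS:
            self._parts.append("\n\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        raw = "".join(self._parts)
        # Normalize whitespace inside each paragraph and drop empty ones
        paragraphs = (" ".join(p.split()) for p in raw.split("\n\n"))
        return "\n\n".join(p for p in paragraphs if p)


def html_to_text(html: str) -> str:
    """Convert an HTML string to plain text with '\\n\\n' between blocks."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def extract_html(filepath: str) -> str:
    """Hypothetical counterpart to extract_pdf / extract_markdown."""
    with open(filepath, encoding="utf-8") as f:
        return html_to_text(f.read())
```

Emitting a blank line after each block element keeps paragraph boundaries available for the chunking step. Pages with heavy navigation or boilerplate would likely need extra filtering.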

Text Cleaning

python

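import re
import unicodedata

def clean_text(text: str) -> str:
    # Normalize unicode (NFKC folds ligatures, full-width forms, etc.)
    text = unicodedata.normalize('NFKC', text)

    # Remove page numbers that sit on their own line
    text = re.sub(r'(?m)^[ \t]*Page \d+[ \t]*$', '', text)

    # Remove URLs (optional)
    text = re.sub(r'https?://\S+', '', text)

    # Collapse runs of spaces and tabs, but keep newlines
    text = re.sub(r'[ \t]+', ' ', text)

    # Reduce any run of blank lines to one paragraph break; collapsing all
    # whitespace would erase the '\n\n' boundaries that chunking splits on
    text = re.sub(r'\s*\n\s*\n\s*', '\n\n', text)

    return text.strip()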

Semantic Chunking

python

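import re
from typing import List

def semantic_chunk(text: str, max_chunk_size: int = 1000) -> List[str]:
    """Chunk at semantic boundaries (paragraphs, then sentences)"""
    # Split by paragraphs first; break oversized paragraphs into sentences
    units = []
    for para in text.split('\n\n'):
        if len(para) > max_chunk_size:
            units.extend(re.split(r'(?<=[.!?])\s+', para))
        else:
            units.append(para)

    chunks = []
    current_chunk = ""

    for unit in units:
        # Skip empty paragraphs so they never produce empty chunks
        if not unit.strip():
            continue
        if len(current_chunk) + len(unit) <= max_chunk_size:
            current_chunk += unit + "\n\n"
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            # A single sentence longer than max_chunk_size is kept whole
            current_chunk = unit + "\n\n"

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks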

Metadata Extraction

python

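import hashlib
import re
from datetime import datetime, timezone

def extract_metadata(filepath: str, chunk: str, chunk_idx: int) -> dict:
    # Built-in hash() is salted per process for strings, so IDs would change
    # between runs; a truncated SHA-256 of the path is stable
    source_hash = hashlib.sha256(filepath.encode('utf-8')).hexdigest()[:12]

    return {
        "source": filepath,
        "chunk_id": f"{source_hash}_{chunk_idx}",
        "chunk_index": chunk_idx,
        "char_count": len(chunk),
        "word_count": len(chunk.split()),
        # Timezone-aware UTC timestamp
        "created_at": datetime.now(timezone.utc).isoformat(),

        # Content classification
        "has_code": bool(re.search(r'```|def |class |function', chunk)),
        "has_table": bool(re.search(r'\|.*\|', chunk)),
        # detect_language must be provided separately (it is not a built-in)
        "language": detect_language(chunk),
    }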
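
`extract_metadata` calls `detect_language`, which the skill does not define. As a placeholder, a crude stop-word heuristic might look like the sketch below. The word list, the `min_ratio` threshold, and the `"unknown"` return value are all assumptions made for illustration; a dedicated language-identification library would be more robust for real datasets.

```python
import re

# Hypothetical, deliberately small list of common English function words
_ENGLISH_STOPWORDS = frozenset({
    "the", "and", "of", "to", "in", "is", "that", "for",
    "with", "this", "are", "it", "on", "as", "be",
})


def detect_language(text: str, min_ratio: float = 0.1) -> str:
    """Return "en" if enough tokens are common English words, else "unknown"."""
    # Tokenize on ASCII letters only; accented words split into fragments,
    # which is acceptable for this rough English-vs-other check
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return "unknown"
    hits = sum(token in _ENGLISH_STOPWORDS for token in tokens)
    return "en" if hits / len(tokens) >= min_ratio else "unknown"
```

Because the quality check only distinguishes `"en"` from everything else, a binary heuristic is enough to make the pipeline run end to end. It will misclassify short or stop-word-free English chunks.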

Deduplication

python

from typing import List

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate_chunks(chunks: List[dict], threshold: float = 0.95) -> List[dict]:
    """Remove near-duplicate chunks, keeping the first occurrence"""
    # Nothing to compare; this also avoids fitting TF-IDF on an empty corpus
    if len(chunks) < 2:
        return list(chunks)

    texts = [c["text"] for c in chunks]

    # Compute TF-IDF vectors
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(texts)

    # Compute pairwise similarity (dense n x n matrix, so memory grows
    # quadratically with the number of chunks)
    similarity_matrix = cosine_similarity(vectors)

    # Find duplicates
    to_remove = set()
    for i in range(len(chunks)):
        if i in to_remove:
            continue
        for j in range(i + 1, len(chunks)):
            if similarity_matrix[i, j] > threshold:
                to_remove.add(j)

    # Return unique chunks
    return [c for i, c in enumerate(chunks) if i not in to_remove]
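
Because the TF-IDF comparison above is pairwise, it can help to drop exact duplicates cheaply before running it. The sketch below is one way to do that with a hash of the normalized text; the function name `drop_exact_duplicates` and the normalization (lowercasing and whitespace collapsing) are assumptions, not part of the original skill.

```python
import hashlib
from typing import List


def drop_exact_duplicates(chunks: List[dict]) -> List[dict]:
    """Keep the first chunk for each distinct normalized text."""
    seen = set()
    unique = []
    for chunk in chunks:
        # Lowercase and collapse whitespace so trivial variations still match
        normalized = " ".join(chunk["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

This pass is linear in the number of chunks, so it reduces the input to the quadratic near-duplicate step without changing which chunk of a duplicate group is kept (the first).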

Quality Checks

python

def quality_check(chunk: dict) -> bool:
    """Validate chunk quality"""
    text = chunk["text"]

    # Min length check
    if len(text) < 50:
        return False

    # Max length check
    if len(text) > 5000:
        return False

    # Content check (not just numbers/symbols)
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    if alpha_ratio < 0.5:
        return False

    # Language check (English only)
    if chunk["metadata"]["language"] != "en":
        return False

    return True

JSONL Export

python

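import json
from typing import List

def export_jsonl(chunks: List[dict], output_path: str) -> None:
    """Export chunks as JSONL (one JSON object per line)"""
    # UTF-8 with ensure_ascii=False keeps non-ASCII text readable in the file
    with open(output_path, 'w', encoding='utf-8') as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + '\n')

# Example output format (pretty-printed here; each record occupies one line in the file)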
{
  "text": "Chunk text content here...",
  "metadata": {
    "source": "docs/auth.md",
    "chunk_id": "abc123_0",
    "chunk_index": 0,
    "char_count": 542,
    "word_count": 89,
    "has_code": true
  }
}
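
Since each record sits on its own line, a consumer can read the dataset back lazily instead of loading the whole file. A minimal reader sketch, assuming a hypothetical helper named `iter_jsonl` that is not part of the original skill:

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a corrupt record is easy to find
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
```

For example, `for record in iter_jsonl("dataset.jsonl"): ...` would process chunks one at a time, where `dataset.jsonl` is a placeholder path.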

Complete Pipeline

python

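from glob import glob

def process_documents(input_dir: str, output_path: str):
    all_chunks = []

    # Process each document ('**' only descends into subdirectories with recursive=True)
    for filepath in sorted(glob(f"{input_dir}/**/*.md", recursive=True)):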
        # Extract and clean
        text = extract_markdown(filepath)
        text = clean_text(text)

        # Chunk
        chunks = semantic_chunk(text)

        # Add metadata
        for i, chunk in enumerate(chunks):
            chunk_obj = {
                "text": chunk,
                "metadata": extract_metadata(filepath, chunk, i)
            }

            # Quality check
            if quality_check(chunk_obj):
                all_chunks.append(chunk_obj)

    # Deduplicate
    unique_chunks = deduplicate_chunks(all_chunks)

    # Export
    export_jsonl(unique_chunks, output_path)

    print(f"Processed {len(unique_chunks)} chunks")

Best Practices

Chunk at semantic boundaries
Rich metadata for filtering
Deduplicate aggressively
Quality checks prevent garbage
JSONL format for streaming
Version your datasets
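
One simple way to version a dataset is to derive an identifier from the exported file's contents, so that any change to the chunks yields a new version. The sketch below assumes a hypothetical helper `dataset_fingerprint` and a 12-character truncated SHA-256 digest; neither comes from the original skill.

```python
import hashlib


def dataset_fingerprint(path: str, length: int = 12) -> str:
    """Return a short content hash of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in blocks so large datasets are not loaded into memory at once
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()[:length]
```

The fingerprint could be stored next to the dataset or embedded in its filename. Note that a per-chunk timestamp such as `created_at` changes on every run, so identical inputs will still produce different fingerprints unless that field is excluded.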

Output Checklist

Text extraction from all formats
Cleaning pipeline implemented
Semantic chunking strategy
Metadata schema defined
Deduplication logic
Quality validation checks
JSONL export format
Dataset statistics logged
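
The checklist asks for dataset statistics, but the pipeline only prints a chunk count. A slightly richer summary could be computed as in the sketch below, which assumes the chunk shape used in this document (`text` plus `metadata["source"]`); the function name `dataset_stats` and the chosen fields are illustrative assumptions.

```python
from statistics import mean
from typing import List


def dataset_stats(chunks: List[dict]) -> dict:
    """Summarize chunk count, source count, and chunk length distribution."""
    if not chunks:
        return {"chunk_count": 0, "source_count": 0}

    lengths = [len(c["text"]) for c in chunks]
    sources = {c["metadata"]["source"] for c in chunks}
    return {
        "chunk_count": len(chunks),
        "source_count": len(sources),
        "min_chars": min(lengths),
        "max_chars": max(lengths),
        "mean_chars": round(mean(lengths), 1),
    }
```

Logging this dictionary after export gives a quick check for runs where chunking or filtering behaved unexpectedly, such as a sudden drop in chunk count.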

Maintainer

patricio0312rev Core maintainer

Source details

Full Name: patricio0312rev/skills
Branch: main
Path in repo: ai-engineering/doc-to-vector-dataset-generator
License: MIT License
Topics: ai claude-code claude cursor skills copilot-coding-agent cursor-ai
