Agent skill

nemo-curator

GPU-accelerated data curation for LLM training. Supports text/image/video/audio. Features fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, NSFW detection. Scales across GPUs with RAPIDS. Use for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.

Stars 56,643
Forks 7,481

Install this agent skill to your Project

npx add-skill https://github.com/NousResearch/hermes-agent/tree/main/optional-skills/mlops/nemo-curator

Metadata

Additional technical details for this skill

hermes
{
    "tags": [
        "Data Processing",
        "NeMo Curator",
        "Data Curation",
        "GPU Acceleration",
        "Deduplication",
        "Quality Filtering",
        "NVIDIA",
        "RAPIDS",
        "PII Redaction",
        "Multimodal",
        "LLM Training Data"
    ]
}

SKILL.md

NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.

When to use NeMo Curator

Use NeMo Curator when:

  • Preparing LLM training data from web scrapes (Common Crawl)
  • Need fast deduplication (16× faster than CPU)
  • Curating multi-modal datasets (text, images, video, audio)
  • Filtering low-quality or toxic content
  • Scaling data processing across GPU cluster

Performance:

  • 16× faster fuzzy deduplication (8TB RedPajama v2)
  • 40% lower TCO vs CPU alternatives
  • Near-linear scaling across GPU nodes

Use alternatives instead:

  • datatrove: CPU-based, open-source data processing
  • dolma: Allen AI's data toolkit
  • Ray Data: General ML data processing (no curation focus)

Quick start

Installation

bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"

Basic text curation pipeline

python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.datasets import DocumentDataset
import pandas as pd

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset(df)

# Quality filtering
def quality_score(doc):
    return len(doc["text"].split()) > 5  # Filter short docs

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
from nemo_curator.modules import ExactDuplicates
deduped = ExactDuplicates()(filtered)

# Save
deduped.to_parquet("curated_data/")

Data curation pipeline

Stage 1: Quality filtering

python
from nemo_curator.filters import (
    WordCountFilter,
    RepeatedLinesFilter,
    UrlRatioFilter,
    NonAlphaNumericFilter
)

# Apply 30+ heuristic filters
from nemo_curator import ScoreFilter

# Word count filter
dataset = dataset.filter(WordCountFilter(min_words=50, max_words=100000))

# Remove repetitive content
dataset = dataset.filter(RepeatedLinesFilter(max_repeated_line_fraction=0.3))

# URL ratio filter
dataset = dataset.filter(UrlRatioFilter(max_url_ratio=0.2))

Stage 2: Deduplication

Exact deduplication:

python
from nemo_curator.modules import ExactDuplicates

# Remove exact duplicates
deduped = ExactDuplicates(id_field="id", text_field="text")(dataset)

Fuzzy deduplication (16× faster on GPU):

python
from nemo_curator.modules import FuzzyDuplicates

# MinHash + LSH deduplication
fuzzy_dedup = FuzzyDuplicates(
    id_field="id",
    text_field="text",
    num_hashes=260,      # MinHash parameters
    num_buckets=20,
    hash_method="md5"
)

deduped = fuzzy_dedup(dataset)

Semantic deduplication:

python
from nemo_curator.modules import SemanticDuplicates

# Embedding-based deduplication
semantic_dedup = SemanticDuplicates(
    id_field="id",
    text_field="text",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.8  # Cosine similarity threshold
)

deduped = semantic_dedup(dataset)

Stage 3: PII redaction

python
from nemo_curator.modules import Modify
from nemo_curator.modifiers import PIIRedactor

# Redact personally identifiable information
pii_redactor = PIIRedactor(
    supported_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON", "LOCATION"],
    anonymize_action="replace"  # or "redact"
)

redacted = Modify(pii_redactor)(dataset)

Stage 4: Classifier filtering

python
from nemo_curator.classifiers import QualityClassifier

# Quality classification
quality_clf = QualityClassifier(
    model_path="nvidia/quality-classifier-deberta",
    batch_size=256,
    device="cuda"
)

# Filter low-quality documents
high_quality = dataset.filter(lambda doc: quality_clf(doc["text"]) > 0.5)

GPU acceleration

GPU vs CPU performance

Operation CPU (16 cores) GPU (A100) Speedup
Fuzzy dedup (8TB) 120 hours 7.5 hours 16×
Exact dedup (1TB) 8 hours 0.5 hours 16×
Quality filtering 2 hours 0.2 hours 10×

Multi-GPU scaling

python
from nemo_curator import get_client
import dask_cuda

# Initialize GPU cluster
client = get_client(cluster_type="gpu", n_workers=8)

# Process with 8 GPUs
deduped = FuzzyDuplicates(...)(dataset)

Multi-modal curation

Image curation

python
from nemo_curator.image import (
    AestheticFilter,
    NSFWFilter,
    CLIPEmbedder
)

# Aesthetic scoring
aesthetic_filter = AestheticFilter(threshold=5.0)
filtered_images = aesthetic_filter(image_dataset)

# NSFW detection
nsfw_filter = NSFWFilter(threshold=0.9)
safe_images = nsfw_filter(filtered_images)

# Generate CLIP embeddings
clip_embedder = CLIPEmbedder(model="openai/clip-vit-base-patch32")
image_embeddings = clip_embedder(safe_images)

Video curation

python
from nemo_curator.video import (
    SceneDetector,
    ClipExtractor,
    InternVideo2Embedder
)

# Detect scenes
scene_detector = SceneDetector(threshold=27.0)
scenes = scene_detector(video_dataset)

# Extract clips
clip_extractor = ClipExtractor(min_duration=2.0, max_duration=10.0)
clips = clip_extractor(scenes)

# Generate embeddings
video_embedder = InternVideo2Embedder()
video_embeddings = video_embedder(clips)

Audio curation

python
from nemo_curator.audio import (
    ASRInference,
    WERFilter,
    DurationFilter
)

# ASR transcription
asr = ASRInference(model="nvidia/stt_en_fastconformer_hybrid_large_pc")
transcribed = asr(audio_dataset)

# Filter by WER (word error rate)
wer_filter = WERFilter(max_wer=0.3)
high_quality_audio = wer_filter(transcribed)

# Duration filtering
duration_filter = DurationFilter(min_duration=1.0, max_duration=30.0)
filtered_audio = duration_filter(high_quality_audio)

Common patterns

Web scrape curation (Common Crawl)

python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.filters import *
from nemo_curator.modules import *
from nemo_curator.datasets import DocumentDataset

# Load Common Crawl data
dataset = DocumentDataset.read_parquet("common_crawl/*.parquet")

# Pipeline
pipeline = [
    # 1. Quality filtering
    WordCountFilter(min_words=100, max_words=50000),
    RepeatedLinesFilter(max_repeated_line_fraction=0.2),
    SymbolToWordRatioFilter(max_symbol_to_word_ratio=0.3),
    UrlRatioFilter(max_url_ratio=0.3),

    # 2. Language filtering
    LanguageIdentificationFilter(target_languages=["en"]),

    # 3. Deduplication
    ExactDuplicates(id_field="id", text_field="text"),
    FuzzyDuplicates(id_field="id", text_field="text", num_hashes=260),

    # 4. PII redaction
    PIIRedactor(),

    # 5. NSFW filtering
    NSFWClassifier(threshold=0.8)
]

# Execute
for stage in pipeline:
    dataset = stage(dataset)

# Save
dataset.to_parquet("curated_common_crawl/")

Distributed processing

python
from nemo_curator import get_client
from dask_cuda import LocalCUDACluster

# Multi-GPU cluster
cluster = LocalCUDACluster(n_workers=8)
client = get_client(cluster=cluster)

# Process large dataset
dataset = DocumentDataset.read_parquet("s3://large_dataset/*.parquet")
deduped = FuzzyDuplicates(...)(dataset)

# Cleanup
client.close()
cluster.close()

Performance benchmarks

Fuzzy deduplication (8TB RedPajama v2)

  • CPU (256 cores): 120 hours
  • GPU (8× A100): 7.5 hours
  • Speedup: 16×

Exact deduplication (1TB)

  • CPU (64 cores): 8 hours
  • GPU (4× A100): 0.5 hours
  • Speedup: 16×

Quality filtering (100GB)

  • CPU (32 cores): 2 hours
  • GPU (2× A100): 0.2 hours
  • Speedup: 10×

Cost comparison

CPU-based curation (AWS c5.18xlarge × 10):

  • Cost: $3.60/hour × 10 = $36/hour
  • Time for 8TB: 120 hours
  • Total: $4,320

GPU-based curation (AWS p4d.24xlarge × 2):

  • Cost: $32.77/hour × 2 = $65.54/hour
  • Time for 8TB: 7.5 hours
  • Total: $491.55

Savings: 89% reduction ($3,828 saved)

Supported data formats

  • Input: Parquet, JSONL, CSV
  • Output: Parquet (recommended), JSONL
  • WebDataset: TAR archives for multi-modal

Use cases

Production deployments:

  • NVIDIA used NeMo Curator to prepare Nemotron-4 training data
  • Open-source datasets curated: RedPajama v2, The Pile

References

  • Filtering Guide - 30+ quality filters, heuristics
  • Deduplication Guide - Exact, fuzzy, semantic methods

Resources

Expand your agent's capabilities with these related and highly-rated skills.

NousResearch/hermes-agent

agentmail

Give the agent its own dedicated email inbox via AgentMail. Send, receive, and manage email autonomously using agent-owned email addresses (e.g. hermes-agent@agentmail.to).

56,643 7,481
Explore
NousResearch/hermes-agent

base

Query Base (Ethereum L2) blockchain data with USD pricing — wallet balances, token info, transaction details, gas analysis, contract inspection, whale detection, and live network stats. Uses Base RPC + CoinGecko. No API key required.

56,643 7,481
Explore
NousResearch/hermes-agent

solana

Query Solana blockchain data with USD pricing — wallet balances, token portfolios with values, transaction details, NFTs, whale detection, and live network stats. Uses Solana RPC + CoinGecko. No API key required.

56,643 7,481
Explore
NousResearch/hermes-agent

one-three-one-rule

Structured decision-making framework for technical proposals and trade-off analysis. When the user faces a choice between multiple approaches (architecture decisions, tool selection, refactoring strategies, migration paths), this skill produces a 1-3-1 format: one clear problem statement, three distinct options with pros/cons, and one concrete recommendation with definition of done and implementation plan. Use when the user asks for a "1-3-1", says "give me options", or needs help choosing between competing approaches.

56,643 7,481
Explore
NousResearch/hermes-agent

fastmcp

Build, test, inspect, install, and deploy MCP servers with FastMCP in Python. Use when creating a new MCP server, wrapping an API or database as MCP tools, exposing resources or prompts, or preparing a FastMCP server for Claude Code, Cursor, or HTTP deployment.

56,643 7,481
Explore
NousResearch/hermes-agent

qdrant-vector-search

High-performance vector similarity search engine for RAG and semantic search. Use when building production RAG systems requiring fast nearest neighbor search, hybrid search with filtering, or scalable vector storage with Rust-powered performance.

56,643 7,481
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results