Agent skills
great-expectations-validator

Agent skill

great-expectations-validator

Data quality validation skill using Great Expectations for schema validation, expectation suites, data documentation, and automated data quality checks in ML pipelines.

View SKILL.md on GitHub Repository

Stars 514

Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/a5c-ai/babysitter/tree/main/library/specializations/data-science-ml/skills/great-expectations-validator

SKILL.md

Great Expectations Validator

Validate data quality using Great Expectations for comprehensive data testing, documentation, and quality monitoring.

Overview

This skill provides capabilities for data quality validation using Great Expectations (GX), the leading open-source library for data quality. It enables creation and execution of expectation suites, data documentation generation, and integration with ML pipelines.

Capabilities

Expectation Suite Management

Create and configure expectation suites
Define expectations for columns and tables
Validate data against expectations
Store and version expectation suites

Data Validation

Schema validation (column presence, types)
Statistical validation (distributions, ranges)
Referential integrity checks
Custom SQL-based expectations
Regex pattern matching

Data Documentation

Generate data documentation (Data Docs)
Create profiling reports
Document validation results
Build data dictionaries

Pipeline Integration

Checkpoint configuration and execution
Batch request management
Action-based workflows (notifications, storage)
Integration with Airflow, Prefect, Dagster

Custom Expectations

Define domain-specific expectations
Parameterized expectations
Multi-column expectations
Row-condition based expectations

Prerequisites

Installation

bash

pip install great_expectations>=0.18.0

Optional Connectors

bash

# Database connectors
pip install great_expectations[sqlalchemy]

# Cloud storage
pip install great_expectations[s3]  # AWS
pip install great_expectations[gcs]  # GCP
pip install great_expectations[azure]  # Azure

# Spark support
pip install great_expectations[spark]

Usage Patterns

Initialize Great Expectations Project

bash

# Initialize GX project
great_expectations init

# Creates:
# great_expectations/
# ├── great_expectations.yml
# ├── expectations/
# ├── checkpoints/
# ├── plugins/
# └── uncommitted/

Create Expectation Suite from Profiler

python

import great_expectations as gx

# Initialize context
context = gx.get_context()

# Add datasource
datasource = context.sources.add_pandas("my_datasource")
data_asset = datasource.add_csv_asset("customers", filepath_or_buffer="customers.csv")

# Create batch request
batch_request = data_asset.build_batch_request()

# Create expectation suite with profiler
expectation_suite = context.add_or_update_expectation_suite("customer_suite")

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="customer_suite"
)

# Profile and generate expectations
validator.expect_column_to_exist("customer_id")
validator.expect_column_values_to_be_unique("customer_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
validator.expect_column_values_to_be_in_set("status", ["active", "inactive", "pending"])
validator.expect_column_values_to_match_regex("email", r"^[\w\.-]+@[\w\.-]+\.\w+$")

# Save suite
validator.save_expectation_suite(discard_failed_expectations=False)

Validate Data with Checkpoint

python

import great_expectations as gx

context = gx.get_context()

# Create checkpoint
checkpoint = context.add_or_update_checkpoint(
    name="customer_checkpoint",
    validations=[
        {
            "batch_request": {
                "datasource_name": "my_datasource",
                "data_asset_name": "customers"
            },
            "expectation_suite_name": "customer_suite"
        }
    ],
    action_list=[
        {
            "name": "store_validation_result",
            "action": {"class_name": "StoreValidationResultAction"}
        },
        {
            "name": "update_data_docs",
            "action": {"class_name": "UpdateDataDocsAction"}
        }
    ]
)

# Run checkpoint
result = checkpoint.run()

# Check results
if result.success:
    print("Validation passed!")
else:
    print("Validation failed!")
    for validation_result in result.run_results.values():
        for result in validation_result.results:
            if not result.success:
                print(f"Failed: {result.expectation_config.expectation_type}")

Common Expectations

python

# Column existence and types
validator.expect_column_to_exist("column_name")
validator.expect_column_values_to_be_of_type("column_name", "int64")
validator.expect_table_column_count_to_equal(10)

# Null handling
validator.expect_column_values_to_not_be_null("column_name")
validator.expect_column_values_to_be_null("deprecated_column")

# Uniqueness
validator.expect_column_values_to_be_unique("id_column")
validator.expect_compound_columns_to_be_unique(["col1", "col2"])

# Value ranges
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
validator.expect_column_min_to_be_between("score", min_value=0)
validator.expect_column_max_to_be_between("score", max_value=100)

# Set membership
validator.expect_column_values_to_be_in_set("status", ["A", "B", "C"])
validator.expect_column_distinct_values_to_be_in_set("category", ["cat1", "cat2"])

# String patterns
validator.expect_column_values_to_match_regex("email", r"^[\w\.-]+@[\w\.-]+\.\w+$")
validator.expect_column_value_lengths_to_be_between("code", min_value=5, max_value=10)

# Statistical
validator.expect_column_mean_to_be_between("value", min_value=50, max_value=100)
validator.expect_column_stdev_to_be_between("value", min_value=0, max_value=20)
validator.expect_column_proportion_of_unique_values_to_be_between("id", min_value=0.9)

Integration with Babysitter SDK

Task Definition Example

javascript

const dataValidationTask = defineTask({
  name: 'great-expectations-validation',
  description: 'Validate data quality using Great Expectations',

  inputs: {
    dataPath: { type: 'string', required: true },
    expectationSuiteName: { type: 'string', required: true },
    checkpointName: { type: 'string' },
    failOnError: { type: 'boolean', default: true }
  },

  outputs: {
    success: { type: 'boolean' },
    validationResults: { type: 'object' },
    failedExpectations: { type: 'array' },
    dataDocsUrl: { type: 'string' }
  },

  async run(inputs, taskCtx) {
    return {
      kind: 'skill',
      title: `Validate data: ${inputs.expectationSuiteName}`,
      skill: {
        name: 'great-expectations-validator',
        context: {
          operation: 'validate',
          dataPath: inputs.dataPath,
          expectationSuiteName: inputs.expectationSuiteName,
          checkpointName: inputs.checkpointName,
          failOnError: inputs.failOnError
        }
      },
      io: {
        inputJsonPath: `tasks/${taskCtx.effectId}/input.json`,
        outputJsonPath: `tasks/${taskCtx.effectId}/result.json`
      }
    };
  }
});

MCP Server Integration

Using gx-mcp-server

json

{
  "mcpServers": {
    "great-expectations": {
      "command": "uvx",
      "args": ["gx-mcp-server"],
      "env": {
        "GX_CONTEXT_ROOT": "./great_expectations"
      }
    }
  }
}

Available MCP Tools

gx_list_datasources - List configured datasources
gx_list_expectation_suites - List expectation suites
gx_run_checkpoint - Execute a checkpoint
gx_validate_data - Validate data against suite
gx_get_validation_results - Retrieve validation results

ML Pipeline Integration

Training Data Validation

python

def validate_training_data(df, suite_name="training_data_suite"):
    """Validate training data before model training."""
    context = gx.get_context()

    # Add dataframe as datasource
    datasource = context.sources.add_pandas("training_data")
    data_asset = datasource.add_dataframe_asset("df")
    batch_request = data_asset.build_batch_request(dataframe=df)

    # Validate
    checkpoint = context.add_or_update_checkpoint(
        name="training_validation",
        validations=[{
            "batch_request": batch_request,
            "expectation_suite_name": suite_name
        }]
    )

    result = checkpoint.run()

    if not result.success:
        failed = [r for r in result.run_results.values()
                  for r in r.results if not r.success]
        raise ValueError(f"Training data validation failed: {len(failed)} expectations failed")

    return True

Feature Quality Checks

python

# Expectations for ML features
validator.expect_column_values_to_not_be_null("feature_1", mostly=0.95)
validator.expect_column_values_to_be_between("feature_1", min_value=-3, max_value=3)  # Standard scaled
validator.expect_column_proportion_of_unique_values_to_be_between("categorical_feature", min_value=0.001)
validator.expect_column_kl_divergence_to_be_less_than("feature_1",
    partition_object=reference_distribution,
    threshold=0.1)

Best Practices

Version Expectation Suites: Store suites in version control
Use Checkpoints: Always validate through checkpoints for consistency
Set Mostly Parameter: Allow for small data quality issues with mostly=0.95
Generate Data Docs: Document your data for team visibility
Fail Fast: Validate data early in pipelines
Custom Expectations: Create domain-specific expectations for your use case

References

Maintainer

a5c-ai Core maintainer

Source details

Full Name: a5c-ai/babysitter
Branch: main
Path in repo: library/specializations/data-science-ml/skills/great-expectations-validator
License: MIT License
Topics: claude-code agent-skills claude-code-skills ai-agents claude-skills vibe-coding agentic-workflow agentic-ai ai-automation agent-orchestration babysitter trustworthy-ai

Featured Tools

Join Our Newsletter

Stay updated with the latest AI tools, news, and offers by subscribing to our weekly newsletter.

Recommended Agent Skills

Expand your agent's capabilities with these related and highly-rated skills.

a5c-ai/babysitter

gsd-tools

Central utility skill for GSD operations. Provides config parsing, slug generation, timestamps, path operations, and orchestrates calls to other specialized skills. Acts as the unified entry point that the original gsd-tools.cjs provided via its lib/ modules (commands, config, core, init).

514 31

Explore

a5c-ai/babysitter

model-profile-resolution

Resolve model profile (quality/balanced/budget) at orchestration start and map agents to specific models. Enables cost/quality tradeoffs by selecting appropriate AI models for each agent role.

514 31

Explore

a5c-ai/babysitter

verification-suite

Plan structure validation, phase completeness checks, reference integrity verification, and artifact existence confirmation. Provides the structured verification layer ensuring GSD artifacts are well-formed and complete.

514 31

Explore

a5c-ai/babysitter

state-management

STATE.md reading, writing, and field-level updates. Provides cross-session state persistence via .planning/STATE.md with structured fields for current task, completed phases, blockers, decisions, and quick tasks.

514 31

Explore

a5c-ai/babysitter

git-integration

Git commit patterns, formats, and conventions for GSD methodology. Provides atomic commits per task, structured commit messages, planning file commits, branch management, and milestone tag operations.

514 31

Explore

a5c-ai/babysitter

frontmatter-parsing

YAML frontmatter parsing and manipulation for .planning/ documents. Provides read, write, update, query, and validation operations on frontmatter blocks in GSD markdown artifacts.

514 31

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

Great Expectations Validator

Overview

Capabilities

Expectation Suite Management

Data Validation

Data Documentation

Pipeline Integration

Custom Expectations

Prerequisites

Installation

Optional Connectors

Usage Patterns

Initialize Great Expectations Project

Create Expectation Suite from Profiler

Validate Data with Checkpoint

Common Expectations

Integration with Babysitter SDK

Task Definition Example

MCP Server Integration

Using gx-mcp-server

Available MCP Tools

ML Pipeline Integration

Training Data Validation

Feature Quality Checks

Best Practices

References

Recommended Agent Skills

gsd-tools

model-profile-resolution

verification-suite

state-management

git-integration

frontmatter-parsing