Agent skill

bigquery-etl-core

The core skill for working within the bigquery-etl repository. Use this skill when understanding project structure, conventions, and common patterns. Works with model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.

Stars 6
Forks 1

Install this agent skill to your Project

npx add-skill https://github.com/mozilla/bigquery-etl-skills/tree/main/skills/bigquery-etl-core

SKILL.md

BigQuery ETL Core

Composable: Foundation skill that works with model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills When to use: Understanding project structure, conventions, common patterns, and finding schema descriptions for construction

Project Overview

The bigquery-etl project manages BigQuery table definitions, queries, and associated metadata for Mozilla. Similar to dbt, the repository maintains query definitions with associated metadata and schemas.

Each table/query typically consists of three files:

  • query.sql OR query.py - The query definition (SQL or Python)
  • metadata.yaml - Metadata about scheduling, ownership, and dependencies (see metadata-manager skill)
  • schema.yaml - BigQuery schema definition with field types and descriptions (see metadata-manager skill)

Note: Most tables use query.sql (~95%). Use query.py for API calls, multi-project queries, or complex Python operations. See query-writer skill for details.

🚨 REQUIRED READING - Start Here

When starting work in bigquery-etl, READ these foundational references:

  1. Naming Conventions: READ references/naming_conventions.md

    • Table naming patterns
    • Dataset organization
    • Version suffix conventions
  2. Dataset Organization: READ references/dataset_naming_conventions.md

    • Common dataset suffixes (_derived, _stable, _live)
    • When to use each dataset type
    • Dataset naming rules
  3. Schema Resources: READ references/discovery_resources.md

    • Schema description sources (Glean Dictionary, ProbeInfo API, DataHub)
    • Priority order for schema lookup during construction
    • Common mozfun UDFs
  4. Privacy Guidelines: READ references/privacy_guidelines.md

    • Data handling requirements
    • PII considerations
    • Workgroup access patterns

Directory Structure

sql/{project}/{dataset}/{table_name}/
├── query.sql OR query.py
├── metadata.yaml
└── schema.yaml

See assets/directory_structure_example.txt for detailed examples.

Key principles:

  • Always flat: sql/{project}/{dataset}/{table_name}/
  • Never use subdirectories within table directories
  • Table names always include version suffix (_v1, _v2, etc.)

Schema & Description Resources for Construction

Finding Schema Descriptions

Priority order for schema lookup during construction:

  1. Local files first: Check sql/*/schema.yaml and metadata.yaml files

    • Most reliable and up-to-date source
    • Contains field descriptions written by table owners
  2. Glean Dictionary: For _live and _stable tables

  3. ProbeInfo API: For Glean metric metadata

    • Endpoints: https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics
    • Provides metric definitions and descriptions programmatically
    • Use for validating metric references in queries
  4. DataHub MCP: Only as last resort

    • MUST READ references/datahub_best_practices.md BEFORE any DataHub queries
    • Use for schema lookup when not available in local files or Glean Dictionary
    • Extract ONLY necessary fields (column names, types, descriptions)
    • Use for downstream impact analysis when modifying tables

See references/discovery_resources.md for:

  • Detailed guidance on each schema source
  • ProbeInfo API endpoints and usage patterns
  • Glean Dictionary URL patterns for different products
  • DataHub MCP best practices for construction
  • Common mozfun UDFs
  • Key documentation links

Naming Conventions

Table Names:

  • Use snake_case with version suffix: clients_daily_event_v1
  • Common suffixes: _daily, _hourly, _aggregates, _summary

Field Names:

  • Use snake_case: submission_date, client_id, n_total_events
  • Prefix counts with n_: n_events, n_sessions
  • Standard Mozilla fields: submission_date, client_id, sample_id, normalized_channel, normalized_country_code, app_version

See references/naming_conventions.md for:

  • Complete naming patterns and conventions
  • Reserved/common patterns to avoid
  • BigQuery project naming conventions

Dataset Organization

See references/dataset_naming_conventions.md for:

  • Dataset naming patterns by suffix (_derived, _external, etc.)
  • Common dataset prefixes by product/source
  • Table versioning patterns
  • Incremental vs full refresh query patterns

Privacy & Data Handling

Mozilla follows strict data privacy policies:

  • No PII in derived tables
  • Use client-level identifiers (client_id) not individual identifiers
  • Respect data retention policies (~2 years for client-level data)
  • Label client-level tables with table_type: client_level in metadata.yaml

See references/privacy_guidelines.md for:

  • Key principles from Mozilla's data platform
  • Geo IP lookup and user agent parsing policies
  • Best practices for data handling
  • Deletion request support
  • Sample ID usage for sampling

BigQuery & Mozilla Conventions

Partitioning & Clustering

  • Most tables use day partitioning on submission_date
  • Clustering improves query performance for filtered/joined fields
  • See metadata-manager skill for detailed partitioning and clustering configuration

Common UDFs (mozfun)

Browse available functions: https://mozilla.github.io/bigquery-etl/mozfun/

Common functions:

  • mozfun.map.get_key() - Extract values from key-value maps
  • mozfun.norm.truncate_version() - Normalize version strings
  • mozfun.stats.mode_last() - Statistical mode calculation

UDF source code in sql/mozfun/ directory.

Glean Overview

Glean is Mozilla's product analytics & telemetry solution, providing consistent measurement across all Mozilla products.

Key concepts:

  • Metric types: Counter, boolean, string, event, etc.
  • Pings: Collections of metrics (e.g., baseline, events, metrics)
  • Applications: Products using Glean (Fenix, Focus, Firefox iOS, etc.)

Common Glean datasets in BigQuery:

  • Pattern: {app_id}.{ping_name} (e.g., org_mozilla_fenix.baseline)
  • All have auto-generated schemas based on metric definitions

See references/glean_overview.md for:

  • What is Glean and how it differs from Firefox Desktop Telemetry
  • Glean SDK and metric type details
  • Common Glean datasets in BigQuery
  • When to use Glean Dictionary

bigquery-etl CLI Commands

See references/bqetl_cli_commands.md for:

  • Key bqetl CLI commands for query creation, validation, schema updates
  • How to find the right DAG for scheduling
  • Backfill creation commands

Best Practices

General principles:

  • Always include field descriptions in schema.yaml (see metadata-manager skill)
  • Add header comments explaining query purpose (see query-writer skill)
  • Reference bug/ticket numbers for context
  • Document any data exclusions or filtering logic

See assets/query_structure_example.sql for standard query structure.

Version migration:

  • Create new _v2 table when making breaking schema changes
  • Keep _v1 running during migration period
  • Update views to point to new version
  • Coordinate with downstream consumers before deprecating old version

For detailed best practices, see:

Integration with Other Skills

bigquery-etl-core serves as the foundation skill that other skills build upon:

Works with model-requirements

  • Provides naming conventions for new tables and datasets
  • Supplies common field naming patterns for requirements gathering
  • Offers privacy guidelines for data model planning

Works with query-writer

  • Provides project structure and naming conventions
  • Supplies common patterns and mozfun UDF references
  • Offers schema description lookup guidance for construction

Works with metadata-manager

  • Provides DAG naming patterns and scheduling conventions
  • Supplies partitioning and clustering best practices
  • Offers ownership and labeling patterns

Works with sql-test-generator

  • Provides test structure and fixture naming conventions
  • Supplies common table patterns for test creation
  • Offers query parameter conventions

Works with bigconfig-generator

  • Provides table naming conventions for Bigeye monitoring configuration
  • Supplies dataset organization patterns
  • Offers field naming standards for data quality checks

This skill is always available and does not need to be explicitly invoked - it provides foundational knowledge that other skills reference.

Reference Examples

Real query examples in the repository:

  • Simple query: sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/query.sql
  • Aggregation with GROUP BY: sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/query.sql
  • Complex query with CTEs: sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/query.sql
  • Python ETL (INFORMATION_SCHEMA): sql/moz-fx-data-shared-prod/monitoring_derived/bigquery_table_storage_v1/query.py
  • Python ETL (External API): sql/moz-fx-data-shared-prod/bigeye_derived/user_service_v1/query.py

For more examples, explore the sql/moz-fx-data-shared-prod/ directory.

Bundled Resources

References

  • references/discovery_resources.md - Schema description sources (Glean Dictionary, ProbeInfo API, DataHub MCP), priority order for construction, documentation links
  • references/naming_conventions.md - Complete naming patterns for tables, fields, and projects
  • references/dataset_naming_conventions.md - Dataset organization and versioning patterns
  • references/privacy_guidelines.md - Mozilla data privacy policies and best practices
  • references/glean_overview.md - Glean SDK concepts and BigQuery dataset structures
  • references/bqetl_cli_commands.md - Key CLI commands and DAG discovery

DataHub Usage (CRITICAL for Token Efficiency)

BEFORE using any DataHub MCP tools (mcp__datahub-cloud__*), you MUST:

  • READ references/datahub_best_practices.md - Comprehensive token optimization strategies
  • Follow priority order: local files → documentation → DataHub (only as last resort)
  • Use search-first patterns and extract minimal fields from responses

Assets

  • assets/query_structure_example.sql - Standard query.sql structure with common patterns
  • assets/directory_structure_example.txt - File organization examples

Expand your agent's capabilities with these related and highly-rated skills.

mozilla/bigquery-etl-skills

model-requirements

Use this skill when gathering requirements for new BigQuery data models OR when asked to edit existing queries in bqetl. For new models, guides structured requirements interviews. For existing queries, understands current model, checks downstream dependencies, and gathers requirements for changes. Works as pre-planning before query-writer skill.

6 1
Explore
mozilla/bigquery-etl-skills

metadata-manager

Use this skill when creating or updating DAG configurations (dags.yaml), schema.yaml, and metadata.yaml files for BigQuery tables. Handles creating new DAGs when needed and coordinates test updates when queries are modified (invokes sql-test-generator as needed). Works with bigquery-etl-core, query-writer, and sql-test-generator skills.

6 1
Explore
mozilla/bigquery-etl-skills

bigconfig-generator

Use this skill when creating or updating Bigeye monitoring configurations (bigconfig.yml files) for BigQuery tables. Works with metadata-manager skill.

6 1
Explore
mozilla/bigquery-etl-skills

query-writer

Use this skill when writing or updating SQL queries (query.sql) or Python ETL scripts (query.py) following Mozilla BigQuery ETL conventions. ALWAYS checks for and updates existing tests when modifying queries. Coordinates downstream updates to schemas and tests. Works with bigquery-etl-core, metadata-manager, and sql-test-generator skills.

6 1
Explore
mozilla/bigquery-etl-skills

schema-readme-generator

Use this skill to create or update README.md files for BigQuery ETL tables in the mozilla bigquery-etl repository. Follows layout conventions derived from comparing README files across the repo — rich style with emoji headings, Mermaid data flow diagram, graduated example queries, and concise metadata overview table. Requires schema.yaml with complete descriptions (run schema-enricher first if needed) and a complete metadata.yaml.

6 1
Explore
mozilla/bigquery-etl-skills

sql-test-generator

ALWAYS use this skill when users ask to create, generate, or write UNIT TESTS for BigQuery SQL queries. Invoke proactively whenever the request includes "test" or "tests" with a query/table name. This skill is for unit testing ONLY (not data quality checks - use bigconfig-generator for Bigeye monitoring). Works with bigquery-etl-core skill to understand query patterns.

6 1
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results