Agent skill
tracing-upstream-lineage
Trace upstream data lineage. Use when the user asks where data comes from, what feeds a table, upstream dependencies, data sources, or needs to understand data origins.
Install this agent skill to your Project
npx add-skill https://github.com/astronomer/agents/tree/main/skills/tracing-upstream-lineage
SKILL.md
Upstream Lineage: Sources
Trace the origins of data - answer "Where does this data come from?"
Lineage Investigation
Step 1: Identify the Target Type
Determine what we're tracing:
- Table: Trace what populates this table
- Column: Trace where this specific column comes from
- DAG: Trace what data sources this DAG reads from
Step 2: Find the Producing DAG
Tables are typically populated by Airflow DAGs. Find the connection:
-
Search DAGs by name: Use
af dags listand look for DAG names matching the table nameload_customers->customerstableetl_daily_orders->orderstable
-
Explore DAG source code: Use
af dags source <dag_id>to read the DAG definition- Look for INSERT, MERGE, CREATE TABLE statements
- Find the target table in the code
-
Check DAG tasks: Use
af tasks list <dag_id>to see what operations the DAG performs
On Astro
If you're running on Astro, the Lineage tab in the Astro UI provides visual lineage exploration across DAGs and datasets. Use it to quickly trace upstream dependencies without manually searching DAG source code.
On OSS Airflow
Use DAG source code and task logs to trace lineage (no built-in cross-DAG UI).
Step 3: Trace Data Sources
From the DAG code, identify source tables and systems:
SQL Sources (look for FROM clauses):
# In DAG code:
SELECT * FROM source_schema.source_table # <- This is an upstream source
External Sources (look for connection references):
S3Operator-> S3 bucket sourcePostgresOperator-> Postgres database sourceSalesforceOperator-> Salesforce API sourceHttpOperator-> REST API source
File Sources:
- CSV/Parquet files in object storage
- SFTP drops
- Local file paths
Step 4: Build the Lineage Chain
Recursively trace each source:
TARGET: analytics.orders_daily
^
+-- DAG: etl_daily_orders
^
+-- SOURCE: raw.orders (table)
| ^
| +-- DAG: ingest_orders
| ^
| +-- SOURCE: Salesforce API (external)
|
+-- SOURCE: dim.customers (table)
^
+-- DAG: load_customers
^
+-- SOURCE: PostgreSQL (external DB)
Step 5: Check Source Health
For each upstream source:
- Tables: Check freshness with the checking-freshness skill
- DAGs: Check recent run status with
af dags stats - External systems: Note connection info from DAG code
Lineage for Columns
When tracing a specific column:
- Find the column in the target table schema
- Search DAG source code for references to that column name
- Trace through transformations:
- Direct mappings:
source.col AS target_col - Transformations:
COALESCE(a.col, b.col) AS target_col - Aggregations:
SUM(detail.amount) AS total_amount
- Direct mappings:
Output: Lineage Report
Summary
One-line answer: "This table is populated by DAG X from sources Y and Z"
Lineage Diagram
[Salesforce] --> [raw.opportunities] --> [stg.opportunities] --> [fct.sales]
| |
DAG: ingest_sfdc DAG: transform_sales
Source Details
| Source | Type | Connection | Freshness | Owner |
|---|---|---|---|---|
| raw.orders | Table | Internal | 2h ago | data-team |
| Salesforce | API | salesforce_conn | Real-time | sales-ops |
Transformation Chain
Describe how data flows and transforms:
- Raw data lands in
raw.ordersvia Salesforce API sync - DAG
transform_orderscleans and dedupes intostg.orders - DAG
build_order_factsjoins with dimensions intofct.orders
Data Quality Implications
- Single points of failure?
- Stale upstream sources?
- Complex transformation chains that could break?
Related Skills
- Check source freshness: checking-freshness skill
- Debug source DAG: debugging-dags skill
- Trace downstream impacts: tracing-downstream-lineage skill
- Add manual lineage annotations: annotating-task-lineage skill
- Build custom lineage extractors: creating-openlineage-extractors skill
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
testing-dags
Complex DAG testing workflows with debugging and fixing cycles. Use for multi-step testing requests like "test this dag and fix it if it fails", "test and debug", "run the pipeline and troubleshoot issues". For simple test requests ("test dag", "run dag"), the airflow entrypoint skill handles it directly. This skill is for iterative test-debug-fix cycles.
managing-astro-local-env
Manage local Airflow environment with Astro CLI. Use when the user wants to start, stop, or restart Airflow, view logs, troubleshoot containers, or fix environment issues. For project setup, see setting-up-astro-project.
analyzing-data
Queries data warehouse and answers business questions about data. Handles questions requiring database/warehouse queries including "who uses X", "how many Y", "show me Z", "find customers", "what is the count", data lookups, metrics, trends, or SQL analysis.
setting-up-astro-project
Initialize and configure Astro/Airflow projects. Use when the user wants to create a new project, set up dependencies, configure connections/variables, or understand project structure. For running the local environment, see managing-astro-local-env.
airflow-plugins
Build Airflow 3.1+ plugins that embed FastAPI apps, custom UI pages, React components, middleware, macros, and operator links directly into the Airflow UI. Use this skill whenever the user wants to create an Airflow plugin, add a custom UI page or nav entry to Airflow, build FastAPI-backed endpoints inside Airflow, serve static assets from a plugin, embed a React app in the Airflow UI, add middleware to the Airflow API server, create custom operator extra links, or call the Airflow REST API from inside a plugin. Also trigger when the user mentions AirflowPlugin, fastapi_apps, external_views, react_apps, plugin registration, or embedding a web app in Airflow 3.1+. If someone is building anything custom inside Airflow 3.1+ that involves Python and a browser-facing interface, this skill almost certainly applies.
warehouse-init
Initialize warehouse schema discovery. Generates .astro/warehouse.md with all table metadata for instant lookups. Run once per project, refresh when schema changes. Use when user says "/astronomer-data:warehouse-init" or asks to set up data discovery.
Didn't find tool you were looking for?