Taxonomy Architect

Design, maintain, and evolve classification systems that power job matching, skill analysis, and company categorization. Ensure taxonomies are precise, consistent, and extensible enough to support future semantic search.

When to Use This Skill

Trigger when the user asks to:

Define or refine job role categories (families, subfamilies)
Create or update skill taxonomies
Classify ambiguous roles or edge cases
Design company categorization schemes
Plan ontology structures for semantic search
Resolve classification conflicts or inconsistencies
Evaluate taxonomy coverage and gaps
Prepare embeddings or vector search strategies

Core Principles

1. Mutually Exclusive, Collectively Exhaustive (MECE)

Categories at the same level should not overlap, and together should cover all cases.

BAD:                          GOOD:
├── Data Analyst              ├── Data Analyst
├── Business Analyst          ├── Analytics Engineer
├── Analytics (overlap!)      ├── Data Engineer
└── BI Developer (overlap!)   └── Data Scientist
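
Mutual exclusivity can be partly checked mechanically by looking for signals shared between sibling categories. A minimal sketch, assuming each category is represented as a set of title signals; the `find_overlaps` helper and the sample data are hypothetical and not part of the taxonomy:

```python
from itertools import combinations


def find_overlaps(categories: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Return the signals shared by each pair of sibling categories."""
    overlaps = {}
    for (name_a, signals_a), (name_b, signals_b) in combinations(sorted(categories.items()), 2):
        shared = signals_a & signals_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Hypothetical sample: "analytics" reuses a title that "data_analyst" already owns.
siblings = {
    "data_analyst": {"data analyst", "bi analyst"},
    "analytics": {"data analyst", "analytics lead"},
    "data_engineer": {"data engineer"},
}
overlaps = find_overlaps(siblings)
```

An empty result does not prove the categories are MECE; it only rules out literal signal collisions.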

2. User Mental Model Alignment

Categories should match how practitioners describe themselves, not internal corporate structures.

BAD (org-chart thinking):     GOOD (practitioner thinking):
├── Engineering               ├── Data Engineer
│   └── Data                  ├── ML Engineer
├── Analytics                 ├── Analytics Engineer
│   └── Data                  └── Data Analyst
└── Science
    └── Data

3. Stable Core, Flexible Edges

Core categories should be stable over time. Edge cases and emerging roles should be handled without restructuring the core.

STABLE CORE:                  EDGE HANDLING:
├── Data Engineer             "AI Engineer" → classify as:
├── ML Engineer               - ML Engineer (if model-focused)
├── Data Scientist            - out_of_scope (if API integration)
└── Analytics Engineer        Document decision, revisit quarterly

4. Evidence-Based Boundaries

Category boundaries should be defined by observable signals in job postings, not assumptions.

Signal Type	Examples
Title patterns	"Analytics Engineer" vs "Data Analyst"
Tool requirements	dbt, Airflow, Spark → Data/Analytics Engineer
Responsibility keywords	"build pipelines" vs "create dashboards"
Team placement	"Data Platform team" vs "Business Intelligence"
Seniority markers	"Principal", "Staff", "Lead", "Senior", "Junior"

5. Semantic Readiness

Design with future embedding/vector search in mind. Categories should be describable in natural language that captures semantic meaning.

GOOD (embeddable description):
"Analytics Engineer: Builds and maintains data transformation 
pipelines using tools like dbt, creates metrics layers and 
semantic models, bridges raw data and analyst-ready datasets."

BAD (list of keywords):
"Analytics Engineer: dbt, SQL, data modeling, metrics"

Current Taxonomy (v1.5)

Job Families & Subfamilies

yaml

job_families:
  product:
    description: "Roles focused on product strategy, discovery, and delivery"
    subfamilies:
      core_pm:
        label: "Core PM"
        description: "General product management for user-facing features"
        signals:
          titles: ["Product Manager", "PM", "Product Lead"]
          keywords: ["roadmap", "user stories", "stakeholders", "prioritization"]
          anti_signals: ["growth", "platform", "API", "ML", "AI"]
        
      growth_pm:
        label: "Growth PM"
        description: "Acquisition, retention, monetization, conversion optimization"
        signals:
          titles: ["Growth PM", "Growth Product Manager"]
          keywords: ["acquisition", "retention", "conversion", "funnel", "experimentation", "A/B testing"]
          
      platform_pm:
        label: "Platform PM"
        description: "Developer tools, APIs, infrastructure products"
        signals:
          titles: ["Platform PM", "API PM", "Developer Experience PM"]
          keywords: ["API", "SDK", "developer", "platform", "infrastructure", "internal tools"]
          
      technical_pm:
        label: "Technical PM"
        description: "Deep technical skills required, often ex-engineers"
        signals:
          titles: ["Technical Product Manager", "TPM"]
          keywords: ["technical requirements", "engineering background", "system design"]
          
      ai_ml_pm:
        label: "AI/ML PM"
        description: "AI/ML products, models, data products"
        signals:
          titles: ["AI PM", "ML PM", "AI Product Manager", "Data Product Manager"]
          keywords: ["machine learning", "AI", "model", "LLM", "GenAI", "data product"]

  data:
    description: "Roles focused on data infrastructure, analysis, and machine learning"
    subfamilies:
      product_analytics:
        label: "Product Analytics"
        description: "Product metrics, experiments, user behavior, growth analytics"
        signals:
          titles: ["Product Analyst", "Growth Analyst", "Product Data Analyst"]
          keywords: ["product metrics", "experimentation", "user behavior", "funnel analysis", "Amplitude", "Mixpanel"]
          anti_signals: ["pipeline", "infrastructure", "model training"]
          
      data_analyst:
        label: "Data Analyst"
        description: "Business reporting, dashboards, SQL analysis, BI tools"
        signals:
          titles: ["Data Analyst", "Business Analyst", "BI Analyst", "Reporting Analyst"]
          keywords: ["dashboards", "reporting", "Tableau", "Power BI", "Looker", "business intelligence"]
          anti_signals: ["dbt", "pipeline", "modeling layer"]
          
      analytics_engineer:
        label: "Analytics Engineer"
        description: "dbt, metrics layer, data modeling, semantic layer"
        signals:
          titles: ["Analytics Engineer", "Data Modeling Engineer"]
          keywords: ["dbt", "data modeling", "metrics layer", "semantic layer", "transformation"]
          disambiguate_from: ["data_analyst", "data_engineer"]
          
      data_engineer:
        label: "Data Engineer"
        description: "Pipelines, infrastructure, ETL/ELT, big data"
        signals:
          titles: ["Data Engineer", "ETL Developer", "Data Platform Engineer"]
          keywords: ["pipeline", "Airflow", "Spark", "ETL", "ELT", "data infrastructure", "Kafka"]
          
      ml_engineer:
        label: "ML Engineer"
        description: "Production ML systems, MLOps, includes LLM/GenAI implementation"
        signals:
          titles: ["ML Engineer", "Machine Learning Engineer", "MLOps Engineer", "AI Engineer"]
          keywords: ["model deployment", "MLOps", "feature store", "model serving", "LLM", "fine-tuning"]
          notes: "AI Engineer roles classify here if model-focused; out_of_scope if primarily API integration"
          
      data_scientist:
        label: "Data Scientist"
        description: "Statistical modeling, predictions, business insights"
        signals:
          titles: ["Data Scientist", "Senior Data Scientist", "Applied Scientist"]
          keywords: ["statistical modeling", "prediction", "regression", "classification", "causal inference"]
          disambiguate_from: ["ml_engineer", "research_scientist"]
          
      research_scientist:
        label: "Research Scientist (ML/AI)"
        description: "Novel ML research, publications, pushing state-of-the-art"
        signals:
          titles: ["Research Scientist", "ML Researcher", "AI Researcher"]
          keywords: ["publications", "novel", "state-of-the-art", "research", "PhD"]
          
      data_architect:
        label: "Data Architect"
        description: "Data strategy, governance, platform design"
        signals:
          titles: ["Data Architect", "Enterprise Data Architect", "Data Governance Lead"]
          keywords: ["data strategy", "governance", "data catalog", "metadata", "architecture"]
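
The `signals` and `anti_signals` fields above could drive a simple scoring pass. A minimal sketch, assuming naive case-insensitive substring matching and illustrative weights (2 per title hit, 1 per keyword hit, -1 per anti-signal hit); the weights and helper names are assumptions, not part of the taxonomy:

```python
# Excerpt of two data subfamilies, copied from the YAML above.
SUBFAMILIES = {
    "data_analyst": {
        "titles": ["Data Analyst", "Business Analyst", "BI Analyst", "Reporting Analyst"],
        "keywords": ["dashboards", "reporting", "Tableau", "Power BI", "Looker", "business intelligence"],
        "anti_signals": ["dbt", "pipeline", "modeling layer"],
    },
    "analytics_engineer": {
        "titles": ["Analytics Engineer", "Data Modeling Engineer"],
        "keywords": ["dbt", "data modeling", "metrics layer", "semantic layer", "transformation"],
        "anti_signals": [],
    },
}


def count_hits(terms, text):
    """Count how many terms occur in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return sum(1 for term in terms if term.lower() in lowered)


def score_subfamilies(title, description):
    """Score each subfamily; the weights are illustrative assumptions."""
    scores = {}
    for name, signals in SUBFAMILIES.items():
        scores[name] = (
            2 * count_hits(signals["titles"], title)
            + count_hits(signals["keywords"], description)
            - count_hits(signals["anti_signals"], description)
        )
    return scores


scores = score_subfamilies(
    "Data Analyst",
    "Build dbt models for the semantic layer and metrics layer.",
)
best = max(scores, key=scores.get)
```

Substring matching is too loose for short signals such as "PM" or "AI"; a production version would likely need word-boundary matching.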

Seniority Levels

yaml

seniority:
  junior:
    label: "Junior"
    signals:
      titles: ["Junior", "Jr", "Associate", "Entry Level", "Graduate"]
      experience: ["0-2 years", "entry level", "new grad"]
      
  mid:
    label: "Mid-Level"
    signals:
      titles: ["Data Engineer", "Product Manager"] # No prefix = usually mid
      experience: ["2-5 years", "3+ years"]
      
  senior:
    label: "Senior"
    signals:
      titles: ["Senior", "Sr", "Lead"]
      experience: ["5+ years", "7+ years"]
      
  staff_plus:
    label: "Staff+"
    signals:
      titles: ["Staff", "Principal", "Distinguished", "Architect", "Director"]
      experience: ["10+ years", "extensive experience"]
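
Title-based seniority detection might look like the following. A minimal sketch, assuming markers are matched as whole words and that the most senior matching level wins; the `infer_seniority` helper and the check order are assumptions:

```python
import re

# Title markers from the seniority block above, checked from most to least senior.
SENIORITY_TITLES = [
    ("staff_plus", ["Staff", "Principal", "Distinguished", "Architect", "Director"]),
    ("senior", ["Senior", "Sr", "Lead"]),
    ("junior", ["Junior", "Jr", "Associate", "Entry Level", "Graduate"]),
]


def infer_seniority(title: str) -> str:
    """Return the first level whose marker appears as a whole word; default to mid."""
    for level, markers in SENIORITY_TITLES:
        for marker in markers:
            if re.search(rf"\b{re.escape(marker)}\b", title, flags=re.IGNORECASE):
                return level
    return "mid"  # no prefix usually means mid-level, per the comment in the YAML
```

Note that a plain "Data Architect" title would come back as `staff_plus` under this sketch because "Architect" is a Staff+ marker, so subfamily and seniority signals may need to be reconciled.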

Working Arrangement

yaml

working_arrangement:
  onsite:
    label: "Onsite"
    signals: ["on-site", "in-office", "office-based", "in-person"]
    
  hybrid:
    label: "Hybrid"
    signals: ["hybrid", "flexible", "2-3 days in office", "partial remote"]
    
  remote:
    label: "Remote"
    signals: ["remote", "work from home", "distributed", "anywhere"]
    qualifiers: ["remote (US only)", "remote (timezone restricted)"]
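
Because "partial remote" contains "remote", the order in which arrangements are checked matters. A minimal sketch, assuming hybrid signals are tested first; the `detect_arrangement` helper and the ordering are assumptions:

```python
# Signals from the working_arrangement block above; hybrid is checked first
# so that "partial remote" is not misread as fully remote.
ARRANGEMENT_SIGNALS = [
    ("hybrid", ["hybrid", "flexible", "2-3 days in office", "partial remote"]),
    ("onsite", ["on-site", "in-office", "office-based", "in-person"]),
    ("remote", ["remote", "work from home", "distributed", "anywhere"]),
]


def detect_arrangement(text: str):
    """Return the first arrangement whose signal appears in the text, else None."""
    lowered = text.lower()
    for arrangement, signals in ARRANGEMENT_SIGNALS:
        if any(signal in lowered for signal in signals):
            return arrangement
    return None
```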

Skills Taxonomy

Structure

yaml

skills:
  parent_categories:
    product:
      label: "Product Skills"
      families:
        - discovery_research
        - execution_delivery
        - experimentation
        - analytics_pm
        - stakeholder_mgmt
        
    data_ml:
      label: "Data/ML Skills"
      families:
        - programming
        - analytics_stats
        - classical_ml
        - deep_learning
        - llm_genai
        - big_data
        - pipelines_orchestration
        - data_modeling
        - warehouses_lakes
        - mlops
        - cloud
        - streaming
        - visualization
        
    platform_infra:
      label: "Platform/Infra Skills"
      families:
        - deployment
        - infrastructure_code
        - ci_cd
        - monitoring

Skill Family Details

yaml

data_ml:
  programming:
    label: "Programming Languages"
    skills: ["Python", "R", "SQL", "Scala", "Java", "Julia"]
    notes: "SQL is both a language and a skill; always extract"
    
  analytics_stats:
    label: "Analytics & Statistics"
    skills: ["Statistics", "Probability", "Regression", "Causal inference", 
             "Time series", "Hypothesis testing", "Bayesian analysis"]
             
  classical_ml:
    label: "Classical Machine Learning"
    skills: ["Scikit-learn", "XGBoost", "LightGBM", "Random Forest",
             "Logistic regression", "SVM", "Feature engineering"]
             
  deep_learning:
    label: "Deep Learning"
    skills: ["PyTorch", "TensorFlow", "Keras", "Neural networks",
             "CNNs", "RNNs", "Computer vision", "NLP"]
             
  llm_genai:
    label: "LLM/GenAI"
    skills: ["LLMs", "Transformers", "GPT", "BERT", "Claude",
             "Prompt engineering", "RAG", "Vector databases",
             "LangChain", "Embeddings", "Fine-tuning"]
    notes: "Fast-evolving category; review quarterly"
    
  big_data:
    label: "Big Data Processing"
    skills: ["Spark", "PySpark", "Hadoop", "Hive", "Presto", "Flink"]
    
  pipelines_orchestration:
    label: "Pipelines & Orchestration"
    skills: ["Airflow", "Dagster", "Prefect", "Luigi",
             "Data pipelines", "ETL", "ELT"]
             
  data_modeling:
    label: "Data Modeling"
    skills: ["dbt", "Data modeling", "Dimensional modeling",
             "Star schema", "Data warehouse design"]
             
  warehouses_lakes:
    label: "Warehouses & Lakes"
    skills: ["Snowflake", "BigQuery", "Redshift", "Databricks",
             "Athena", "Delta Lake", "Data lake"]
             
  mlops:
    label: "MLOps"
    skills: ["MLflow", "Kubeflow", "Model serving", "Model monitoring",
             "Feature stores", "Model registry", "Weights & Biases"]
             
  cloud:
    label: "Cloud Platforms"
    skills: ["AWS", "GCP", "Azure", "S3", "EC2", "Lambda", "Cloud Functions"]
    
  streaming:
    label: "Streaming"
    skills: ["Kafka", "Kinesis", "Pub/Sub", "Real-time processing"]
    
  visualization:
    label: "Data Visualization"
    skills: ["Tableau", "Power BI", "Looker", "Metabase",
             "Plotly", "Matplotlib", "Seaborn"]

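Skill extraction against these families could be sketched as below, assuming whole-word, case-insensitive matching so that a one-letter skill like "R" does not match inside "research". The `extract_skills` helper and the sample posting are hypothetical:

```python
import re

# Excerpt of the skill families above.
SKILL_FAMILIES = {
    "programming": ["Python", "R", "SQL", "Scala"],
    "data_modeling": ["dbt", "Dimensional modeling", "Star schema"],
    "pipelines_orchestration": ["Airflow", "Dagster", "ETL"],
}


def extract_skills(text: str) -> dict[str, list[str]]:
    """Map each skill family to the skills mentioned in the text as whole words."""
    found = {}
    for family, skills in SKILL_FAMILIES.items():
        hits = [
            skill
            for skill in skills
            if re.search(rf"(?<!\w){re.escape(skill)}(?!\w)", text, flags=re.IGNORECASE)
        ]
        if hits:
            found[family] = hits
    return found


posting = "We use Python and SQL with dbt on top of Airflow. Strong research skills required."
skills = extract_skills(posting)
```
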
Company/Employer Taxonomy

System of Record: docs/schema_taxonomy.yaml (see enums.employer_industry)

Employer Industry (20 Domain-Focused Categories)

Design Decision: These are industry VERTICALS, not business models. "B2B SaaS" was intentionally excluded - it's a business model that spans multiple industries. A company like Stripe is fintech even though it sells B2B SaaS.

Code	Label	Examples
`fintech`	FinTech	Stripe, Monzo, Affirm, Plaid
`healthtech`	HealthTech	Flatiron, Omada, Oscar
`ecommerce`	E-commerce & Marketplace	Instacart, Deliveroo, Etsy
`ai_ml`	AI/ML	OpenAI, Anthropic, Harvey AI
`consumer`	Consumer Tech	Spotify, Reddit, Strava
`mobility`	Mobility & Logistics	Uber, Waymo, Zipline
`proptech`	PropTech	Airbnb, Zillow, CoStar
`edtech`	EdTech	Coursera, Duolingo
`climate`	Climate Tech	Watershed, Crusoe
`crypto`	Crypto & Web3	Coinbase, Kraken
`devtools`	Developer Tools	GitHub, Vercel, Linear
`data_infra`	Data Infrastructure	Snowflake, Databricks, dbt Labs
`cybersecurity`	Cybersecurity	Okta, Vanta, 1Password
`hr_tech`	HR Tech	Rippling, Gusto, Deel
`martech`	Marketing Tech	Braze, Amplitude, HubSpot
`professional_services`	Professional Services	Deloitte, Accenture
`productivity`	Productivity & Collaboration	Notion, Asana, Airtable, Calendly
`hardware`	Hardware & Robotics	Apple, Gecko Robotics
`other`	Other	Catch-all

Employer Size

Code	Label	Signals
`startup`	Startup (1-50)	seed, series A, early stage
`scaleup`	Scale-up (51-500)	series B/C, growth stage
`enterprise`	Enterprise (501+)	public, Fortune 500, established
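
When a headcount is known, the size code can be derived directly. A minimal sketch; the `size_bucket` helper is hypothetical, and treating a headcount of exactly 500 as a scale-up is an assumption:

```python
def size_bucket(headcount: int) -> str:
    """Map an employee count to an employer size code."""
    if headcount < 1:
        raise ValueError("headcount must be at least 1")
    if headcount <= 50:
        return "startup"
    if headcount <= 500:  # assumption: exactly 500 counts as scale-up
        return "scaleup"
    return "enterprise"
```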

Multi-Industry Companies

Some companies span multiple industries. Classification rules:

1. Single primary industry - Each company gets ONE industry value (MECE)
2. Classify by core product/revenue - Stripe is fintech (payments), not devtools
3. For conglomerates - Classify by the division most relevant to the job posting

Company	Industry	Rationale
Stripe	`fintech`	Core is payments, even though they have dev tools
Uber	`mobility`	Core is transportation
Airbnb	`proptech`	Real estate marketplace
Amazon (AWS jobs)	`devtools` or `data_infra`	Depends on specific role

Edge Case Resolution

Decision Framework

When encountering ambiguous roles:

1. Check title patterns against known signals
2. Analyze job description for disambiguating keywords
3. Look at team/department placement
4. Consider required tools/skills
5. Apply "where would the practitioner self-identify?" test
6. If still ambiguous, document and classify to best fit
7. Flag for quarterly taxonomy review
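
The early steps of this framework can be encoded as explicit rules that run before any general scoring. A minimal sketch covering the documented edge cases in this section; the `resolve_edge_case` helper is hypothetical, and the terms used to decide "model-focused" are an assumption borrowed from the ML Engineer keywords:

```python
def resolve_edge_case(title: str, description: str):
    """Apply documented edge-case rules; return a subfamily, "out_of_scope", or None."""
    t, d = title.lower(), description.lower()
    if "ai engineer" in t:
        # Assumption: these terms stand in for "model-focused".
        model_terms = ("fine-tuning", "model deployment", "model serving", "mlops")
        return "ml_engineer" if any(term in d for term in model_terms) else "out_of_scope"
    if "data analyst" in t and "dbt" in d:
        return "analytics_engineer"
    if "business intelligence engineer" in t:
        return "data_analyst"
    if "applied scientist" in t:
        return "data_scientist"
    if "product analyst" in t:
        return "product_analytics"
    if "growth engineer" in t or "technical program manager" in t:
        return "out_of_scope"
    return None  # no documented rule applies; fall through to general classification
```

Returning `None` keeps the rule layer separate from the fallback classifier, so new edge cases can be added without touching the core logic.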

Documented Edge Cases

Role Pattern	Decision	Rationale
"AI Engineer"	ML Engineer OR out_of_scope	If model-focused → ML Engineer; if API integration only → out_of_scope
"Data Analyst" with dbt	Analytics Engineer	dbt is strong signal for AE over DA
"Business Intelligence Engineer"	Data Analyst	Despite "engineer" title, typically dashboard/reporting focused
"Applied Scientist"	Data Scientist	Amazon-specific title; responsibilities align with DS
"Product Analyst"	Product Analytics	Distinct from generic Data Analyst by product focus
"Growth Engineer"	out_of_scope	Engineering role, not data/product
"Technical Program Manager"	out_of_scope	Program management, not product management

Geographic Variations

Term	US Meaning	UK Meaning	Resolution
"Data Scientist"	Often ML-heavy	Sometimes more analytics	Check for ML signals
"Analyst"	Entry-level connotation	Can be senior	Use seniority signals

Ontology Design (Future: Semantic Search)

Current State: Taxonomy

Hierarchical classification
├── Fixed categories
├── Rule-based assignment
└── Exact match on signals

Future State: Ontology

Semantic network
├── Entities with relationships
├── Embedding-based similarity
├── Natural language queries
└── Fuzzy matching with confidence

Preparation Steps

1. Rich Entity Descriptions

Every category needs a natural language description suitable for embedding:

yaml

analytics_engineer:
  embedding_description: |
    An Analytics Engineer builds and maintains the data transformation 
    layer between raw data sources and analyst-ready datasets. They 
    typically work with tools like dbt to create reusable data models, 
    define business metrics in a semantic layer, and ensure data quality 
    through testing. They bridge the gap between Data Engineers who 
    build pipelines and Data Analysts who consume clean data.
    
    Related roles: Data Analyst, Data Engineer, BI Developer
    Key differentiator: Focuses on transformation and modeling, not 
    pipeline infrastructure or end-user dashboards.

2. Relationship Types

yaml

relationships:
  is_a:
    description: "Hierarchical parent-child"
    example: "Analytics Engineer IS_A Data Role"
    
  related_to:
    description: "Conceptually similar, often confused"
    example: "Analytics Engineer RELATED_TO Data Analyst"
    
  requires_skill:
    description: "Role typically requires this skill"
    example: "Analytics Engineer REQUIRES_SKILL dbt"
    
  collaborates_with:
    description: "Roles that frequently work together"
    example: "Analytics Engineer COLLABORATES_WITH Data Scientist"
    
  progression_to:
    description: "Common career progression"
    example: "Data Analyst PROGRESSION_TO Analytics Engineer"
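
These relationship types map naturally onto subject-relation-object triples. A minimal sketch of an in-memory index built from the examples above; the `build_index` helper and the triple layout are assumptions, not a prescribed storage format:

```python
from collections import defaultdict

# Triples taken from the examples above: (subject, relation, object).
TRIPLES = [
    ("Analytics Engineer", "is_a", "Data Role"),
    ("Analytics Engineer", "related_to", "Data Analyst"),
    ("Analytics Engineer", "requires_skill", "dbt"),
    ("Analytics Engineer", "collaborates_with", "Data Scientist"),
    ("Data Analyst", "progression_to", "Analytics Engineer"),
]


def build_index(triples):
    """Index triples by (subject, relation) for direct lookups."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[(subject, relation)].append(obj)
    return index


index = build_index(TRIPLES)
next_roles = index[("Data Analyst", "progression_to")]
```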

3. Embedding Strategy

yaml

embedding_approach:
  model: "text-embedding-3-small" # or similar
  
  what_to_embed:
    - role descriptions (paragraph form)
    - skill descriptions
    - job posting titles + first 500 chars
    
  similarity_thresholds:
    high_confidence: 0.85
    medium_confidence: 0.70
    needs_review: 0.50
    
  use_cases:
    - "Find roles similar to Analytics Engineer"
    - "What skills are adjacent to dbt?"
    - "Candidates with X skills might fit Y roles"
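
The thresholds above could be applied to cosine similarity scores as follows. A minimal sketch in pure Python, assuming each threshold is an inclusive lower bound and that vectors are non-zero; the `no_match` label for scores below every threshold is hypothetical:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def confidence_band(score):
    """Map a similarity score to the bands defined in similarity_thresholds."""
    if score >= 0.85:
        return "high_confidence"
    if score >= 0.70:
        return "medium_confidence"
    if score >= 0.50:
        return "needs_review"
    return "no_match"  # hypothetical label for scores below every threshold
```

Real embedding vectors would come from the chosen model; the thresholds themselves would likely need tuning per model, since similarity distributions differ.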

4. Query Patterns (Future)

Natural language queries the ontology should support:

"Show me roles that are like Data Scientist but more engineering-focused"
→ Returns: ML Engineer, Analytics Engineer

"What skills should a Data Analyst learn to become an Analytics Engineer?"
→ Returns: dbt, data modeling, SQL (advanced), Git

"Find companies where Analytics Engineers report to Engineering not Analytics"
→ Returns: [requires company org data]

"Which roles commonly transition to Product Management?"
→ Returns: Data Analyst, Product Analytics, Data Scientist

Taxonomy Maintenance

Review Cadence

Review Type	Frequency	Focus
Edge case log review	Weekly	Resolve accumulated ambiguities
Coverage analysis	Monthly	Identify gaps, new role patterns
Signal effectiveness	Monthly	Which signals are predictive?
Full taxonomy review	Quarterly	Add/remove/restructure categories
Skill taxonomy update	Quarterly	New tools, deprecated skills

Metrics to Track

Metric	Target	Action if Target Missed
Classification confidence (avg)	>0.85	Review low-confidence patterns
out_of_scope rate	<15%	Consider new categories
Edge case backlog	<20 unresolved	Schedule resolution session
Reclassification rate	<5%	Investigate unstable categories
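
A periodic check against these targets might be sketched like this. The metric key names and the `missed_targets` helper are hypothetical; the limits come from the table above, with rates expressed as fractions:

```python
# ("min", x) means the metric should exceed x; ("max", x) means it should stay below x.
TARGETS = {
    "classification_confidence_avg": ("min", 0.85),
    "out_of_scope_rate": ("max", 0.15),
    "edge_case_backlog": ("max", 20),
    "reclassification_rate": ("max", 0.05),
}


def missed_targets(metrics: dict) -> list[str]:
    """Return the names of metrics that do not meet their target."""
    missed = []
    for name, (kind, limit) in TARGETS.items():
        value = metrics[name]
        ok = value > limit if kind == "min" else value < limit
        if not ok:
            missed.append(name)
    return missed
```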

Change Log Template

markdown

## Taxonomy Change Log

### [Date] - v1.X.X

**Added:**
- [New category/skill] - Rationale: [why needed]

**Changed:**
- [Category] - [What changed] - Rationale: [why]

**Removed:**
- [Category/skill] - Rationale: [why deprecated]

**Edge Cases Resolved:**
- [Role pattern] → Now classifies as [category]

**Open Questions:**
- [Unresolved issue for next review]

Output Formats

Classification Decision

markdown

## Classification: [Job Title]

**Input:** [Raw title and key description excerpts]

**Decision:**
- Family: [product/data]
- Subfamily: [specific category]
- Seniority: [level]
- Confidence: [high/medium/low]

**Signals Found:**
- Title: [matching patterns]
- Keywords: [matching terms]
- Tools: [specific tools mentioned]

**Disambiguation Notes:**
[If edge case, explain reasoning]

**Flags:**
- [ ] Needs human review
- [ ] New pattern for taxonomy consideration

Taxonomy Gap Analysis

markdown

## Gap Analysis: [Date]

**Coverage Summary:**
- Total roles analyzed: [N]
- Successfully classified: [N] ([%])
- Out of scope: [N] ([%])
- Low confidence: [N] ([%])

**Emerging Patterns:**
| Pattern | Frequency | Suggested Action |
|---------|-----------|------------------|
| [New title pattern] | [N] | [Add category / Add signal / Monitor] |

**Problem Categories:**
| Category | Issue | Recommendation |
|----------|-------|----------------|
| [Category] | [High confusion rate with X] | [Improve signals / Merge / Split] |

**Skill Gaps:**
- [New skills appearing frequently but not in taxonomy]

**Recommendations:**
1. [Specific change]
2. [Specific change]

Integration Points

With Classifier (Claude Haiku)

The taxonomy informs the classification prompt:

python

TAXONOMY_CONTEXT = """
Valid subfamilies for Data roles:
- product_analytics: Product metrics, experiments, user behavior
- data_analyst: Business reporting, dashboards, BI tools
- analytics_engineer: dbt, metrics layer, data modeling
- data_engineer: Pipelines, ETL/ELT, data infrastructure
- ml_engineer: Production ML, MLOps, model deployment
- data_scientist: Statistical modeling, predictions
- research_scientist: Novel ML research, publications
- data_architect: Data strategy, governance

Classification rules:
- "AI Engineer" → ml_engineer if model-focused, else out_of_scope
- Presence of "dbt" strongly indicates analytics_engineer
- "Business Analyst" → data_analyst unless product-focused
"""

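To keep the prompt in step with the taxonomy, the context string could be generated rather than hand-written. A minimal sketch, assuming subfamily descriptions and rules are available as plain Python structures; the `build_taxonomy_context` helper and its inputs are hypothetical:

```python
# Excerpt of the subfamily descriptions and rules used in the prompt above.
SUBFAMILY_DESCRIPTIONS = {
    "data_analyst": "Business reporting, dashboards, BI tools",
    "analytics_engineer": "dbt, metrics layer, data modeling",
}
RULES = ['Presence of "dbt" strongly indicates analytics_engineer']


def build_taxonomy_context(descriptions, rules, family_label="Data"):
    """Render subfamily descriptions and rules in the prompt layout shown above."""
    lines = [f"Valid subfamilies for {family_label} roles:"]
    lines += [f"- {code}: {text}" for code, text in descriptions.items()]
    lines += ["", "Classification rules:"]
    lines += [f"- {rule}" for rule in rules]
    return "\n".join(lines)


context = build_taxonomy_context(SUBFAMILY_DESCRIPTIONS, RULES)
```
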
With Job Feed (Filtering)

Taxonomy enables precise filtering:

sql

-- User selects "Analytics Engineer" 
-- Only returns exact subfamily match, not "Data Analyst"
WHERE job_subfamily = 'analytics_engineer'

-- User selects "Data" family
-- Returns all data subfamilies
WHERE job_family = 'data'
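
The same filters can be exercised end to end with parameterized queries. A minimal sketch using an in-memory SQLite database; the `jobs` table, its `title` column, and the helper names are assumptions, while `job_family` and `job_subfamily` come from the filters above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; only job_family and job_subfamily come from the text above.
conn.execute("CREATE TABLE jobs (title TEXT, job_family TEXT, job_subfamily TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [
        ("Analytics Engineer", "data", "analytics_engineer"),
        ("Data Analyst", "data", "data_analyst"),
        ("Growth PM", "product", "growth_pm"),
    ],
)


def titles_by_subfamily(subfamily):
    """Exact subfamily match, as when a user selects one subfamily."""
    rows = conn.execute(
        "SELECT title FROM jobs WHERE job_subfamily = ? ORDER BY title", (subfamily,)
    )
    return [title for (title,) in rows]


def titles_by_family(family):
    """Family match, returning every subfamily under that family."""
    rows = conn.execute(
        "SELECT title FROM jobs WHERE job_family = ? ORDER BY title", (family,)
    )
    return [title for (title,) in rows]
```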

With Semantic Search (Future)

python

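# Current: exact match
# `db` is a placeholder for the application's database client
results = db.query("job_subfamily = 'analytics_engineer'")

# Future: semantic similarity
# `embed` and `vector_search` are placeholders for an embedding call and a vector index lookup
query_embedding = embed("data transformation and metrics modeling role")
results = vector_search(query_embedding, threshold=0.8)
# Returns: analytics_engineer, data_engineer (lower score)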
