Agent skill

taxonomy-architect

Design and maintain classification systems for jobs, skills, and companies. Use when defining categories, resolving edge cases, planning ontology structures, or preparing for semantic search capabilities.

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/development/taxonomy-architect

SKILL.md

Taxonomy Architect

Design, maintain, and evolve classification systems that power job matching, skill analysis, and company categorization. Ensure taxonomies are precise, consistent, and extensible toward future semantic search capabilities.

When to Use This Skill

Trigger when the user asks to:

  • Define or refine job role categories (families, subfamilies)
  • Create or update skill taxonomies
  • Classify ambiguous roles or edge cases
  • Design company categorization schemes
  • Plan ontology structures for semantic search
  • Resolve classification conflicts or inconsistencies
  • Evaluate taxonomy coverage and gaps
  • Prepare embeddings or vector search strategies

Core Principles

1. Mutually Exclusive, Collectively Exhaustive (MECE)

Categories at the same level should not overlap, and together should cover all cases.

BAD:                          GOOD:
├── Data Analyst              ├── Data Analyst
├── Business Analyst          ├── Analytics Engineer
├── Analytics (overlap!)      ├── Data Engineer
└── BI Developer (overlap!)   └── Data Scientist
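
One way to spot overlap mechanically is to look for signals claimed by more than one sibling category. The sketch below is a minimal illustration, not part of the skill itself: the function name `find_overlapping_signals` and the sample `bi_developer` category are hypothetical.

```python
from collections import defaultdict

def find_overlapping_signals(categories):
    """Return every signal claimed by more than one sibling category."""
    owners = defaultdict(list)
    for category, signals in categories.items():
        # Compare case-insensitively; set() avoids double-counting repeats
        for signal in {s.lower() for s in signals}:
            owners[signal].append(category)
    return {s: sorted(c) for s, c in owners.items() if len(c) > 1}

# Hypothetical sibling categories with their title signals
siblings = {
    "data_analyst": ["Data Analyst", "BI Analyst"],
    "analytics_engineer": ["Analytics Engineer"],
    "bi_developer": ["BI Developer", "BI Analyst"],
}
overlaps = find_overlapping_signals(siblings)
print(overlaps)
```

A non-empty result means two siblings compete for the same title, which is a hint that the level is not mutually exclusive.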

2. User Mental Model Alignment

Categories should match how practitioners describe themselves, not internal corporate structures.

BAD (org-chart thinking):     GOOD (practitioner thinking):
├── Engineering               ├── Data Engineer
│   └── Data                  ├── ML Engineer
├── Analytics                 ├── Analytics Engineer
│   └── Data                  └── Data Analyst
└── Science
    └── Data

3. Stable Core, Flexible Edges

Core categories should be stable over time. Edge cases and emerging roles should be handled without restructuring the core.

STABLE CORE:                  EDGE HANDLING:
├── Data Engineer             "AI Engineer" → classify as:
├── ML Engineer               - ML Engineer (if model-focused)
├── Data Scientist            - out_of_scope (if API integration)
└── Analytics Engineer        Document decision, revisit quarterly

4. Evidence-Based Boundaries

Category boundaries should be defined by observable signals in job postings, not assumptions.

| Signal Type | Examples |
|-------------|----------|
| Title patterns | "Analytics Engineer" vs "Data Analyst" |
| Tool requirements | dbt, Airflow, Spark → Data/Analytics Engineer |
| Responsibility keywords | "build pipelines" vs "create dashboards" |
| Team placement | "Data Platform team" vs "Business Intelligence" |
| Seniority markers | "Principal", "Staff", "Lead", "Senior", "Junior" |

5. Semantic Readiness

Design with future embedding/vector search in mind. Categories should be describable in natural language that captures semantic meaning.

GOOD (embeddable description):
"Analytics Engineer: Builds and maintains data transformation 
pipelines using tools like dbt, creates metrics layers and 
semantic models, bridges raw data and analyst-ready datasets."

BAD (list of keywords):
"Analytics Engineer: dbt, SQL, data modeling, metrics"

Current Taxonomy (v1.5)

Job Families & Subfamilies

yaml
job_families:
  product:
    description: "Roles focused on product strategy, discovery, and delivery"
    subfamilies:
      core_pm:
        label: "Core PM"
        description: "General product management for user-facing features"
        signals:
          titles: ["Product Manager", "PM", "Product Lead"]
          keywords: ["roadmap", "user stories", "stakeholders", "prioritization"]
          anti_signals: ["growth", "platform", "API", "ML", "AI"]
        
      growth_pm:
        label: "Growth PM"
        description: "Acquisition, retention, monetization, conversion optimization"
        signals:
          titles: ["Growth PM", "Growth Product Manager"]
          keywords: ["acquisition", "retention", "conversion", "funnel", "experimentation", "A/B testing"]
          
      platform_pm:
        label: "Platform PM"
        description: "Developer tools, APIs, infrastructure products"
        signals:
          titles: ["Platform PM", "API PM", "Developer Experience PM"]
          keywords: ["API", "SDK", "developer", "platform", "infrastructure", "internal tools"]
          
      technical_pm:
        label: "Technical PM"
        description: "Deep technical skills required, often ex-engineers"
        signals:
          titles: ["Technical Product Manager", "TPM"]
          keywords: ["technical requirements", "engineering background", "system design"]
          
      ai_ml_pm:
        label: "AI/ML PM"
        description: "AI/ML products, models, data products"
        signals:
          titles: ["AI PM", "ML PM", "AI Product Manager", "Data Product Manager"]
          keywords: ["machine learning", "AI", "model", "LLM", "GenAI", "data product"]

  data:
    description: "Roles focused on data infrastructure, analysis, and machine learning"
    subfamilies:
      product_analytics:
        label: "Product Analytics"
        description: "Product metrics, experiments, user behavior, growth analytics"
        signals:
          titles: ["Product Analyst", "Growth Analyst", "Product Data Analyst"]
          keywords: ["product metrics", "experimentation", "user behavior", "funnel analysis", "Amplitude", "Mixpanel"]
          anti_signals: ["pipeline", "infrastructure", "model training"]
          
      data_analyst:
        label: "Data Analyst"
        description: "Business reporting, dashboards, SQL analysis, BI tools"
        signals:
          titles: ["Data Analyst", "Business Analyst", "BI Analyst", "Reporting Analyst"]
          keywords: ["dashboards", "reporting", "Tableau", "Power BI", "Looker", "business intelligence"]
          anti_signals: ["dbt", "pipeline", "modeling layer"]
          
      analytics_engineer:
        label: "Analytics Engineer"
        description: "dbt, metrics layer, data modeling, semantic layer"
        signals:
          titles: ["Analytics Engineer", "Data Modeling Engineer"]
          keywords: ["dbt", "data modeling", "metrics layer", "semantic layer", "transformation"]
          disambiguate_from: ["data_analyst", "data_engineer"]
          
      data_engineer:
        label: "Data Engineer"
        description: "Pipelines, infrastructure, ETL/ELT, big data"
        signals:
          titles: ["Data Engineer", "ETL Developer", "Data Platform Engineer"]
          keywords: ["pipeline", "Airflow", "Spark", "ETL", "ELT", "data infrastructure", "Kafka"]
          
      ml_engineer:
        label: "ML Engineer"
        description: "Production ML systems, MLOps, includes LLM/GenAI implementation"
        signals:
          titles: ["ML Engineer", "Machine Learning Engineer", "MLOps Engineer", "AI Engineer"]
          keywords: ["model deployment", "MLOps", "feature store", "model serving", "LLM", "fine-tuning"]
          notes: "AI Engineer roles classify here if model-focused; out_of_scope if primarily API integration"
          
      data_scientist:
        label: "Data Scientist"
        description: "Statistical modeling, predictions, business insights"
        signals:
          titles: ["Data Scientist", "Senior Data Scientist", "Applied Scientist"]
          keywords: ["statistical modeling", "prediction", "regression", "classification", "causal inference"]
          disambiguate_from: ["ml_engineer", "research_scientist"]
          
      research_scientist:
        label: "Research Scientist (ML/AI)"
        description: "Novel ML research, publications, pushing state-of-the-art"
        signals:
          titles: ["Research Scientist", "ML Researcher", "AI Researcher"]
          keywords: ["publications", "novel", "state-of-the-art", "research", "PhD"]
          
      data_architect:
        label: "Data Architect"
        description: "Data strategy, governance, platform design"
        signals:
          titles: ["Data Architect", "Enterprise Data Architect", "Data Governance Lead"]
          keywords: ["data strategy", "governance", "data catalog", "metadata", "architecture"]
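
As a rough illustration of how `keywords` and `anti_signals` might be combined, the sketch below scores a posting against two of the subfamilies above. The scoring rule (+1 per keyword hit, -1 per anti-signal hit), the plain substring matching, and the names `SUBFAMILY_SIGNALS` and `score_subfamilies` are assumptions made for this example, not a documented algorithm.

```python
# Subset of the taxonomy above, copied by hand for the example
SUBFAMILY_SIGNALS = {
    "data_analyst": {
        "keywords": ["dashboards", "reporting", "Tableau", "Power BI",
                     "Looker", "business intelligence"],
        "anti_signals": ["dbt", "pipeline", "modeling layer"],
    },
    "analytics_engineer": {
        "keywords": ["dbt", "data modeling", "metrics layer",
                     "semantic layer", "transformation"],
        "anti_signals": [],
    },
}

def score_subfamilies(text, signals=SUBFAMILY_SIGNALS):
    """Score each subfamily: +1 per keyword found, -1 per anti-signal found."""
    lowered = text.lower()
    scores = {}
    for name, spec in signals.items():
        hits = sum(kw.lower() in lowered for kw in spec["keywords"])
        penalties = sum(kw.lower() in lowered for kw in spec["anti_signals"])
        scores[name] = hits - penalties
    return scores

posting = "Own our dbt project, build the metrics layer, and support reporting dashboards."
scores = score_subfamilies(posting)
best = max(scores, key=scores.get)
print(scores, best)
```

Substring matching is deliberately naive here; a real classifier would likely need word boundaries and weights tuned against labeled postings.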

Seniority Levels

yaml
seniority:
  junior:
    label: "Junior"
    signals:
      titles: ["Junior", "Jr", "Associate", "Entry Level", "Graduate"]
      experience: ["0-2 years", "entry level", "new grad"]
      
  mid:
    label: "Mid-Level"
    signals:
      titles: ["Data Engineer", "Product Manager"] # No prefix = usually mid
      experience: ["2-5 years", "3+ years"]
      
  senior:
    label: "Senior"
    signals:
      titles: ["Senior", "Sr", "Lead"]
      experience: ["5+ years", "7+ years"]
      
  staff_plus:
    label: "Staff+"
    signals:
      titles: ["Staff", "Principal", "Distinguished", "Architect", "Director"]
      experience: ["10+ years", "extensive experience"]
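
The title signals above could be applied roughly as follows. This is a sketch under two assumptions that the YAML does not state: levels are checked from most to least senior, and a title with no marker falls back to `mid`. The function name `infer_seniority` is hypothetical.

```python
import re

# Checked from most to least senior so that "Senior Staff Engineer"
# resolves to staff_plus rather than senior (an assumed precedence).
SENIORITY_TITLE_SIGNALS = [
    ("staff_plus", ["Staff", "Principal", "Distinguished", "Architect", "Director"]),
    ("senior", ["Senior", "Sr", "Lead"]),
    ("junior", ["Junior", "Jr", "Associate", "Entry Level", "Graduate"]),
]

def infer_seniority(title):
    """Return the first level whose marker appears as a whole word in the title."""
    for level, markers in SENIORITY_TITLE_SIGNALS:
        for marker in markers:
            if re.search(rf"\b{re.escape(marker)}\b", title, flags=re.IGNORECASE):
                return level
    return "mid"  # no prefix = usually mid

for title in ["Sr. Data Engineer", "Data Engineer", "Senior Staff Engineer"]:
    print(title, "->", infer_seniority(title))
```

Whole-word matching keeps "Leadership" from triggering the "Lead" marker. Note that with the markers as listed, a title such as "Data Architect" resolves to `staff_plus`.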

Working Arrangement

yaml
working_arrangement:
  onsite:
    label: "Onsite"
    signals: ["on-site", "in-office", "office-based", "in-person"]
    
  hybrid:
    label: "Hybrid"
    signals: ["hybrid", "flexible", "2-3 days in office", "partial remote"]
    
  remote:
    label: "Remote"
    signals: ["remote", "work from home", "distributed", "anywhere"]
    qualifiers: ["remote (US only)", "remote (timezone restricted)"]

Skills Taxonomy

Structure

yaml
skills:
  parent_categories:
    product:
      label: "Product Skills"
      families:
        - discovery_research
        - execution_delivery
        - experimentation
        - analytics_pm
        - stakeholder_mgmt
        
    data_ml:
      label: "Data/ML Skills"
      families:
        - programming
        - analytics_stats
        - classical_ml
        - deep_learning
        - llm_genai
        - big_data
        - pipelines_orchestration
        - data_modeling
        - warehouses_lakes
        - mlops
        - cloud
        - streaming
        - visualization
        
    platform_infra:
      label: "Platform/Infra Skills"
      families:
        - deployment
        - infrastructure_code
        - ci_cd
        - monitoring

Skill Family Details

yaml
data_ml:
  programming:
    label: "Programming Languages"
    skills: ["Python", "R", "SQL", "Scala", "Java", "Julia"]
    notes: "SQL is both a language and a skill; always extract"
    
  analytics_stats:
    label: "Analytics & Statistics"
    skills: ["Statistics", "Probability", "Regression", "Causal inference", 
             "Time series", "Hypothesis testing", "Bayesian analysis"]
             
  classical_ml:
    label: "Classical Machine Learning"
    skills: ["Scikit-learn", "XGBoost", "LightGBM", "Random Forest",
             "Logistic regression", "SVM", "Feature engineering"]
             
  deep_learning:
    label: "Deep Learning"
    skills: ["PyTorch", "TensorFlow", "Keras", "Neural networks",
             "CNNs", "RNNs", "Computer vision", "NLP"]
             
  llm_genai:
    label: "LLM/GenAI"
    skills: ["LLMs", "Transformers", "GPT", "BERT", "Claude",
             "Prompt engineering", "RAG", "Vector databases",
             "LangChain", "Embeddings", "Fine-tuning"]
    notes: "Fast-evolving category; review quarterly"
    
  big_data:
    label: "Big Data Processing"
    skills: ["Spark", "PySpark", "Hadoop", "Hive", "Presto", "Flink"]
    
  pipelines_orchestration:
    label: "Pipelines & Orchestration"
    skills: ["Airflow", "Dagster", "Prefect", "Luigi",
             "Data pipelines", "ETL", "ELT"]
             
  data_modeling:
    label: "Data Modeling"
    skills: ["dbt", "Data modeling", "Dimensional modeling",
             "Star schema", "Data warehouse design"]
             
  warehouses_lakes:
    label: "Warehouses & Lakes"
    skills: ["Snowflake", "BigQuery", "Redshift", "Databricks",
             "Athena", "Delta Lake", "Data lake"]
             
  mlops:
    label: "MLOps"
    skills: ["MLflow", "Kubeflow", "Model serving", "Model monitoring",
             "Feature stores", "Model registry", "Weights & Biases"]
             
  cloud:
    label: "Cloud Platforms"
    skills: ["AWS", "GCP", "Azure", "S3", "EC2", "Lambda", "Cloud Functions"]
    
  streaming:
    label: "Streaming"
    skills: ["Kafka", "Kinesis", "Pub/Sub", "Real-time processing"]
    
  visualization:
    label: "Data Visualization"
    skills: ["Tableau", "Power BI", "Looker", "Metabase",
             "Plotly", "Matplotlib", "Seaborn"]
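
Skill families like these can drive a simple extractor. The sketch below uses a hand-copied subset of the families and whole-token regex matching; the function name `extract_skills` and the choice to match single-letter skills such as "R" case-sensitively are assumptions for illustration.

```python
import re

# Subset of the skill families above, copied by hand for the example
SKILL_FAMILIES = {
    "programming": ["Python", "R", "SQL", "Scala", "Java", "Julia"],
    "data_modeling": ["dbt", "Data modeling", "Dimensional modeling", "Star schema"],
    "pipelines_orchestration": ["Airflow", "Dagster", "Prefect", "Luigi"],
}

def extract_skills(text):
    """Map each skill family to the skills mentioned in the text."""
    found = {}
    for family, skills in SKILL_FAMILIES.items():
        for skill in skills:
            # Single-letter names ("R") stay case-sensitive to limit false hits
            flags = 0 if len(skill) == 1 else re.IGNORECASE
            # Lookarounds require the skill to stand alone, so "Java"
            # does not match inside "JavaScript"
            pattern = rf"(?<!\w){re.escape(skill)}(?!\w)"
            if re.search(pattern, text, flags):
                found.setdefault(family, []).append(skill)
    return found

text = "We use Python, SQL and dbt on Airflow; JavaScript is a plus."
skills = extract_skills(text)
print(skills)
```

Short names remain error-prone even with these guards (for example, "R&D" would still match "R"), so low-precision skills may need extra context checks.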

Company/Employer Taxonomy

System of Record: docs/schema_taxonomy.yaml (see enums.employer_industry)

Employer Industry (20 Domain-Focused Categories)

Design Decision: These are industry VERTICALS, not business models. "B2B SaaS" was intentionally excluded - it's a business model that spans multiple industries. A company like Stripe is fintech even though it sells B2B SaaS.

| Code | Label | Examples |
|------|-------|----------|
| fintech | FinTech | Stripe, Monzo, Affirm, Plaid |
| healthtech | HealthTech | Flatiron, Omada, Oscar |
| ecommerce | E-commerce & Marketplace | Instacart, Deliveroo, Etsy |
| ai_ml | AI/ML | OpenAI, Anthropic, Harvey AI |
| consumer | Consumer Tech | Spotify, Reddit, Strava |
| mobility | Mobility & Logistics | Uber, Waymo, Zipline |
| proptech | PropTech | Airbnb, Zillow, CoStar |
| edtech | EdTech | Coursera, Duolingo |
| climate | Climate Tech | Watershed, Crusoe |
| crypto | Crypto & Web3 | Coinbase, Kraken |
| devtools | Developer Tools | GitHub, Vercel, Linear |
| data_infra | Data Infrastructure | Snowflake, Databricks, dbt Labs |
| cybersecurity | Cybersecurity | Okta, Vanta, 1Password |
| hr_tech | HR Tech | Rippling, Gusto, Deel |
| martech | Marketing Tech | Braze, Amplitude, HubSpot |
| professional_services | Professional Services | Deloitte, Accenture |
| productivity | Productivity & Collaboration | Notion, Asana, Airtable, Calendly |
| hardware | Hardware & Robotics | Apple, Gecko Robotics |
| other | Other | Catch-all |

Employer Size

| Code | Label | Signals |
|------|-------|---------|
| startup | Startup (1-50) | seed, series A, early stage |
| scaleup | Scale-up (51-500) | series B/C, growth stage |
| enterprise | Enterprise (501+) | public, Fortune 500, established |
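
When a headcount is known, the size bucket can be assigned directly. The sketch below assumes that a company with exactly 500 employees counts as a scale-up, so the buckets do not overlap; the function name `employer_size` is hypothetical.

```python
def employer_size(headcount):
    """Map an employee count to a size code from the table above."""
    if headcount < 1:
        raise ValueError("headcount must be at least 1")
    if headcount <= 50:
        return "startup"
    if headcount <= 500:  # assumption: exactly 500 is still a scale-up
        return "scaleup"
    return "enterprise"

for n in (12, 51, 500, 501):
    print(n, "->", employer_size(n))
```

Funding-stage signals such as "series B" would still be needed for postings that do not state a headcount.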

Multi-Industry Companies

Some companies span multiple industries. Classification rules:

  1. Single primary industry - Each company gets ONE industry value (MECE)
  2. Classify by core product/revenue - Stripe is fintech (payments), not devtools
  3. For conglomerates - Classify by the division most relevant to the job posting

| Company | Industry | Rationale |
|---------|----------|-----------|
| Stripe | fintech | Core is payments, even though they have dev tools |
| Uber | mobility | Core is transportation |
| Airbnb | proptech | Real estate marketplace |
| Amazon (AWS jobs) | devtools or data_infra | Depends on specific role |

Edge Case Resolution

Decision Framework

When encountering ambiguous roles:

1. Check title patterns against known signals
2. Analyze job description for disambiguating keywords
3. Look at team/department placement
4. Consider required tools/skills
5. Apply "where would the practitioner self-identify?" test
6. If still ambiguous, document and classify to best fit
7. Flag for quarterly taxonomy review

Documented Edge Cases

| Role Pattern | Decision | Rationale |
|--------------|----------|-----------|
| "AI Engineer" | ML Engineer OR out_of_scope | If model-focused → ML Engineer; if API integration only → out_of_scope |
| "Data Analyst" with dbt | Analytics Engineer | dbt is strong signal for AE over DA |
| "Business Intelligence Engineer" | Data Analyst | Despite "engineer" title, typically dashboard/reporting focused |
| "Applied Scientist" | Data Scientist | Amazon-specific title; responsibilities align with DS |
| "Product Analyst" | Product Analytics | Distinct from generic Data Analyst by product focus |
| "Growth Engineer" | out_of_scope | Engineering role, not data/product |
| "Technical Program Manager" | out_of_scope | Program management, not product management |
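
The documented decisions above can be encoded as explicit overrides that run before general signal matching. This is a minimal sketch: `resolve_edge_case`, `TITLE_OVERRIDES`, and the `MODEL_FOCUS_KEYWORDS` list used to approximate "model-focused" are all hypothetical names and heuristics, not part of the taxonomy.

```python
# Title patterns with a fixed decision in the table above
TITLE_OVERRIDES = {
    "business intelligence engineer": "data_analyst",
    "applied scientist": "data_scientist",
    "product analyst": "product_analytics",
    "growth engineer": "out_of_scope",
    "technical program manager": "out_of_scope",
}

# Assumed stand-in for a "model-focused" check
MODEL_FOCUS_KEYWORDS = ("fine-tuning", "model training", "model serving", "model deployment")

def resolve_edge_case(title, description):
    """Return a subfamily for documented edge cases, or None if no rule applies."""
    t, d = title.lower(), description.lower()
    if "ai engineer" in t:
        model_focused = any(k in d for k in MODEL_FOCUS_KEYWORDS)
        return "ml_engineer" if model_focused else "out_of_scope"
    if "data analyst" in t and "dbt" in d:
        return "analytics_engineer"
    for pattern, subfamily in TITLE_OVERRIDES.items():
        if pattern in t:
            return subfamily
    return None  # fall through to the regular classifier

print(resolve_edge_case("AI Engineer", "Integrate third-party LLM APIs into our app"))
print(resolve_edge_case("Senior Data Analyst", "Maintain dbt models for finance"))
```

Keeping the overrides in one table makes each quarterly review a data change rather than a code change.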

Geographic Variations

| Term | US Meaning | UK Meaning | Resolution |
|------|------------|------------|------------|
| "Data Scientist" | Often ML-heavy | Sometimes more analytics | Check for ML signals |
| "Analyst" | Entry-level connotation | Can be senior | Use seniority signals |
Ontology Design (Future: Semantic Search)

Current State: Taxonomy

Hierarchical classification
├── Fixed categories
├── Rule-based assignment
└── Exact match on signals

Future State: Ontology

Semantic network
├── Entities with relationships
├── Embedding-based similarity
├── Natural language queries
└── Fuzzy matching with confidence

Preparation Steps

1. Rich Entity Descriptions

Every category needs a natural language description suitable for embedding:

yaml
analytics_engineer:
  embedding_description: |
    An Analytics Engineer builds and maintains the data transformation 
    layer between raw data sources and analyst-ready datasets. They 
    typically work with tools like dbt to create reusable data models, 
    define business metrics in a semantic layer, and ensure data quality 
    through testing. They bridge the gap between Data Engineers who 
    build pipelines and Data Analysts who consume clean data.
    
    Related roles: Data Analyst, Data Engineer, BI Developer
    Key differentiator: Focuses on transformation and modeling, not 
    pipeline infrastructure or end-user dashboards.

2. Relationship Types

yaml
relationships:
  is_a:
    description: "Hierarchical parent-child"
    example: "Analytics Engineer IS_A Data Role"
    
  related_to:
    description: "Conceptually similar, often confused"
    example: "Analytics Engineer RELATED_TO Data Analyst"
    
  requires_skill:
    description: "Role typically requires this skill"
    example: "Analytics Engineer REQUIRES_SKILL dbt"
    
  collaborates_with:
    description: "Roles that frequently work together"
    example: "Analytics Engineer COLLABORATES_WITH Data Scientist"
    
  progression_to:
    description: "Common career progression"
    example: "Data Analyst PROGRESSION_TO Analytics Engineer"
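
These relationship types map naturally onto subject-relation-object triples. The sketch below stores the five example statements from the YAML in a nested index; the helper names `build_index` and `objects` are hypothetical, and a production ontology would more likely sit in a graph or relational store.

```python
from collections import defaultdict

# The example statements above, written as (subject, relation, object) triples
TRIPLES = [
    ("Analytics Engineer", "IS_A", "Data Role"),
    ("Analytics Engineer", "RELATED_TO", "Data Analyst"),
    ("Analytics Engineer", "REQUIRES_SKILL", "dbt"),
    ("Analytics Engineer", "COLLABORATES_WITH", "Data Scientist"),
    ("Data Analyst", "PROGRESSION_TO", "Analytics Engineer"),
]

def build_index(triples):
    """Index triples as subject -> relation -> list of objects."""
    index = defaultdict(lambda: defaultdict(list))
    for subject, relation, obj in triples:
        index[subject][relation].append(obj)
    return index

def objects(index, subject, relation):
    """Look up the objects of a relation without creating empty entries."""
    return list(index.get(subject, {}).get(relation, []))

index = build_index(TRIPLES)
print(objects(index, "Analytics Engineer", "REQUIRES_SKILL"))
print(objects(index, "Data Analyst", "PROGRESSION_TO"))
```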

3. Embedding Strategy

yaml
embedding_approach:
  model: "text-embedding-3-small" # or similar
  
  what_to_embed:
    - role descriptions (paragraph form)
    - skill descriptions
    - job posting titles + first 500 chars
    
  similarity_thresholds:
    high_confidence: 0.85
    medium_confidence: 0.70
    needs_review: 0.50
    
  use_cases:
    - "Find roles similar to Analytics Engineer"
    - "What skills are adjacent to dbt?"
    - "Candidates with X skills might fit Y roles"
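
To make the thresholds concrete, the sketch below computes cosine similarity between two vectors and maps the score to a band. It assumes each threshold is an inclusive lower bound and adds a `no_match` label for scores under 0.50; neither detail is specified above, and the two-dimensional vectors are toy values rather than real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def confidence_band(score):
    """Map a similarity score to a band, treating thresholds as lower bounds."""
    if score >= 0.85:
        return "high_confidence"
    if score >= 0.70:
        return "medium_confidence"
    if score >= 0.50:
        return "needs_review"
    return "no_match"  # assumed label for scores below every threshold

role_vector = [1.0, 0.0]     # toy stand-in for a role embedding
posting_vector = [0.8, 0.6]  # toy stand-in for a posting embedding
score = cosine_similarity(role_vector, posting_vector)
print(round(score, 2), confidence_band(score))
```

Useful threshold values depend heavily on the embedding model, so the numbers above are best treated as starting points to calibrate against labeled pairs.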

4. Query Patterns (Future)

Natural language queries the ontology should support:

"Show me roles that are like Data Scientist but more engineering-focused"
→ Returns: ML Engineer, Analytics Engineer

"What skills should a Data Analyst learn to become an Analytics Engineer?"
→ Returns: dbt, data modeling, SQL (advanced), Git

"Find companies where Analytics Engineers report to Engineering not Analytics"
→ Returns: [requires company org data]

"Which roles commonly transition to Product Management?"
→ Returns: Data Analyst, Product Analytics, Data Scientist

Taxonomy Maintenance

Review Cadence

| Review Type | Frequency | Focus |
|-------------|-----------|-------|
| Edge case log review | Weekly | Resolve accumulated ambiguities |
| Coverage analysis | Monthly | Identify gaps, new role patterns |
| Signal effectiveness | Monthly | Which signals are predictive? |
| Full taxonomy review | Quarterly | Add/remove/restructure categories |
| Skill taxonomy update | Quarterly | New tools, deprecated skills |

Metrics to Track

| Metric | Target | Action if Target Missed |
|--------|--------|-------------------------|
| Classification confidence (avg) | >0.85 | Review low-confidence patterns |
| out_of_scope rate | <15% | Consider new categories |
| Edge case backlog | <20 unresolved | Schedule resolution session |
| Reclassification rate | <5% | Investigate unstable categories |
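
A periodic check against these targets might look like the sketch below. Rates are passed as fractions (0.15 for 15%), and the function name `metric_actions` is hypothetical; each action is returned only when its metric misses the target.

```python
def metric_actions(avg_confidence, out_of_scope_rate, edge_case_backlog,
                   reclassification_rate):
    """Return the follow-up action for every metric that misses its target."""
    checks = [
        (avg_confidence > 0.85, "Review low-confidence patterns"),
        (out_of_scope_rate < 0.15, "Consider new categories"),
        (edge_case_backlog < 20, "Schedule resolution session"),
        (reclassification_rate < 0.05, "Investigate unstable categories"),
    ]
    return [action for on_target, action in checks if not on_target]

actions = metric_actions(
    avg_confidence=0.90,
    out_of_scope_rate=0.20,  # above the 15% target
    edge_case_backlog=5,
    reclassification_rate=0.01,
)
print(actions)
```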

Change Log Template

markdown
## Taxonomy Change Log

### [Date] - v1.X.X

**Added:**
- [New category/skill] - Rationale: [why needed]

**Changed:**
- [Category] - [What changed] - Rationale: [why]

**Removed:**
- [Category/skill] - Rationale: [why deprecated]

**Edge Cases Resolved:**
- [Role pattern] → Now classifies as [category]

**Open Questions:**
- [Unresolved issue for next review]

Output Formats

Classification Decision

markdown
## Classification: [Job Title]

**Input:** [Raw title and key description excerpts]

**Decision:**
- Family: [product/data]
- Subfamily: [specific category]
- Seniority: [level]
- Confidence: [high/medium/low]

**Signals Found:**
- Title: [matching patterns]
- Keywords: [matching terms]
- Tools: [specific tools mentioned]

**Disambiguation Notes:**
[If edge case, explain reasoning]

**Flags:**
- [ ] Needs human review
- [ ] New pattern for taxonomy consideration

Taxonomy Gap Analysis

markdown
## Gap Analysis: [Date]

**Coverage Summary:**
- Total roles analyzed: [N]
- Successfully classified: [N] ([%])
- Out of scope: [N] ([%])
- Low confidence: [N] ([%])

**Emerging Patterns:**
| Pattern | Frequency | Suggested Action |
|---------|-----------|------------------|
| [New title pattern] | [N] | [Add category / Add signal / Monitor] |

**Problem Categories:**
| Category | Issue | Recommendation |
|----------|-------|----------------|
| [Category] | [High confusion rate with X] | [Improve signals / Merge / Split] |

**Skill Gaps:**
- [New skills appearing frequently but not in taxonomy]

**Recommendations:**
1. [Specific change]
2. [Specific change]

Integration Points

With Classifier (Claude Haiku)

The taxonomy informs the classification prompt:

python
TAXONOMY_CONTEXT = """
Valid subfamilies for Data roles:
- product_analytics: Product metrics, experiments, user behavior
- data_analyst: Business reporting, dashboards, BI tools
- analytics_engineer: dbt, metrics layer, data modeling
- data_engineer: Pipelines, ETL/ELT, data infrastructure
- ml_engineer: Production ML, MLOps, model deployment
- data_scientist: Statistical modeling, predictions
- research_scientist: Novel ML research, publications
- data_architect: Data strategy, governance

Classification rules:
- "AI Engineer" → ml_engineer if model-focused, else out_of_scope
- Presence of "dbt" strongly indicates analytics_engineer
- "Business Analyst" → data_analyst unless product-focused
"""
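
Hand-maintained prompt text can drift from the taxonomy, so one option is to generate the context string from the same data. The sketch below assumes the subfamily descriptions and rules are already loaded into plain Python structures; `build_taxonomy_context` is a hypothetical helper, not an existing function in the project.

```python
def build_taxonomy_context(family_label, subfamilies, rules):
    """Render subfamily descriptions and rules in the prompt layout shown above."""
    lines = [f"Valid subfamilies for {family_label} roles:"]
    lines += [f"- {code}: {description}" for code, description in subfamilies.items()]
    lines += ["", "Classification rules:"]
    lines += [f"- {rule}" for rule in rules]
    return "\n".join(lines)

context = build_taxonomy_context(
    "Data",
    {
        "data_analyst": "Business reporting, dashboards, BI tools",
        "analytics_engineer": "dbt, metrics layer, data modeling",
    },
    ['Presence of "dbt" strongly indicates analytics_engineer'],
)
print(context)
```

Generating the prompt this way means a taxonomy change only has to be made in one place.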

With Job Feed (Filtering)

Taxonomy enables precise filtering:

sql
-- User selects "Analytics Engineer":
-- returns only the exact subfamily match, not "Data Analyst".
-- (`jobs` is an assumed table name.)
SELECT * FROM jobs
WHERE job_subfamily = 'analytics_engineer';

-- User selects the "Data" family:
-- returns every data subfamily.
SELECT * FROM jobs
WHERE job_family = 'data';
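
The same filters can be exercised end to end with an in-memory SQLite database. In this sketch the `jobs` table, its columns other than `job_family` and `job_subfamily`, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (title TEXT, job_family TEXT, job_subfamily TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [
        ("Analytics Engineer", "data", "analytics_engineer"),
        ("Senior Data Analyst", "data", "data_analyst"),
        ("Growth PM", "product", "growth_pm"),
    ],
)

def titles_by_subfamily(conn, subfamily):
    """Exact subfamily match, using a bound parameter instead of string formatting."""
    rows = conn.execute(
        "SELECT title FROM jobs WHERE job_subfamily = ? ORDER BY title", (subfamily,)
    )
    return [row[0] for row in rows]

def titles_by_family(conn, family):
    """Family-level match returns every subfamily under that family."""
    rows = conn.execute(
        "SELECT title FROM jobs WHERE job_family = ? ORDER BY title", (family,)
    )
    return [row[0] for row in rows]

print(titles_by_subfamily(conn, "analytics_engineer"))
print(titles_by_family(conn, "data"))
```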

With Semantic Search (Future)

python
# Current: exact match on the subfamily column
# (`db`, `embed`, and `vector_search` are placeholders, not a specific library)
results = db.query("job_subfamily = 'analytics_engineer'")

# Future: semantic similarity
query_embedding = embed("data transformation and metrics modeling role")
results = vector_search(query_embedding, threshold=0.8)
# Returns: analytics_engineer, data_engineer (lower score)
