Agent skill
taxonomy-architect
Design and maintain classification systems for jobs, skills, and companies. Use when defining categories, resolving edge cases, planning ontology structures, or preparing for semantic search capabilities.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/development/taxonomy-architect
Taxonomy Architect
Design, maintain, and evolve classification systems that power job matching, skill analysis, and company categorization. Ensure taxonomies are precise, consistent, and extensible toward future semantic search capabilities.
When to Use This Skill
Trigger when user asks to:
- Define or refine job role categories (families, subfamilies)
- Create or update skill taxonomies
- Classify ambiguous roles or edge cases
- Design company categorization schemes
- Plan ontology structures for semantic search
- Resolve classification conflicts or inconsistencies
- Evaluate taxonomy coverage and gaps
- Prepare embeddings or vector search strategies
Core Principles
1. Mutually Exclusive, Collectively Exhaustive (MECE)
Categories at the same level should not overlap, and together should cover all cases.
BAD:
├── Data Analyst
├── Business Analyst
├── Analytics (overlap!)
└── BI Developer (overlap!)
GOOD:
├── Data Analyst
├── Analytics Engineer
├── Data Engineer
└── Data Scientist
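Mutual exclusivity can be spot-checked mechanically. The sketch below assumes each category is represented by a set of title signals; `TITLE_SIGNALS` and `find_overlaps` are illustrative names, not part of the taxonomy. It does not test collective exhaustiveness, which needs real posting data.

```python
from itertools import combinations

# Hypothetical title signals per category (a small excerpt for illustration).
TITLE_SIGNALS = {
    "data_analyst": {"Data Analyst", "BI Analyst"},
    "analytics_engineer": {"Analytics Engineer"},
    "data_engineer": {"Data Engineer", "ETL Developer"},
}


def find_overlaps(signals: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Return every pair of categories that shares at least one title signal."""
    overlaps = {}
    # Sorting by category name makes the pair order deterministic.
    for (a, titles_a), (b, titles_b) in combinations(sorted(signals.items()), 2):
        shared = titles_a & titles_b
        if shared:
            overlaps[(a, b)] = shared
    return overlaps
```

An empty result means no two categories claim the same title; any non-empty entry is a candidate for merging categories or tightening signals.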
2. User Mental Model Alignment
Categories should match how practitioners describe themselves, not internal corporate structures.
BAD (org-chart thinking):
├── Engineering
│   └── Data
├── Analytics
│   └── Data
└── Science
    └── Data
GOOD (practitioner thinking):
├── Data Engineer
├── ML Engineer
├── Analytics Engineer
└── Data Analyst
3. Stable Core, Flexible Edges
Core categories should be stable over time. Edge cases and emerging roles should be handled without restructuring the core.
STABLE CORE:
├── Data Engineer
├── ML Engineer
├── Data Scientist
└── Analytics Engineer
EDGE HANDLING:
"AI Engineer" → classify as:
- ML Engineer (if model-focused)
- out_of_scope (if API integration)
Document decision, revisit quarterly
4. Evidence-Based Boundaries
Category boundaries should be defined by observable signals in job postings, not assumptions.
| Signal Type | Examples |
|---|---|
| Title patterns | "Analytics Engineer" vs "Data Analyst" |
| Tool requirements | dbt, Airflow, Spark → Data/Analytics Engineer |
| Responsibility keywords | "build pipelines" vs "create dashboards" |
| Team placement | "Data Platform team" vs "Business Intelligence" |
| Seniority markers | "Principal", "Staff", "Lead", "Senior", "Junior" |
5. Semantic Readiness
Design with future embedding/vector search in mind. Categories should be describable in natural language that captures semantic meaning.
GOOD (embeddable description):
"Analytics Engineer: Builds and maintains data transformation
pipelines using tools like dbt, creates metrics layers and
semantic models, bridges raw data and analyst-ready datasets."
BAD (list of keywords):
"Analytics Engineer: dbt, SQL, data modeling, metrics"
Current Taxonomy (v1.5)
Job Families & Subfamilies
job_families:
  product:
    description: "Roles focused on product strategy, discovery, and delivery"
    subfamilies:
      core_pm:
        label: "Core PM"
        description: "General product management for user-facing features"
        signals:
          titles: ["Product Manager", "PM", "Product Lead"]
          keywords: ["roadmap", "user stories", "stakeholders", "prioritization"]
        anti_signals: ["growth", "platform", "API", "ML", "AI"]
      growth_pm:
        label: "Growth PM"
        description: "Acquisition, retention, monetization, conversion optimization"
        signals:
          titles: ["Growth PM", "Growth Product Manager"]
          keywords: ["acquisition", "retention", "conversion", "funnel", "experimentation", "A/B testing"]
      platform_pm:
        label: "Platform PM"
        description: "Developer tools, APIs, infrastructure products"
        signals:
          titles: ["Platform PM", "API PM", "Developer Experience PM"]
          keywords: ["API", "SDK", "developer", "platform", "infrastructure", "internal tools"]
      technical_pm:
        label: "Technical PM"
        description: "Deep technical skills required, often ex-engineers"
        signals:
          titles: ["Technical Product Manager", "TPM"]
          keywords: ["technical requirements", "engineering background", "system design"]
      ai_ml_pm:
        label: "AI/ML PM"
        description: "AI/ML products, models, data products"
        signals:
          titles: ["AI PM", "ML PM", "AI Product Manager", "Data Product Manager"]
          keywords: ["machine learning", "AI", "model", "LLM", "GenAI", "data product"]
  data:
    description: "Roles focused on data infrastructure, analysis, and machine learning"
    subfamilies:
      product_analytics:
        label: "Product Analytics"
        description: "Product metrics, experiments, user behavior, growth analytics"
        signals:
          titles: ["Product Analyst", "Growth Analyst", "Product Data Analyst"]
          keywords: ["product metrics", "experimentation", "user behavior", "funnel analysis", "Amplitude", "Mixpanel"]
        anti_signals: ["pipeline", "infrastructure", "model training"]
      data_analyst:
        label: "Data Analyst"
        description: "Business reporting, dashboards, SQL analysis, BI tools"
        signals:
          titles: ["Data Analyst", "Business Analyst", "BI Analyst", "Reporting Analyst"]
          keywords: ["dashboards", "reporting", "Tableau", "Power BI", "Looker", "business intelligence"]
        anti_signals: ["dbt", "pipeline", "modeling layer"]
      analytics_engineer:
        label: "Analytics Engineer"
        description: "dbt, metrics layer, data modeling, semantic layer"
        signals:
          titles: ["Analytics Engineer", "Data Modeling Engineer"]
          keywords: ["dbt", "data modeling", "metrics layer", "semantic layer", "transformation"]
        disambiguate_from: ["data_analyst", "data_engineer"]
      data_engineer:
        label: "Data Engineer"
        description: "Pipelines, infrastructure, ETL/ELT, big data"
        signals:
          titles: ["Data Engineer", "ETL Developer", "Data Platform Engineer"]
          keywords: ["pipeline", "Airflow", "Spark", "ETL", "ELT", "data infrastructure", "Kafka"]
      ml_engineer:
        label: "ML Engineer"
        description: "Production ML systems, MLOps, includes LLM/GenAI implementation"
        signals:
          titles: ["ML Engineer", "Machine Learning Engineer", "MLOps Engineer", "AI Engineer"]
          keywords: ["model deployment", "MLOps", "feature store", "model serving", "LLM", "fine-tuning"]
        notes: "AI Engineer roles classify here if model-focused; out_of_scope if primarily API integration"
      data_scientist:
        label: "Data Scientist"
        description: "Statistical modeling, predictions, business insights"
        signals:
          titles: ["Data Scientist", "Senior Data Scientist", "Applied Scientist"]
          keywords: ["statistical modeling", "prediction", "regression", "classification", "causal inference"]
        disambiguate_from: ["ml_engineer", "research_scientist"]
      research_scientist:
        label: "Research Scientist (ML/AI)"
        description: "Novel ML research, publications, pushing state-of-the-art"
        signals:
          titles: ["Research Scientist", "ML Researcher", "AI Researcher"]
          keywords: ["publications", "novel", "state-of-the-art", "research", "PhD"]
      data_architect:
        label: "Data Architect"
        description: "Data strategy, governance, platform design"
        signals:
          titles: ["Data Architect", "Enterprise Data Architect", "Data Governance Lead"]
          keywords: ["data strategy", "governance", "data catalog", "metadata", "architecture"]
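One way to apply signals and anti-signals is a simple keyword score. The sketch below is only an illustration: the `SUBFAMILIES` dictionary is a flattened excerpt of the data subfamilies, and the weights (+1 per keyword, -2 per anti-signal) and the function names are assumptions, not part of the taxonomy.

```python
# Flattened excerpt of the data subfamilies; keywords are lowercased for matching.
SUBFAMILIES = {
    "data_analyst": {
        "keywords": ["dashboards", "reporting", "tableau", "power bi", "looker"],
        "anti_signals": ["dbt", "pipeline", "modeling layer"],
    },
    "analytics_engineer": {
        "keywords": ["dbt", "data modeling", "metrics layer", "semantic layer"],
        "anti_signals": [],
    },
    "data_engineer": {
        "keywords": ["pipeline", "airflow", "spark", "etl", "kafka"],
        "anti_signals": [],
    },
}


def score_posting(text: str, hit: int = 1, penalty: int = 2) -> dict[str, int]:
    """Score each subfamily: +hit per keyword found, -penalty per anti-signal found."""
    lowered = text.lower()
    scores = {}
    for name, rules in SUBFAMILIES.items():
        positives = sum(keyword in lowered for keyword in rules["keywords"])
        negatives = sum(keyword in lowered for keyword in rules["anti_signals"])
        scores[name] = hit * positives - penalty * negatives
    return scores


def classify(text: str) -> str:
    """Return the top-scoring subfamily, or 'out_of_scope' if nothing scores above zero."""
    scores = score_posting(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "out_of_scope"
```

Plain substring matching keeps the example short but can misfire on short tokens such as "etl"; a production version would likely tokenize the text or use word-boundary patterns.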
Seniority Levels
seniority:
  junior:
    label: "Junior"
    signals:
      titles: ["Junior", "Jr", "Associate", "Entry Level", "Graduate"]
      experience: ["0-2 years", "entry level", "new grad"]
  mid:
    label: "Mid-Level"
    signals:
      titles: ["Data Engineer", "Product Manager"]  # No prefix = usually mid
      experience: ["2-5 years", "3+ years"]
  senior:
    label: "Senior"
    signals:
      titles: ["Senior", "Sr", "Lead"]
      experience: ["5+ years", "7+ years"]
  staff_plus:
    label: "Staff+"
    signals:
      titles: ["Staff", "Principal", "Distinguished", "Architect", "Director"]
      experience: ["10+ years", "extensive experience"]
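The title markers above can be applied with word-boundary matching. A minimal sketch, assuming that the most senior marker wins when several appear and that a title with no marker defaults to mid; `detect_seniority` is an illustrative name.

```python
import re

# Title markers from the seniority levels above, ordered from most to least
# senior so that "Senior Staff Engineer" resolves to staff_plus.
SENIORITY_MARKERS = [
    ("staff_plus", ["staff", "principal", "distinguished", "architect", "director"]),
    ("senior", ["senior", "sr", "lead"]),
    ("junior", ["junior", "jr", "associate", "entry level", "graduate"]),
]


def detect_seniority(title: str) -> str:
    """Return the seniority code for a job title; no marker means 'mid'."""
    lowered = title.lower()
    for level, markers in SENIORITY_MARKERS:
        for marker in markers:
            # \b keeps "lead" from matching inside "leadership".
            if re.search(rf"\b{re.escape(marker)}\b", lowered):
                return level
    return "mid"
```

Note that under these signals a title such as "Data Architect" lands in staff_plus; whether that is desirable is a taxonomy decision, not something the code settles.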
Working Arrangement
working_arrangement:
  onsite:
    label: "Onsite"
    signals: ["on-site", "in-office", "office-based", "in-person"]
  hybrid:
    label: "Hybrid"
    signals: ["hybrid", "flexible", "2-3 days in office", "partial remote"]
  remote:
    label: "Remote"
    signals: ["remote", "work from home", "distributed", "anywhere"]
    qualifiers: ["remote (US only)", "remote (timezone restricted)"]
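Because "partial remote" contains "remote", the order in which signals are checked matters. The sketch below assumes hybrid is checked before remote for that reason; the function name and the `None` fallback are illustrative choices.

```python
from typing import Optional

# Hybrid is checked first so "partial remote" is not swallowed by "remote".
ARRANGEMENT_SIGNALS = [
    ("hybrid", ["hybrid", "flexible", "2-3 days in office", "partial remote"]),
    ("remote", ["remote", "work from home", "distributed", "anywhere"]),
    ("onsite", ["on-site", "in-office", "office-based", "in-person"]),
]


def detect_arrangement(text: str) -> Optional[str]:
    """Return the first arrangement whose signal appears in the text, else None."""
    lowered = text.lower()
    for label, signals in ARRANGEMENT_SIGNALS:
        if any(signal in lowered for signal in signals):
            return label
    return None
```

Broad signals such as "flexible" will also match phrases like "flexible hours", so in practice they may need to be narrowed or down-weighted.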
Skills Taxonomy
Structure
skills:
  parent_categories:
    product:
      label: "Product Skills"
      families:
        - discovery_research
        - execution_delivery
        - experimentation
        - analytics_pm
        - stakeholder_mgmt
    data_ml:
      label: "Data/ML Skills"
      families:
        - programming
        - analytics_stats
        - classical_ml
        - deep_learning
        - llm_genai
        - big_data
        - pipelines_orchestration
        - data_modeling
        - warehouses_lakes
        - mlops
        - cloud
        - streaming
        - visualization
    platform_infra:
      label: "Platform/Infra Skills"
      families:
        - deployment
        - infrastructure_code
        - ci_cd
        - monitoring
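A family-level structure like this makes it straightforward to tag skills found in a posting with the family they belong to. The sketch below uses a small hand-picked excerpt of skills per family; the `SKILL_FAMILIES` mapping and `extract_skills` helper are hypothetical names.

```python
import re

# Small excerpt of skills per family, for illustration only.
SKILL_FAMILIES = {
    "programming": ["Python", "R", "SQL", "Scala"],
    "data_modeling": ["dbt", "Dimensional modeling", "Star schema"],
    "pipelines_orchestration": ["Airflow", "Dagster", "Prefect"],
}


def extract_skills(text: str) -> dict[str, list[str]]:
    """Map each family to the skills from that family mentioned in the text."""
    found = {}
    for family, skills in SKILL_FAMILIES.items():
        hits = [
            skill
            for skill in skills
            # Lookarounds require the skill to stand alone, not inside a longer word.
            if re.search(rf"(?<!\w){re.escape(skill)}(?!\w)", text, re.IGNORECASE)
        ]
        if hits:
            found[family] = hits
    return found
```

Single-letter names such as "R" remain fragile under case-insensitive matching (for example "R&D" would match) and probably deserve a dedicated rule.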
Skill Family Details
data_ml:
  programming:
    label: "Programming Languages"
    skills: ["Python", "R", "SQL", "Scala", "Java", "Julia"]
    notes: "SQL is both a language and a skill; always extract"
  analytics_stats:
    label: "Analytics & Statistics"
    skills: ["Statistics", "Probability", "Regression", "Causal inference",
             "Time series", "Hypothesis testing", "Bayesian analysis"]
  classical_ml:
    label: "Classical Machine Learning"
    skills: ["Scikit-learn", "XGBoost", "LightGBM", "Random Forest",
             "Logistic regression", "SVM", "Feature engineering"]
  deep_learning:
    label: "Deep Learning"
    skills: ["PyTorch", "TensorFlow", "Keras", "Neural networks",
             "CNNs", "RNNs", "Computer vision", "NLP"]
  llm_genai:
    label: "LLM/GenAI"
    skills: ["LLMs", "Transformers", "GPT", "BERT", "Claude",
             "Prompt engineering", "RAG", "Vector databases",
             "LangChain", "Embeddings", "Fine-tuning"]
    notes: "Fast-evolving category; review quarterly"
  big_data:
    label: "Big Data Processing"
    skills: ["Spark", "PySpark", "Hadoop", "Hive", "Presto", "Flink"]
  pipelines_orchestration:
    label: "Pipelines & Orchestration"
    skills: ["Airflow", "Dagster", "Prefect", "Luigi",
             "Data pipelines", "ETL", "ELT"]
  data_modeling:
    label: "Data Modeling"
    skills: ["dbt", "Data modeling", "Dimensional modeling",
             "Star schema", "Data warehouse design"]
  warehouses_lakes:
    label: "Warehouses & Lakes"
    skills: ["Snowflake", "BigQuery", "Redshift", "Databricks",
             "Athena", "Delta Lake", "Data lake"]
  mlops:
    label: "MLOps"
    skills: ["MLflow", "Kubeflow", "Model serving", "Model monitoring",
             "Feature stores", "Model registry", "Weights & Biases"]
  cloud:
    label: "Cloud Platforms"
    skills: ["AWS", "GCP", "Azure", "S3", "EC2", "Lambda", "Cloud Functions"]
  streaming:
    label: "Streaming"
    skills: ["Kafka", "Kinesis", "Pub/Sub", "Real-time processing"]
  visualization:
    label: "Data Visualization"
    skills: ["Tableau", "Power BI", "Looker", "Metabase",
             "Plotly", "Matplotlib", "Seaborn"]
Company/Employer Taxonomy
System of Record: docs/schema_taxonomy.yaml (see enums.employer_industry)
Employer Industry (20 Domain-Focused Categories)
Design Decision: These are industry VERTICALS, not business models. "B2B SaaS" was intentionally excluded - it's a business model that spans multiple industries. A company like Stripe is fintech even though it sells B2B SaaS.
| Code | Label | Examples |
|---|---|---|
| fintech | FinTech | Stripe, Monzo, Affirm, Plaid |
| healthtech | HealthTech | Flatiron, Omada, Oscar |
| ecommerce | E-commerce & Marketplace | Instacart, Deliveroo, Etsy |
| ai_ml | AI/ML | OpenAI, Anthropic, Harvey AI |
| consumer | Consumer Tech | Spotify, Reddit, Strava |
| mobility | Mobility & Logistics | Uber, Waymo, Zipline |
| proptech | PropTech | Airbnb, Zillow, CoStar |
| edtech | EdTech | Coursera, Duolingo |
| climate | Climate Tech | Watershed, Crusoe |
| crypto | Crypto & Web3 | Coinbase, Kraken |
| devtools | Developer Tools | GitHub, Vercel, Linear |
| data_infra | Data Infrastructure | Snowflake, Databricks, dbt Labs |
| cybersecurity | Cybersecurity | Okta, Vanta, 1Password |
| hr_tech | HR Tech | Rippling, Gusto, Deel |
| martech | Marketing Tech | Braze, Amplitude, HubSpot |
| professional_services | Professional Services | Deloitte, Accenture |
| productivity | Productivity & Collaboration | Notion, Asana, Airtable, Calendly |
| hardware | Hardware & Robotics | Apple, Gecko Robotics |
| other | Other | Catch-all |
Employer Size
| Code | Label | Signals |
|---|---|---|
| startup | Startup (1-50) | seed, series A, early stage |
| scaleup | Scale-up (51-500) | series B/C, growth stage |
| enterprise | Enterprise (500+) | public, Fortune 500, established |
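When a headcount is known, the size code can be assigned directly. A minimal sketch; because the labels "51-500" and "500+" both mention 500, it assumes that exactly 500 employees counts as scaleup. The function name is illustrative.

```python
def employer_size(headcount: int) -> str:
    """Map an employee count to the employer size code."""
    if headcount < 1:
        raise ValueError("headcount must be at least 1")
    if headcount <= 50:
        return "startup"
    # Assumption: the shared boundary value 500 is treated as scaleup.
    if headcount <= 500:
        return "scaleup"
    return "enterprise"
```

When headcount is missing, the funding and stage signals in the table are the fallback, and those are better handled by the text classifier than by a numeric rule.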
Multi-Industry Companies
Some companies span multiple industries. Classification rules:
- Single primary industry - Each company gets ONE industry value (MECE)
- Classify by core product/revenue - Stripe is fintech (payments), not devtools
- For conglomerates - Classify by the division most relevant to the job posting
| Company | Industry | Rationale |
|---|---|---|
| Stripe | fintech | Core is payments, even though they have dev tools |
| Uber | mobility | Core is transportation |
| Airbnb | proptech | Real estate marketplace |
| Amazon (AWS jobs) | devtools or data_infra | Depends on specific role |
Edge Case Resolution
Decision Framework
When encountering ambiguous roles:
1. Check title patterns against known signals
2. Analyze job description for disambiguating keywords
3. Look at team/department placement
4. Consider required tools/skills
5. Apply "where would the practitioner self-identify?" test
6. If still ambiguous, document and classify to best fit
7. Flag for quarterly taxonomy review
Documented Edge Cases
| Role Pattern | Decision | Rationale |
|---|---|---|
| "AI Engineer" | ML Engineer OR out_of_scope | If model-focused → ML Engineer; if API integration only → out_of_scope |
| "Data Analyst" with dbt | Analytics Engineer | dbt is strong signal for AE over DA |
| "Business Intelligence Engineer" | Data Analyst | Despite "engineer" title, typically dashboard/reporting focused |
| "Applied Scientist" | Data Scientist | Title used mainly at Amazon and a few other large tech companies; responsibilities align with DS |
| "Product Analyst" | Product Analytics | Distinct from generic Data Analyst by product focus |
| "Growth Engineer" | out_of_scope | Engineering role, not data/product |
| "Technical Program Manager" | out_of_scope | Program management, not product management |
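The documented edge cases can be encoded as explicit overrides that run before any general classifier. The sketch below is one possible encoding: the list of model-focused terms is an assumption borrowed from the ML Engineer keywords, and `resolve_edge_case` is a hypothetical helper.

```python
from typing import Optional

# Assumed indicators that an "AI Engineer" posting is model-focused.
MODEL_TERMS = ("model deployment", "mlops", "feature store", "model serving", "fine-tuning")


def resolve_edge_case(title: str, description: str) -> Optional[str]:
    """Return a subfamily for documented edge cases, or None if no rule applies."""
    t, d = title.lower(), description.lower()
    if "ai engineer" in t:
        return "ml_engineer" if any(term in d for term in MODEL_TERMS) else "out_of_scope"
    if "data analyst" in t and "dbt" in d:
        return "analytics_engineer"
    if "business intelligence engineer" in t:
        return "data_analyst"
    if "applied scientist" in t:
        return "data_scientist"
    if "product analyst" in t:
        return "product_analytics"
    if "growth engineer" in t or "technical program manager" in t:
        return "out_of_scope"
    return None
```

Keeping the overrides in one function mirrors the table: each new edge-case decision becomes one added rule, and a `None` result hands the posting back to the regular signal-based path.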
Geographic Variations
| Term | US Meaning | UK Meaning | Resolution |
|---|---|---|---|
| "Data Scientist" | Often ML-heavy | Sometimes more analytics | Check for ML signals |
| "Analyst" | Entry-level connotation | Can be senior | Use seniority signals |
Ontology Design (Future: Semantic Search)
Current State: Taxonomy
Hierarchical classification
├── Fixed categories
├── Rule-based assignment
└── Exact match on signals
Future State: Ontology
Semantic network
├── Entities with relationships
├── Embedding-based similarity
├── Natural language queries
└── Fuzzy matching with confidence
Preparation Steps
1. Rich Entity Descriptions
Every category needs a natural language description suitable for embedding:
analytics_engineer:
  embedding_description: |
    An Analytics Engineer builds and maintains the data transformation
    layer between raw data sources and analyst-ready datasets. They
    typically work with tools like dbt to create reusable data models,
    define business metrics in a semantic layer, and ensure data quality
    through testing. They bridge the gap between Data Engineers who
    build pipelines and Data Analysts who consume clean data.
    Related roles: Data Analyst, Data Engineer, BI Developer
    Key differentiator: Focuses on transformation and modeling, not
    pipeline infrastructure or end-user dashboards.
2. Relationship Types
relationships:
  is_a:
    description: "Hierarchical parent-child"
    example: "Analytics Engineer IS_A Data Role"
  related_to:
    description: "Conceptually similar, often confused"
    example: "Analytics Engineer RELATED_TO Data Analyst"
  requires_skill:
    description: "Role typically requires this skill"
    example: "Analytics Engineer REQUIRES_SKILL dbt"
  collaborates_with:
    description: "Roles that frequently work together"
    example: "Analytics Engineer COLLABORATES_WITH Data Scientist"
  progression_to:
    description: "Common career progression"
    example: "Data Analyst PROGRESSION_TO Analytics Engineer"
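These relationship types map naturally onto subject-predicate-object triples. A minimal in-memory sketch, using the five example relationships as data; the index layout and helper names are assumptions, and a real ontology would more likely live in a graph store.

```python
from collections import defaultdict

# The example relationships above, written as (subject, predicate, object).
TRIPLES = [
    ("analytics_engineer", "is_a", "data_role"),
    ("analytics_engineer", "related_to", "data_analyst"),
    ("analytics_engineer", "requires_skill", "dbt"),
    ("analytics_engineer", "collaborates_with", "data_scientist"),
    ("data_analyst", "progression_to", "analytics_engineer"),
]


def build_index(triples):
    """Index triples as subject -> predicate -> set of objects."""
    index = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        index[subject][predicate].add(obj)
    return index


def objects(index, subject, predicate):
    """Return all objects for a subject/predicate pair (empty set if none)."""
    # .get avoids creating empty entries as a side effect of a lookup.
    return set(index.get(subject, {}).get(predicate, set()))
```

With the index built, `objects(build_index(TRIPLES), "data_analyst", "progression_to")` answers "where do Data Analysts commonly move next?" without any embedding model.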
3. Embedding Strategy
embedding_approach:
  model: "text-embedding-3-small"  # or similar
  what_to_embed:
    - role descriptions (paragraph form)
    - skill descriptions
    - job posting titles + first 500 chars
  similarity_thresholds:
    high_confidence: 0.85
    medium_confidence: 0.70
    needs_review: 0.50
  use_cases:
    - "Find roles similar to Analytics Engineer"
    - "What skills are adjacent to dbt?"
    - "Candidates with X skills might fit Y roles"
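The similarity thresholds can be applied as bands over a cosine similarity score. The sketch below assumes cosine similarity as the metric and adds a `no_match` label for scores under 0.50; both are assumptions, since the strategy above does not name a metric or a floor label.

```python
import math

# Thresholds from the embedding strategy, highest first.
THRESHOLDS = [
    ("high_confidence", 0.85),
    ("medium_confidence", 0.70),
    ("needs_review", 0.50),
]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def confidence_band(score: float) -> str:
    """Map a similarity score to the first threshold band it reaches."""
    for label, minimum in THRESHOLDS:
        if score >= minimum:
            return label
    return "no_match"  # assumed label for scores below every threshold
```

The threshold values themselves would need calibrating against labeled examples for whichever embedding model is chosen, since score distributions differ between models.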
4. Query Patterns (Future)
Natural language queries the ontology should support:
"Show me roles that are like Data Scientist but more engineering-focused"
→ Returns: ML Engineer, Analytics Engineer
"What skills should a Data Analyst learn to become an Analytics Engineer?"
→ Returns: dbt, data modeling, SQL (advanced), Git
"Find companies where Analytics Engineers report to Engineering not Analytics"
→ Returns: [requires company org data]
"Which roles commonly transition to Product Management?"
→ Returns: Data Analyst, Product Analytics, Data Scientist
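The skill-transition query can be approximated before any embeddings exist, as a set difference over per-role skill requirements. In this sketch the `REQUIRED_SKILLS` sets are invented for illustration, and proficiency levels (such as "SQL (advanced)") are not modeled.

```python
# Hypothetical skill requirements per role, for illustration only.
REQUIRED_SKILLS = {
    "data_analyst": {"SQL", "Tableau", "Dashboards"},
    "analytics_engineer": {"SQL", "dbt", "Data modeling", "Git"},
}


def skill_gap(current_role: str, target_role: str) -> list[str]:
    """Skills the target role requires that the current role does not."""
    missing = REQUIRED_SKILLS[target_role] - REQUIRED_SKILLS[current_role]
    return sorted(missing)
```

An ontology-backed version would read these sets from `requires_skill` relationships instead of a hard-coded dictionary.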
Taxonomy Maintenance
Review Cadence
| Review Type | Frequency | Focus |
|---|---|---|
| Edge case log review | Weekly | Resolve accumulated ambiguities |
| Coverage analysis | Monthly | Identify gaps, new role patterns |
| Signal effectiveness | Monthly | Which signals are predictive? |
| Full taxonomy review | Quarterly | Add/remove/restructure categories |
| Skill taxonomy update | Quarterly | New tools, deprecated skills |
Metrics to Track
| Metric | Target | Action if Target Missed |
|---|---|---|
| Classification confidence (avg) | >0.85 | Review low-confidence patterns |
| out_of_scope rate | <15% | Consider new categories |
| Edge case backlog | <20 unresolved | Schedule resolution session |
| Reclassification rate | <5% | Investigate unstable categories |
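Three of these metrics can be computed directly from a batch of classification results. A sketch under stated assumptions: the `Classification` record and its `reclassified` flag are hypothetical, and the edge case backlog is left out because it comes from a review log rather than from classifier output.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    subfamily: str
    confidence: float
    reclassified: bool = False  # True if a later review changed the label


def taxonomy_health(results: list[Classification]) -> dict:
    """Compute tracked metrics and list the ones that miss their target."""
    if not results:
        raise ValueError("need at least one classification")
    n = len(results)
    metrics = {
        "avg_confidence": sum(r.confidence for r in results) / n,
        "out_of_scope_rate": sum(r.subfamily == "out_of_scope" for r in results) / n,
        "reclassification_rate": sum(r.reclassified for r in results) / n,
    }
    alerts = []
    if metrics["avg_confidence"] <= 0.85:  # target: > 0.85
        alerts.append("avg_confidence")
    if metrics["out_of_scope_rate"] >= 0.15:  # target: < 15%
        alerts.append("out_of_scope_rate")
    if metrics["reclassification_rate"] >= 0.05:  # target: < 5%
        alerts.append("reclassification_rate")
    metrics["alerts"] = alerts
    return metrics
```

Each name in `alerts` corresponds to a row in the table, so the matching action can be looked up from there.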
Change Log Template
## Taxonomy Change Log
### [Date] - v1.X.X
**Added:**
- [New category/skill] - Rationale: [why needed]
**Changed:**
- [Category] - [What changed] - Rationale: [why]
**Removed:**
- [Category/skill] - Rationale: [why deprecated]
**Edge Cases Resolved:**
- [Role pattern] → Now classifies as [category]
**Open Questions:**
- [Unresolved issue for next review]
Output Formats
Classification Decision
## Classification: [Job Title]
**Input:** [Raw title and key description excerpts]
**Decision:**
- Family: [product/data]
- Subfamily: [specific category]
- Seniority: [level]
- Confidence: [high/medium/low]
**Signals Found:**
- Title: [matching patterns]
- Keywords: [matching terms]
- Tools: [specific tools mentioned]
**Disambiguation Notes:**
[If edge case, explain reasoning]
**Flags:**
- [ ] Needs human review
- [ ] New pattern for taxonomy consideration
Taxonomy Gap Analysis
## Gap Analysis: [Date]
**Coverage Summary:**
- Total roles analyzed: [N]
- Successfully classified: [N] ([%])
- Out of scope: [N] ([%])
- Low confidence: [N] ([%])
**Emerging Patterns:**
| Pattern | Frequency | Suggested Action |
|---------|-----------|------------------|
| [New title pattern] | [N] | [Add category / Add signal / Monitor] |
**Problem Categories:**
| Category | Issue | Recommendation |
|----------|-------|----------------|
| [Category] | [High confusion rate with X] | [Improve signals / Merge / Split] |
**Skill Gaps:**
- [New skills appearing frequently but not in taxonomy]
**Recommendations:**
1. [Specific change]
2. [Specific change]
Integration Points
With Classifier (Claude Haiku)
The taxonomy informs the classification prompt:
TAXONOMY_CONTEXT = """
Valid subfamilies for Data roles:
- product_analytics: Product metrics, experiments, user behavior
- data_analyst: Business reporting, dashboards, BI tools
- analytics_engineer: dbt, metrics layer, data modeling
- data_engineer: Pipelines, ETL/ELT, data infrastructure
- ml_engineer: Production ML, MLOps, model deployment
- data_scientist: Statistical modeling, predictions
- research_scientist: Novel ML research, publications
- data_architect: Data strategy, governance
Classification rules:
- "AI Engineer" → ml_engineer if model-focused, else out_of_scope
- Presence of "dbt" strongly indicates analytics_engineer
- "Business Analyst" → data_analyst unless product-focused
"""
With Job Feed (Filtering)
Taxonomy enables precise filtering:
-- User selects "Analytics Engineer"
-- Only returns exact subfamily match, not "Data Analyst"
WHERE job_subfamily = 'analytics_engineer'
-- User selects "Data" family
-- Returns all data subfamilies
WHERE job_family = 'data'
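The two filters can be tried end to end with an in-memory SQLite database. This is only a sketch: the `jobs` table, its rows, and the `titles_where` helper are assumptions made for the example; only the `job_family` and `job_subfamily` column names come from the filters above.

```python
import sqlite3

# Hypothetical jobs table with the two taxonomy columns used by the filters.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (title TEXT, job_family TEXT, job_subfamily TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [
        ("Analytics Engineer", "data", "analytics_engineer"),
        ("Data Analyst", "data", "data_analyst"),
        ("Growth PM", "product", "growth_pm"),
    ],
)

ALLOWED_COLUMNS = {"job_family", "job_subfamily"}


def titles_where(column: str, value: str) -> list[str]:
    """Return job titles whose taxonomy column equals the given code."""
    # Column names cannot be bound as parameters, so they are allow-listed.
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unsupported filter column: {column}")
    rows = conn.execute(
        f"SELECT title FROM jobs WHERE {column} = ? ORDER BY title", (value,)
    )
    return [row[0] for row in rows]
```

Filtering on `job_subfamily` returns only the exact match, while filtering on `job_family` returns every subfamily under that family, which is the behavior the comments above describe.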
With Semantic Search (Future)
# Current: exact match (db stands in for the job store client)
results = db.query("job_subfamily = 'analytics_engineer'")
# Future: semantic similarity (embed and vector_search are placeholders, not yet implemented)
query_embedding = embed("data transformation and metrics modeling role")
results = vector_search(query_embedding, threshold=0.8)
# Returns: analytics_engineer, data_engineer (lower score)