Agent skill
semantic-similarity
Semantic similarity computation for content relationships and intelligent discovery
Install this agent skill to your Project
npx add-skill https://github.com/a5c-ai/babysitter/tree/main/library/specializations/domains/business/knowledge-management/skills/semantic-similarity
Metadata
Additional technical details for this skill
- domain: business
- category: Knowledge Organization
- skill id: SK-020
- specialization: knowledge-management
SKILL.md
Semantic Similarity Skill
Overview
The Semantic Similarity skill computes and applies semantic relationships between content items in knowledge management systems. It uses embedding models and vector similarity techniques to support content discovery, recommendation, and organization beyond traditional keyword matching.
Capabilities
Document Embedding Generation
- Generate embeddings for documents and content
- Configure embedding models (OpenAI, Cohere, open-source)
- Implement batch embedding pipelines
- Manage embedding storage and retrieval
- Optimize embedding dimensions for use case
Sentence Transformer Models
- Configure sentence-transformers models
- Fine-tune models for domain-specific content
- Implement multi-lingual embedding models
- Design model selection strategies
Similarity Search and Clustering
- Implement vector similarity search (cosine, dot product)
- Configure approximate nearest neighbor (ANN) algorithms
- Design content clustering pipelines
- Implement hierarchical clustering for organization
Related Content Recommendation
- Build content recommendation systems
- Configure "More Like This" functionality
- Implement collaborative filtering with embeddings
- Design hybrid recommendation approaches
Duplicate Detection
- Identify duplicate and near-duplicate content
- Configure similarity thresholds for detection
- Implement deduplication workflows
- Design merge and consolidation strategies
Topic Modeling
- Implement LDA (Latent Dirichlet Allocation)
- Configure BERTopic for modern topic modeling
- Design topic hierarchies and taxonomies
- Enable dynamic topic tracking
Semantic Search Integration
- Configure semantic search pipelines
- Implement hybrid search (keyword + semantic)
- Design query expansion using embeddings
- Enable cross-lingual semantic search
Content Gap Analysis
- Identify missing content through similarity analysis
- Map content coverage using embeddings
- Detect underserved topics and areas
- Design content planning recommendations
Concept Extraction
- Extract key concepts from documents
- Build concept graphs from embeddings
- Implement keyphrase extraction
- Design concept tagging pipelines
Dependencies
- Sentence-transformers library
- OpenAI Embeddings API
- Cohere Embed API
- Pinecone vector database
- Weaviate
- Milvus
- FAISS (Facebook AI Similarity Search)
- scikit-learn for clustering
Process Integration
This skill integrates with:
- search-optimization.js: Semantic search and related content features
- knowledge-base-content.js: Content recommendations and gap analysis
- tacit-to-explicit-conversion.js: Knowledge representation and concept extraction
Usage
Generate Document Embeddings
task: Generate embeddings for knowledge base content
skill: semantic-similarity
parameters:
  source: knowledge-base
  model: text-embedding-3-small
  batch_size: 100
  output: vector-store
  dimensions: 1536
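The `batch_size` parameter above implies that documents are sent to the embedding model in fixed-size groups. A minimal sketch of that batching loop follows; `embed_corpus`, `batched`, and `fake_embed_batch` are hypothetical names, and `fake_embed_batch` is only a stand-in for a real embedding call (for example an API client or a sentence-transformers model), not part of this skill.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_corpus(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 100,
) -> List[Vector]:
    """Embed all texts batch by batch, preserving input order."""
    vectors: List[Vector] = []
    for batch in batched(texts, batch_size):
        result = embed_batch(batch)
        # Guard against a backend that silently drops or duplicates rows.
        if len(result) != len(batch):
            raise RuntimeError("embedding backend returned a wrong-sized batch")
        vectors.extend(result)
    return vectors


def fake_embed_batch(batch: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in: a 2-d vector of (length, space count) per text."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]


docs = [f"document number {i}" for i in range(250)]
vectors = embed_corpus(docs, fake_embed_batch, batch_size=100)
print(len(vectors), [len(b) for b in batched(docs, 100)])
```

Keeping the embedding call behind a plain callable makes it straightforward to swap providers without touching the batching logic.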
Configure Similarity Search
task: Set up semantic similarity search
skill: semantic-similarity
parameters:
  vector_store: pinecone
  index_name: kb-embeddings
  similarity_metric: cosine
  top_k: 10
  hybrid_search: true
  keyword_weight: 0.3
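The source does not say how `keyword_weight` is applied. One plausible reading, shown here purely as an assumption, is a linear blend in which the keyword score gets weight 0.3 and the semantic score gets the remaining 0.7, with both scores already scaled to the range 0 to 1. The function names and sample scores are hypothetical.

```python
from typing import Dict, List, Tuple


def hybrid_score(keyword_score: float, semantic_score: float,
                 keyword_weight: float = 0.3) -> float:
    """Linear blend of two scores that are assumed to lie in [0, 1]."""
    if not 0.0 <= keyword_weight <= 1.0:
        raise ValueError("keyword_weight must be between 0 and 1")
    return keyword_weight * keyword_score + (1.0 - keyword_weight) * semantic_score


def top_k_hybrid(candidates: Dict[str, Tuple[float, float]],
                 keyword_weight: float = 0.3,
                 top_k: int = 10) -> List[Tuple[str, float]]:
    """Rank doc ids by blended score; candidates maps id -> (keyword, semantic)."""
    scored = [
        (doc_id, hybrid_score(kw, sem, keyword_weight))
        for doc_id, (kw, sem) in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Illustrative scores only, not measured data.
candidates = {
    "doc-a": (0.9, 0.2),  # strong keyword match, weak semantic match
    "doc-b": (0.1, 0.8),  # weak keyword match, strong semantic match
    "doc-c": (0.5, 0.5),
}
ranked = top_k_hybrid(candidates)
print(ranked)
```

With a keyword weight of 0.3, the semantically close `doc-b` outranks the keyword-heavy `doc-a`, which is the behaviour this weighting is meant to produce.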
Duplicate Detection
task: Identify duplicate content
skill: semantic-similarity
parameters:
  threshold: 0.92
  scope: all-documents
  output: duplicate-report.json
  action: flag_for_review
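Threshold-based duplicate detection as configured above can be sketched as an all-pairs cosine comparison that flags any pair scoring at or above 0.92. This is an illustrative sketch: the function names, report fields, and sample vectors are assumptions, and the quadratic pairwise loop is only practical for small collections.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate_pairs(embeddings: Dict[str, Sequence[float]],
                         threshold: float = 0.92) -> List[dict]:
    """Compare every pair of documents and flag those at or above threshold."""
    report = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            report.append({
                "doc_a": id_a,
                "doc_b": id_b,
                "similarity": round(score, 4),
                "action": "flag_for_review",
            })
    return report


# Toy 3-d vectors standing in for real document embeddings.
embeddings = {
    "faq-1": [0.9, 0.1, 0.0],
    "faq-1-copy": [0.88, 0.12, 0.01],
    "guide-7": [0.0, 0.2, 0.95],
}
report = find_duplicate_pairs(embeddings)
print(report)
```

For large collections, an approximate nearest neighbor index would typically replace the pairwise loop, with the same threshold applied to each document's nearest neighbors.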
Topic Modeling
task: Generate topic model for knowledge base
skill: semantic-similarity
parameters:
  method: bertopic
  min_topic_size: 10
  nr_topics: auto
  output: topic-model
  visualizations: true
Best Practices
- Choose appropriate embedding models - Match model to content type and language
- Normalize embeddings - L2-normalize vectors so that cosine and dot-product scores agree and stay comparable across documents
- Set appropriate thresholds - Tune similarity thresholds for your use case
- Implement hybrid search - Combine semantic and keyword search for best results
- Monitor embedding drift - Re-embed the full collection when the embedding model changes, since vectors produced by different models are not comparable
- Consider latency - Cache frequently used embeddings for performance
- Plan for scale - Use ANN indexes for large document collections
- Handle long documents - Implement chunking strategies for lengthy content
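To illustrate why normalization matters for consistent scores, the sketch below compares raw dot products with dot products of L2-normalized vectors. The vectors are made up for illustration: two documents point in the same direction but differ in magnitude by a factor of ten.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


short_doc = [1.0, 2.0, 2.0]
long_doc = [10.0, 20.0, 20.0]  # same direction as short_doc, 10x the magnitude
query = [2.0, 1.0, 2.0]

# Raw dot products reward magnitude, so the two documents score differently.
raw_scores = (dot(query, short_doc), dot(query, long_doc))

# After normalization, the dot product equals cosine similarity for both.
unit_query = l2_normalize(query)
unit_scores = (
    dot(unit_query, l2_normalize(short_doc)),
    dot(unit_query, l2_normalize(long_doc)),
)
print(raw_scores, unit_scores)
```

Storing unit-length vectors also lets a dot-product index serve cosine queries without recomputing norms at query time.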
Architecture Patterns
Basic Semantic Search Pipeline
Document -> Chunking -> Embedding -> Vector Store -> Query -> Results
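The pipeline above can be sketched end to end with an in-memory store. Everything here is an assumption made for illustration: `toy_embed` is a word-count stand-in that captures no real semantics, `InMemoryVectorStore` is a hypothetical class rather than a real vector database client, and the sample documents are invented.

```python
import math
from collections import Counter
from typing import List, Tuple


def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> List[str]:
    """Split text into overlapping word windows."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("chunk_size must be larger than overlap")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: sparse word counts, NOT a semantic model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical brute-force store; a real system would use an ANN index."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, str, Counter]] = []

    def add(self, doc_id: str, chunk: str, vector: Counter) -> None:
        self.rows.append((doc_id, chunk, vector))

    def query(self, vector: Counter, top_k: int = 3) -> List[Tuple[float, str, str]]:
        scored = [(cosine(vector, v), doc_id, chunk) for doc_id, chunk, v in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]


documents = {
    "vpn-guide": "To reset your VPN password open the self service portal "
                 "and choose reset password then confirm by email",
    "expense-policy": "Expense reports must be submitted within thirty days "
                      "and approved by your manager before reimbursement",
}

store = InMemoryVectorStore()
for doc_id, text in documents.items():
    for chunk in chunk_words(text):
        store.add(doc_id, chunk, toy_embed(chunk))

results = store.query(toy_embed("how do I reset my VPN password"), top_k=2)
for score, doc_id, chunk in results:
    print(f"{score:.3f} {doc_id}: {chunk}")
```

Replacing `toy_embed` with a real embedding model and `InMemoryVectorStore` with one of the vector databases listed under Dependencies keeps the same overall flow.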
Hybrid Search Architecture
Query -> [Keyword Search] -> Results
      -> [Semantic Search] -> Results
      -> [Reranking] -> Final Results
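The source does not specify how the keyword and semantic result lists are combined. One common option, used here as an assumed stand-in for the merge step, is reciprocal rank fusion (RRF), which needs only the rank positions from each list. The constant `k=60` is a conventional default, and the document ids are invented.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(result_lists: Sequence[Sequence[str]],
                           k: int = 60,
                           top_k: int = 10) -> List[str]:
    """Merge ranked lists: each appearance adds 1 / (k + rank) to a doc's score."""
    scores: Dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by doc id for determinism.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]


keyword_results = ["doc-a", "doc-b", "doc-c"]
semantic_results = ["doc-b", "doc-d", "doc-a"]
final_results = reciprocal_rank_fusion([keyword_results, semantic_results])
print(final_results)
```

Documents that appear in both lists rise to the top, and a dedicated reranking model could then reorder the fused candidates if higher precision is needed.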
Recommendation Pipeline
User Context -> Find Similar Content -> Filter by Metadata -> Personalize -> Recommend
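A minimal sketch of the stages above, under stated assumptions: the user context is represented as a single vector, the metadata filter is a set of allowed content spaces, and "personalize" is reduced to dropping items the user has already read. The catalog entries and field names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(context_vector: Sequence[float],
              catalog: List[Dict],
              allowed_spaces: Set[str],
              already_read: Set[str],
              top_k: int = 3) -> List[str]:
    # Find similar content: score every item against the user context.
    scored = [(cosine(context_vector, item["vector"]), item) for item in catalog]
    # Filter by metadata: keep only items from spaces the user may see.
    filtered = [(s, item) for s, item in scored if item["space"] in allowed_spaces]
    # Personalize (assumed rule): drop items the user has already read.
    fresh = [(s, item) for s, item in filtered if item["id"] not in already_read]
    # Recommend: highest similarity first, ties broken by id.
    fresh.sort(key=lambda pair: (-pair[0], pair[1]["id"]))
    return [item["id"] for _, item in fresh[:top_k]]


catalog = [
    {"id": "kb-101", "space": "it", "vector": [0.9, 0.1]},
    {"id": "kb-102", "space": "it", "vector": [0.8, 0.3]},
    {"id": "kb-201", "space": "hr", "vector": [0.95, 0.05]},
    {"id": "kb-103", "space": "it", "vector": [0.1, 0.9]},
]

recommendations = recommend(
    context_vector=[1.0, 0.0],
    catalog=catalog,
    allowed_spaces={"it"},
    already_read={"kb-101"},
)
print(recommendations)
```

In a production setup the metadata filter is usually pushed down into the vector store query so that filtered-out items never consume top-k slots.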
Metrics
Key metrics for semantic similarity systems:
| Metric | Description | Target |
|---|---|---|
| Retrieval Precision | Share of top-k results that are relevant | > 80% |
| Search Latency | Time to complete one similarity search | < 200ms |
| Duplicate Detection F1 | F1 score (balance of precision and recall) for duplicate detection | > 90% |
| Topic Coherence | Quality of topic models | > 0.5 |
| User Satisfaction | Relevance ratings | > 4.0/5.0 |
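Two of the metrics in the table can be computed directly from labeled data. The sketch below shows precision at k for retrieval and F1 for duplicate detection; the function names and sample judgments are illustrative assumptions, not measurements from a real system.

```python
from typing import FrozenSet, Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def duplicate_f1(predicted: Set[FrozenSet[str]], actual: Set[FrozenSet[str]]) -> float:
    """F1 over unordered document pairs: harmonic mean of precision and recall."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Illustrative relevance judgments for one query.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d4", "d9"}
p_at_5 = precision_at_k(retrieved, relevant, k=5)

# Pairs are frozensets so that (a, b) and (b, a) count as the same pair.
predicted_pairs = {frozenset(p) for p in [("a", "b"), ("c", "d"), ("e", "f")]}
actual_pairs = {frozenset(p) for p in [("b", "a"), ("c", "d"), ("g", "h")]}
f1 = duplicate_f1(predicted_pairs, actual_pairs)
print(p_at_5, round(f1, 4))
```

Averaging `precision_at_k` over a representative query set gives a figure that can be compared against the target in the table.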
Related Skills
- knowledge-graph (SK-008): Graph-based semantic relationships
- search-engine (SK-005): Enterprise search integration
- content-curation (SK-010): Quality-based content management
Related Agents
- kg-specialist (AG-008): Knowledge graph and semantic expertise
- search-expert (AG-004): Search optimization guidance
- knowledge-architect (AG-001): Overall KM strategy alignment