Agent skill

sklearn-model-trainer

Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.

Stars 514
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/a5c-ai/babysitter/tree/main/library/specializations/data-science-ml/skills/sklearn-model-trainer

SKILL.md

Scikit-learn Model Trainer

Train machine learning models using scikit-learn with cross-validation, hyperparameter tuning, and pipeline construction.

Overview

This skill provides comprehensive capabilities for training machine learning models using scikit-learn. It supports the full model development workflow from data preprocessing through model training, evaluation, and serialization.

Capabilities

Model Training

  • Train classification models (LogisticRegression, RandomForest, SVM, etc.)
  • Train regression models (LinearRegression, GradientBoosting, etc.)
  • Train clustering models (KMeans, DBSCAN, etc.)
  • Support for ensemble methods (VotingClassifier, Stacking, etc.)

Cross-Validation

  • K-fold cross-validation
  • Stratified K-fold for imbalanced datasets
  • Time series split for temporal data
  • Leave-one-out and leave-p-out validation
  • Custom cross-validation strategies

Hyperparameter Tuning

  • GridSearchCV for exhaustive search
  • RandomizedSearchCV for random sampling
  • Halving search strategies for efficiency
  • Custom scoring functions
  • Multi-metric evaluation

Pipeline Construction

  • Feature preprocessing pipelines
  • Column transformers for heterogeneous data
  • Feature selection integration
  • Composite pipelines with caching

Model Serialization

  • Save models with joblib (recommended)
  • Pickle serialization
  • ONNX export for interoperability
  • Model versioning support

Prerequisites

Installation

bash
pip install scikit-learn>=1.0.0 joblib pandas numpy

Optional Dependencies

bash
# For ONNX export
pip install skl2onnx onnxruntime

# For additional preprocessing
pip install category_encoders imbalanced-learn

Usage Patterns

Basic Model Training

python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
import joblib

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
model.fit(X_train, y_train)

# Cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Save model
joblib.dump(model, 'model.joblib')

Pipeline with Preprocessing

python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

# Define preprocessing
numeric_features = ['age', 'income', 'score']
categorical_features = ['category', 'region']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Create full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier())
])

# Train
pipeline.fit(X_train, y_train)

Hyperparameter Tuning with GridSearchCV

python
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 10, None],
    'classifier__learning_rate': [0.01, 0.1, 0.2]
}

# Grid search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Get best model
best_model = grid_search.best_estimator_

Feature Selection

python
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.ensemble import RandomForestClassifier

# Method 1: SelectFromModel
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='median'
)
X_selected = selector.fit_transform(X_train, y_train)

# Method 2: Recursive Feature Elimination
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=10,
    step=1
)
X_rfe = rfe.fit_transform(X_train, y_train)

# Get selected features
selected_features = X.columns[rfe.support_].tolist()

Integration with Babysitter SDK

Task Definition Example

javascript
const sklearnTrainingTask = defineTask({
  name: 'sklearn-model-training',
  description: 'Train a scikit-learn model with cross-validation',

  inputs: {
    modelType: { type: 'string', required: true },
    trainDataPath: { type: 'string', required: true },
    targetColumn: { type: 'string', required: true },
    hyperparameters: { type: 'object', default: {} },
    cvFolds: { type: 'number', default: 5 },
    scoringMetric: { type: 'string', default: 'accuracy' }
  },

  outputs: {
    modelPath: { type: 'string' },
    cvScores: { type: 'array' },
    bestScore: { type: 'number' },
    featureImportances: { type: 'object' }
  },

  async run(inputs, taskCtx) {
    return {
      kind: 'skill',
      title: `Train ${inputs.modelType} model`,
      skill: {
        name: 'sklearn-model-trainer',
        context: {
          operation: 'train_with_cv',
          modelType: inputs.modelType,
          trainDataPath: inputs.trainDataPath,
          targetColumn: inputs.targetColumn,
          hyperparameters: inputs.hyperparameters,
          cvFolds: inputs.cvFolds,
          scoringMetric: inputs.scoringMetric
        }
      },
      io: {
        inputJsonPath: `tasks/${taskCtx.effectId}/input.json`,
        outputJsonPath: `tasks/${taskCtx.effectId}/result.json`
      }
    };
  }
});

Model Selection Guide

Classification Models

Model Use Case Pros Cons
LogisticRegression Binary/multiclass, interpretable Fast, interpretable Linear boundary
RandomForestClassifier General purpose Robust, handles nonlinearity Can overfit
GradientBoostingClassifier High accuracy needed State-of-art performance Slower training
SVC Small/medium datasets Effective in high dimensions Slow on large data
XGBClassifier Competition/production Fast, accurate Many hyperparameters

Regression Models

Model Use Case Pros Cons
LinearRegression Baseline, interpretable Simple, fast Assumes linearity
Ridge/Lasso Regularization needed Prevents overfitting Still linear
RandomForestRegressor General purpose Handles nonlinearity Can overfit
GradientBoostingRegressor High accuracy Excellent performance Slower
SVR Small datasets Robust to outliers Slow scaling

Best Practices

  1. Always Use Pipelines: Prevent data leakage by including preprocessing in pipelines
  2. Stratified Splits: Use stratified sampling for imbalanced classification
  3. Cross-Validation: Never tune hyperparameters on test data
  4. Feature Scaling: Apply appropriate scaling for distance-based models
  5. Random Seeds: Set random_state for reproducibility
  6. Model Persistence: Use joblib over pickle for large numpy arrays

References

Expand your agent's capabilities with these related and highly-rated skills.

a5c-ai/babysitter

gsd-tools

Central utility skill for GSD operations. Provides config parsing, slug generation, timestamps, path operations, and orchestrates calls to other specialized skills. Acts as the unified entry point that the original gsd-tools.cjs provided via its lib/ modules (commands, config, core, init).

514 31
Explore
a5c-ai/babysitter

model-profile-resolution

Resolve model profile (quality/balanced/budget) at orchestration start and map agents to specific models. Enables cost/quality tradeoffs by selecting appropriate AI models for each agent role.

514 31
Explore
a5c-ai/babysitter

verification-suite

Plan structure validation, phase completeness checks, reference integrity verification, and artifact existence confirmation. Provides the structured verification layer ensuring GSD artifacts are well-formed and complete.

514 31
Explore
a5c-ai/babysitter

state-management

STATE.md reading, writing, and field-level updates. Provides cross-session state persistence via .planning/STATE.md with structured fields for current task, completed phases, blockers, decisions, and quick tasks.

514 31
Explore
a5c-ai/babysitter

git-integration

Git commit patterns, formats, and conventions for GSD methodology. Provides atomic commits per task, structured commit messages, planning file commits, branch management, and milestone tag operations.

514 31
Explore
a5c-ai/babysitter

frontmatter-parsing

YAML frontmatter parsing and manipulation for .planning/ documents. Provides read, write, update, query, and validation operations on frontmatter blocks in GSD markdown artifacts.

514 31
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results