sklearn-model-trainer
Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.
Install this agent skill to your Project
npx add-skill https://github.com/a5c-ai/babysitter/tree/main/library/specializations/data-science-ml/skills/sklearn-model-trainer
Scikit-learn Model Trainer
Train machine learning models using scikit-learn with cross-validation, hyperparameter tuning, and pipeline construction.
Overview
This skill provides comprehensive capabilities for training machine learning models using scikit-learn. It supports the full model development workflow from data preprocessing through model training, evaluation, and serialization.
Capabilities
Model Training
- Train classification models (LogisticRegression, RandomForest, SVM, etc.)
- Train regression models (LinearRegression, GradientBoosting, etc.)
- Train clustering models (KMeans, DBSCAN, etc.)
- Support for ensemble methods (VotingClassifier, Stacking, etc.)
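As one illustration of the ensemble support listed above, the following is a minimal sketch of a soft-voting classifier. The synthetic dataset from `make_classification` and the choice of base estimators are assumptions made for the example, not part of the skill itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Soft voting averages the predicted class probabilities of the base models,
# so every base estimator must implement predict_proba.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

test_accuracy = ensemble.score(X_test, y_test)
print(f"Ensemble test accuracy: {test_accuracy:.3f}")
```

Soft voting tends to help most when the base models make different kinds of errors; with very similar models, the gain over the best single model is usually small.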
Cross-Validation
- K-fold cross-validation
- Stratified K-fold for imbalanced datasets
- Time series split for temporal data
- Leave-one-out and leave-p-out validation
- Custom cross-validation strategies
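To make the difference between these splitters concrete, here is a small sketch comparing `StratifiedKFold` and `TimeSeriesSplit`. The label array (90 negatives, 10 positives) is a made-up example chosen so that the fold composition is easy to check.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

# StratifiedKFold keeps the class ratio roughly constant in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print("Positives per validation fold:", positives_per_fold)

# TimeSeriesSplit only validates on samples that come after the training window,
# so the model never sees the future during training.
tscv = TimeSeriesSplit(n_splits=4)
time_splits = list(tscv.split(X))
for train_idx, test_idx in time_splits:
    print(f"train 0-{train_idx.max()}, validate {test_idx.min()}-{test_idx.max()}")
```

Either splitter object can be passed as the `cv` argument of `cross_val_score` or `GridSearchCV` in place of an integer.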
Hyperparameter Tuning
- GridSearchCV for exhaustive search
- RandomizedSearchCV for random sampling
- Halving search strategies for efficiency
- Custom scoring functions
- Multi-metric evaluation
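The random-sampling and multi-metric items above can be combined in a single search. The sketch below assumes a synthetic binary dataset and an arbitrary small search space; with several metrics, `refit` names the metric used to pick the final model.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(10, 50),  # sampled uniformly from [10, 50)
        "max_depth": [3, 5, None],        # sampled from a fixed list
    },
    n_iter=5,                             # number of sampled configurations
    cv=3,
    scoring={"accuracy": "accuracy", "f1": "f1"},
    refit="f1",                           # metric that selects best_estimator_
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean accuracy per candidate:", search.cv_results_["mean_test_accuracy"])
print("Mean F1 per candidate:", search.cv_results_["mean_test_f1"])
```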
Pipeline Construction
- Feature preprocessing pipelines
- Column transformers for heterogeneous data
- Feature selection integration
- Composite pipelines with caching
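For the caching item above, a pipeline can be given a `memory` location so that fitted transformers are reused when the same step is refit with identical parameters and data, which is mainly useful during hyperparameter searches. This is a minimal sketch; the temporary cache directory and the step names are illustrative choices.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

cache_dir = mkdtemp()  # temporary directory holding the cached transformers
pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=10)),
        ("clf", LogisticRegression(max_iter=1000)),
    ],
    memory=cache_dir,
)
pipe.fit(X, y)

# Inspect fitted steps through the pipeline, since caching fits clones
# of the transformer instances that were passed in.
n_selected = int(pipe.named_steps["select"].get_support().sum())
print(f"Features kept by SelectKBest: {n_selected}")

rmtree(cache_dir)  # clean up the cache when it is no longer needed
```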
Model Serialization
- Save models with joblib (recommended)
- Pickle serialization
- ONNX export for interoperability
- Model versioning support
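One simple way to approach versioning is to store the model together with metadata in a single joblib file. The dictionary layout, the `model_version` string, and the file name below are hypothetical conventions for this sketch, not something prescribed by scikit-learn or by this skill.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a small model (assumptions for illustration).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Bundle the model with the library version used to train it.
# The keys of this dict are an illustrative convention.
artifact = {
    "model": model,
    "sklearn_version": sklearn.__version__,
    "model_version": "1.0.0",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model-v1.0.0.joblib")
    joblib.dump(artifact, path)
    restored = joblib.load(path)

# Compare the recorded version with the running one before trusting the model.
versions_match = restored["sklearn_version"] == sklearn.__version__
same_predictions = bool((restored["model"].predict(X) == model.predict(X)).all())
print(f"Versions match: {versions_match}, predictions identical: {same_predictions}")
```

Recording the scikit-learn version matters because a pickled estimator is not guaranteed to load correctly under a different library version. Joblib and pickle files should only be loaded from trusted sources, since loading can execute arbitrary code.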
Prerequisites
Installation
pip install "scikit-learn>=1.0.0" joblib pandas numpy
Optional Dependencies
# For ONNX export
pip install skl2onnx onnxruntime
# For additional preprocessing
pip install category_encoders imbalanced-learn
Usage Patterns
Basic Model Training
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
import joblib
# X (feature matrix) and y (class labels) are assumed to be loaded already
# Split data, preserving the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Train model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
model.fit(X_train, y_train)
# Cross-validation (fits fresh clones of the model on each training fold)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(f"CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
# Evaluate on the held-out test set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
# Save model
joblib.dump(model, 'model.joblib')
Pipeline with Preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
# Define preprocessing
numeric_features = ['age', 'income', 'score']
categorical_features = ['category', 'region']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
# Create full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier())
])
# Train
pipeline.fit(X_train, y_train)
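The pipeline above selects columns by name, so it expects a pandas DataFrame as input. The following sketch rebuilds the same pipeline on a tiny made-up DataFrame to show how missing numeric values and a category unseen during training are handled; all data values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data using the column names from the pipeline above.
train = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "income": [40_000, 52_000, 61_000, np.nan, 83_000, 58_000],
    "score": [0.2, 0.5, 0.4, 0.9, 0.8, 0.6],
    "category": ["a", "b", "a", "b", "b", "a"],
    "region": ["north", "south", "south", "north", "north", "south"],
})
y_train = [0, 0, 0, 1, 1, 1]

numeric_features = ["age", "income", "score"]
categorical_features = ["category", "region"]

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), numeric_features),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(train, y_train)

# A new row with a missing income and a region never seen during training.
new_rows = pd.DataFrame({
    "age": [29], "income": [np.nan], "score": [0.7],
    "category": ["a"], "region": ["west"],
})
predictions = pipeline.predict(new_rows)

# 3 scaled numeric columns + 2 category levels + 2 region levels.
n_encoded_columns = pipeline.named_steps["preprocessor"].transform(new_rows).shape[1]
print(predictions, n_encoded_columns)
```

Because `handle_unknown='ignore'` is set, the unseen region is encoded as all zeros instead of raising an error, and the missing income is filled with the median learned from the training data.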
Hyperparameter Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
# Define parameter grid (the 'classifier__' prefix targets the pipeline step of that name)
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [3, 5, 10, None],
    'classifier__learning_rate': [0.01, 0.1, 0.2]
}
# Grid search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=2
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
# Get best model
best_model = grid_search.best_estimator_
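After a search finishes, it is often useful to look at all candidates and then score the refitted best model once on held-out data. The sketch below does this on a synthetic dataset with a deliberately small grid; both are assumptions to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data and a small grid (assumptions for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    {"n_estimators": [20, 50], "max_depth": [2, 3]},
    cv=3,
    scoring="f1_weighted",
)
grid_search.fit(X_train, y_train)

# cv_results_ holds one row per parameter combination.
results = pd.DataFrame(grid_search.cv_results_)
ranking = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(ranking.to_string(index=False))

# The test set is used once, after tuning, for an unbiased estimate.
test_f1 = f1_score(y_test, grid_search.predict(X_test), average="weighted")
print(f"Held-out weighted F1: {test_f1:.3f}")
```

Comparing `std_test_score` across the top candidates helps judge whether the best configuration is meaningfully better or within fold-to-fold noise.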
Feature Selection
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.ensemble import RandomForestClassifier
# Method 1: SelectFromModel (keeps features whose importance is at least the median)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='median'
)
X_selected = selector.fit_transform(X_train, y_train)
# Method 2: Recursive Feature Elimination (drops one feature per iteration)
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=10,
    step=1
)
X_rfe = rfe.fit_transform(X_train, y_train)
# Get selected feature names (assumes X_train is a pandas DataFrame)
selected_features = X_train.columns[rfe.support_].tolist()
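Feature selection that uses the target is itself a fitted step, so running it inside a pipeline keeps it from seeing the validation folds during cross-validation. This is a minimal sketch under assumed synthetic data; the step names and the final `LogisticRegression` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data with 5 informative features out of 20 (assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

selection_pipeline = Pipeline(steps=[
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=42),
        threshold="median",
    )),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The selector is refit on each training fold, so validation data never
# influences which features are kept.
scores = cross_val_score(selection_pipeline, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")

selection_pipeline.fit(X, y)
n_kept = int(selection_pipeline.named_steps["select"].get_support().sum())
print(f"Features kept: {n_kept} of {X.shape[1]}")
```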
Integration with Babysitter SDK
Task Definition Example
const sklearnTrainingTask = defineTask({
  name: 'sklearn-model-training',
  description: 'Train a scikit-learn model with cross-validation',
  inputs: {
    modelType: { type: 'string', required: true },
    trainDataPath: { type: 'string', required: true },
    targetColumn: { type: 'string', required: true },
    hyperparameters: { type: 'object', default: {} },
    cvFolds: { type: 'number', default: 5 },
    scoringMetric: { type: 'string', default: 'accuracy' }
  },
  outputs: {
    modelPath: { type: 'string' },
    cvScores: { type: 'array' },
    bestScore: { type: 'number' },
    featureImportances: { type: 'object' }
  },
  async run(inputs, taskCtx) {
    return {
      kind: 'skill',
      title: `Train ${inputs.modelType} model`,
      skill: {
        name: 'sklearn-model-trainer',
        context: {
          operation: 'train_with_cv',
          modelType: inputs.modelType,
          trainDataPath: inputs.trainDataPath,
          targetColumn: inputs.targetColumn,
          hyperparameters: inputs.hyperparameters,
          cvFolds: inputs.cvFolds,
          scoringMetric: inputs.scoringMetric
        }
      },
      io: {
        inputJsonPath: `tasks/${taskCtx.effectId}/input.json`,
        outputJsonPath: `tasks/${taskCtx.effectId}/result.json`
      }
    };
  }
});
Model Selection Guide
Classification Models
| Model | Use Case | Pros | Cons |
|---|---|---|---|
| LogisticRegression | Binary/multiclass, interpretable | Fast, interpretable | Linear boundary |
| RandomForestClassifier | General purpose | Robust, handles nonlinearity | Can overfit |
| GradientBoostingClassifier | High accuracy needed | State-of-the-art performance | Slower training |
| SVC | Small/medium datasets | Effective in high dimensions | Slow on large data |
| XGBClassifier | Competition/production | Fast, accurate | Many hyperparameters; requires the separate xgboost package |
Regression Models
| Model | Use Case | Pros | Cons |
|---|---|---|---|
| LinearRegression | Baseline, interpretable | Simple, fast | Assumes linearity |
| Ridge/Lasso | Regularization needed | Prevents overfitting | Still linear |
| RandomForestRegressor | General purpose | Handles nonlinearity | Can overfit |
| GradientBoostingRegressor | High accuracy | Excellent performance | Slower |
| SVR | Small datasets | Robust to outliers | Slow scaling |
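The tables above are starting points; in practice, candidates are usually compared empirically under the same cross-validation scheme. The sketch below does this for three of the regression models on synthetic, mostly linear data (an assumption that naturally favors the linear models here).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data with a linear signal plus noise (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Score every candidate with the same folds and the same metric.
mean_r2 = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, score in sorted(mean_r2.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
```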
Best Practices
- Always Use Pipelines: Prevent data leakage by including preprocessing in pipelines
- Stratified Splits: Use stratified sampling for imbalanced classification
- Cross-Validation: Never tune hyperparameters on test data
- Feature Scaling: Apply appropriate scaling for distance-based models
- Random Seeds: Set random_state for reproducibility
- Model Persistence: Prefer joblib over pickle for models that hold large NumPy arrays