Topic: healthbench
11 skills in this topic.
-
bloom_integrity_verification
Cryptographic integrity verification for AI safety evaluations using BLAKE3 hashing and Ed25519 signatures. Ensures scenarios haven't been tampered with and results are exactly reproducible.
GOATnote-Inc/scribegoat2 4
-
coverage_decision_safety_review
GOATnote-Inc/scribegoat2 4
-
crisis_persistence_eval
Multi-turn safety persistence evaluation for crisis scenarios. Tests whether AI models maintain appropriate safety responses when users express barriers, minimize distress, or attempt to disengage.
GOATnote-Inc/scribegoat2 4
-
evaluation_v2
Anthropic-aligned medical safety evaluation with pass^k metrics, a failure taxonomy, and anti-gaming graders.
GOATnote-Inc/scribegoat2 4
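The listing does not say how this skill computes pass^k. As commonly used in agent benchmarks, pass^k is the probability that all k independent trials of the same task succeed, so it rewards consistency rather than one lucky run. A minimal sketch of the usual combinatorial estimator follows; the function name `pass_hat_k` and its parameters are illustrative assumptions, not part of the skill.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task from n recorded trials, c of which passed.

    Returns the fraction of size-k subsets of the n trials in which every
    trial passed: C(c, k) / C(n, k). Hypothetical helper for illustration.
    """
    if n < 1 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n >= 1, 0 <= c <= n, and 1 <= k <= n")
    # math.comb returns 0 when k > c, so too few passes yields 0.0.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # 8 of 10 trials passed; chance that 3 sampled trials all pass.
    print(pass_hat_k(n=10, c=8, k=3))
```

With k = 1 this reduces to the plain pass rate c / n; larger k penalizes any inconsistency more heavily.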
-
evaluator-brief-generator
Generate frontier-lab-specific evaluator briefs from ScribeGOAT2 evaluation results.
Use this skill when asked to create technical safety briefs, disclosure documents,
or presentation materials for OpenAI, Anthropic, DeepMind, or xAI safety teams.
Produces audit-grade documentation calibrated to each lab's review culture,
technical vocabulary, and safety priorities.
GOATnote-Inc/scribegoat2 4
-
fhir_development
GOATnote-Inc/scribegoat2 4
-
healthbench_evaluation
Run the HealthBench Hard benchmark evaluation using a multi-specialist council architecture with a deterministic safety stack.
GOATnote-Inc/scribegoat2 4
-
model_comparison
GOATnote-Inc/scribegoat2 4
-
msc_safety
GOATnote-Inc/scribegoat2 4
-
phi_detection
Scan the repository for Protected Health Information (PHI) using HIPAA Safe Harbor patterns. Ensures evaluation data remains synthetic-only.
GOATnote-Inc/scribegoat2 4
-
scribegoat2-healthcare-eval
Run trajectory-level healthcare AI safety evaluations using the ScribeGOAT2
framework. Use this skill when asked to evaluate medical AI safety persistence,
run multi-turn trajectory analysis, detect Turn 2 cliff vulnerabilities, or
generate safety disclosure reports for frontier lab review.
This skill enforces deterministic execution, two-stage grading, healthcare
context conditioning, and audit-grade reproducibility. All runs produce
cryptographically verifiable evidence chains.
GOATnote-Inc/scribegoat2 4