Tool Experimenter

Overview

This skill enables systematic experimentation and improvement of Bible study tools through extensive iterative testing. The process runs ~50+ experiments across multiple rounds before a tool is deemed production-ready, progressing from broad strategic experiments through deep refinement to final optimization. The skill orchestrates the bible-researcher agent to run experiments, evaluates outcomes, and iteratively refines approaches until diminishing returns are reached.

When to Use This Skill

Use this skill when:

User says "experiment with {tool-name}" or "run experiments on {tool-name}" or "improve {tool-name}"
A Bible study tool needs empirical validation and improvement
Different approaches to a research task need to be compared
A tool's effectiveness needs to be tested across diverse verse types

Experimentation Scale

Expected Experiment Count: ~51 runs minimum before production deployment

Breakdown:

Round 1 (Initial Broad Experiments): 3 approaches × 3 verses = 9 runs
Rounds 2-5 (Per-Experiment Refinement): 3 approaches × 4-6 verses × 3-4 revisions = 36-54 runs
Rounds 6-8 (Winner Deep Refinement): 5-10 additional verses × 2-3 revisions = 10-30 runs
Round 9 (Optimization): Remove elements while maintaining quality = 5-10 runs
Total: 60-103 runs before declaring tool production-ready

Key Files

Filenames are relative to project root

@bible-study-tools/TEMPLATE.md - The template and instructions for a tool's README.md file
@STANDARDIZATION.md - Naming conventions for bible books, versions, citations, languages, etc.
@REVIEW-GUIDELINES.md - Detailed instructions on how to review a biblical YAML file for accuracy

Core Workflow

Phase 1: Initialization and Planning

Load Tool Context
- Create and switch to a git branch following git-flow naming conventions
  - Example: feat/tool-sermon-illustrations
- Read ./bible-study-tools/{tool-name}/README.md (if exists) to understand:
  - The tool's purpose and research methodology
  - The current schema and data structure
  - Expected outputs and quality criteria
- Read ./bible-study-tools/{tool-name}/LEARNINGS.md (if exists) to understand:
  - Past experiment results
  - Known challenges and solutions
  - Techniques that worked well or poorly

If the tool does not exist yet:

If the tool does not exist, make sure you know what they are trying to achieve and create a new tool for it following the template in ./bible-study-tools/TEMPLATE.md

Create Experiment Infrastructure
- Create directory: ./bible-study-tools/{tool-name}/experiments/ (if not exists)
- Create subdirectory: ./bible-study-tools/{tool-name}/experiments/output/ for all YAML outputs
- This will contain all experimental outputs and variations
Design Initial Experiments
- Generate 3 fundamentally diverse experimental approaches that test different strategic directions
- Use deep thinking to identify truly different approaches, not just variations
Example: If the tool is about Greek words:
- Experiment A: Focus exclusively on Strong's concordance data
- Experiment B: Focus on morphological analysis (word forms, tenses, cases)
- Experiment C: Focus on semantic domains and conceptual relationships
Example: If the tool is asking for sermon illustrations:
- Experiment A: Find and extract from actual YouTube sermon transcripts for this verse
- Experiment B: Research cultural touchstones (movies, books, art, history) that illustrate the theme
- Experiment C: Search sermon illustration databases and compile existing pastor-tested content
Critical Guidelines:
- Each experiment should pursue a different philosophical approach
- Experiments should test different hypotheses about what makes the tool valuable
- Think broadly: different data sources, different organizational structures, different levels of detail
- Document each experiment's thesis/hypothesis clearly

Phase 2: Round 1 - Initial Broad Experiments

Goal: Establish 3 baseline approaches with initial data Target Runs: 9 (3 experiments × 3 verses)

For each of the 3 initial experiments:

Create Experiment Schema
- Copy the base README.md schema as a starting point
- Modify it according to the experiment's thesis
- Target: Maximum 150 lines (focus on what's unique to this experiment)
- Save as: ./bible-study-tools/{tool-name}/experiments/{experiment-name}/README-rev1.md
Test on Representative Verses
- Select 3 diverse test verses:
  - Verse 1: Well-known, rich context (e.g., John 3:16, John 1:1)
  - Verse 2: Moderate complexity (e.g., Matthew 5:3, Colossians 3:1)
  - Verse 3: Obscure or challenging (e.g., Habakkuk 3:9, Job 38:36)
Generate Data with bible-researcher Agent
- Invoke the bible-researcher agent in parallel for all 9 combinations (3 experiments × 3 verses) with:
  - The experiment's modified README.md
  - The verse reference
- Save outputs as: ./bible-study-tools/{tool-name}/experiments/output/{BOOK}-{CH:03d}-{VS:03d}-{experiment-name}-rev1.yaml

Initial Assessment

Review all 9 generated YAML files
Review agent feedback (Top 3 Insights, Challenges, Quality Metrics)
Identify major blockers or fundamental issues for each experiment
Create initial comparison: which experiments show promise, which have critical blockers?

Source/Method Optimization Analysis (Critical for Early Phases):

For each experiment, identify the primary data access method:
- Web Tools (WebFetch): Does the source have predictable URL schemas or params?
  - ✅ GOOD: https://example.com/verse/{book}/{chapter}/{verse} (can be templated)
  - ✅ GOOD: https://example.com/search?ref={book}+{chapter}:{verse} (clear params)
  - ❌ POOR: Requires complex navigation or search-then-click workflows
  - ❌ POOR: Inconsistent URL structures that can't be predicted
- WebSearch: Document which search query patterns work best
- MCP/Skills/Scripts: Evaluate if custom tooling would be more efficient
Decision Criteria: Prioritize experiments that use predictable, scalable access methods

Documentation: In LEARNINGS-round1.md, create table:

Experiment	Access Method	URL Pattern	Predictability	Scalability
Exp A	WebFetch	example.com/verse/{ref}	High	Excellent
Exp B	WebSearch	Various queries	Medium	Good
Exp C	Manual navigation	N/A	Low	Poor

Document in: ./bible-study-tools/{tool-name}/experiments/LEARNINGS-round1.md
Commit with: chore: round 1 initial experiments

Parallelization Strategy:

Run all 9 agents concurrently (3 experiments × 3 verses)
This significantly reduces total execution time

Completion Criteria: All 3 experiments have been tested on all 3 verses (9 runs complete)

Phase 3: Rounds 2-5 - Per-Experiment Iterative Refinement

Goal: Refine each experiment until confident it represents its approach well Target Runs: 36-54 additional runs (12-18 runs per experiment)

For each of the 3 experiments independently, iterate until the approach is solid:

Analyze Agent Feedback
- Read agent reports thoroughly: what worked, what struggled, quality scores
- Identify patterns across the test verses
- Prioritize fixes: blockers first, then quality improvements
- Ask: "Is this experiment's thesis viable, or are there fundamental barriers?"
Create Revision
- Update README to address issues (rev2, rev3, rev4, etc.)
- Document specific changes made and hypothesis for improvement
- Keep refinements aligned with experiment's core thesis
- Don't abandon the experiment's unique approach to converge toward another experiment
Test Refinement Across Rounds

Round 2 (first refinement):
- Re-run same 3 verses with updated README-rev2
- Compare quality vs. rev1: did issues get fixed?
- Agent quality scores should improve by 10-20%
Round 3 (second refinement):
- If Round 2 showed improvement: refine further based on new feedback
- If Round 2 failed to improve: try fundamentally different fixes
- Re-run same 3 verses or add 1-2 new similar verses
Round 4 (expansion):
- Add 2-3 NEW test verses to verify consistency beyond initial set
- Choose verses with similar characteristics to test robustness
- Example: If Habakkuk 3:9 was challenging, try other obscure verses (Nahum 2:1, Zephaniah 2:14)
Round 5 (stability check):
- Run latest revision on 3-5 additional verses
- Ensure approach works reliably across broader verse set
- Quality should be consistent (variation < 15%)
Refinement Stopping Criteria

Stop iterating on this experiment when:
- ✅ Quality scores consistently high (8+/10) across all test verses
- ✅ Agent feedback shows no major blockers or struggles
- ✅ Approach works reliably for its intended use case
- ✅ Diminishing returns (improvements between revisions < 5%)
OR when fundamental issues can't be fixed:
- ❌ Technical blockers (access denied, infrastructure limitations)
- ❌ Approach doesn't align with project goals after multiple attempts
- ❌ Quality ceiling is too low despite multiple refinement rounds
Document Refinement Learnings
- Update ./bible-study-tools/{tool-name}/experiments/LEARNINGS-round{N}.md for each major round
- Note what changed, why, and impact on quality
- Track revision history and rationale
Review Committee Tracking (Rounds 2-5):
- In early rounds (2-3), use broad review committee with many reviewers asking many questions:
  - Example reviewers: Scholarly Accuracy Reviewer, Translation Sensitivity Reviewer, Contextual Completeness Reviewer, Source Reliability Reviewer, Theological Balance Reviewer, AI Usability Reviewer, Practical Application Reviewer, Cultural Context Reviewer, etc.
  - Each asks 5-10 questions covering their domain
  - Goal: Cast wide net to catch all types of issues
- Document which reviewers found which issues:
  markdown
```
## Review Committee Results - Round 2

| Reviewer | Question | Issue Found | Fixed? | Impact |
|----------|----------|-------------|--------|--------|
| Scholarly Accuracy | "Are all dates verified?" | Found unverified claim about 586 BC | Yes | Medium |
| Source Reliability | "Can all sources be accessed?" | Found 3 broken URLs | Yes | High |
| AI Usability | "Is citation format consistent?" | Found 12 inconsistent citations | Yes | Low |
| Practical Application | "Would pastors use this?" | N/A - no issues | N/A | N/A |
| Cultural Context | "Are non-Western perspectives included?" | Limited to Western sources | No | Low |
```
- Track effectiveness: Which reviewers consistently find real issues? Which questions catch problems?
- Commit after each round: chore: round {N} refinements for {experiment-name}

Expected Completion: Each experiment should have 4-6 revisions tested on 6-12 verses total

Note: Experiments may complete at different rates. Some may reach quality in 3 rounds (9-15 runs), others may need 5-6 rounds (15-24 runs). Some may be abandoned due to fundamental blockers.

Phase 4: Round 6 - Cross-Experiment Evaluation and Selection

Goal: Determine which approach(es) to advance to production

Comprehensive Cross-Experiment Analysis

Compare final outputs from all experiments that completed refinement
Create detailed comparison table in ./bible-study-tools/{tool-name}/experiments/LEARNINGS.md:

Evaluation Criterion	Exp A	Exp B	Exp C	Notes
Data Extraction Success Rate	X/Y attempts	X/Y attempts	X/Y attempts	What blocked failures?
Average Quality Score	X.X/10	X.X/10	X.X/10	Across all test verses
Consistency Across Verses	High/Med/Low	High/Med/Low	High/Med/Low	Variance in quality
Source Accessibility	% accessible	% accessible	% accessible	Infrastructure barriers
Target Audience Fit	Rating	Rating	Rating	Would pastors/translators use this?
Methodology Adherence	%	%	%	Deviations required?
AI-Grounding Value	High/Med/Low	High/Med/Low	High/Med/Low	Context depth
Scalability	High/Med/Low	High/Med/Low	High/Med/Low	Can this work for all 31,000 verses?

Weight criteria based on project priorities
Calculate overall scores

Decision Point

Determine the next course of action:

Option A - Clear Winner:
- One experiment significantly outperforms others (2+ points higher on 10-point scale)
- Proceed to Phase 5 (Deep Refinement) with the winning approach only
- Archive other experiments for reference
Option B - Complementary Approaches:
- Multiple experiments provide different valuable insights that don't conflict
- Example: One excels at well-known verses, another at obscure ones
- Generate 2-3 "blend" experiments combining best elements
- Run blends through rounds similar to Phase 3 (another 12-24 runs)
Option C - Fundamentally Different Tools:
- Experiments reveal distinct, incompatible value propositions
- Split into multiple separate tools
- Each pursues its own approach independently
Option D - All Experiments Insufficient:
- None of the approaches meet quality threshold (all < 6/10)
- Generate 3 entirely new experimental approaches based on learnings
- Return to Phase 2 with new experiments (another 9+ runs)
Document Decision
- Update ./bible-study-tools/{tool-name}/experiments/LEARNINGS.md with:
  - Executive summary of winner/decision
  - Rationale for selection
  - What made winner superior
  - What was learned from non-winning experiments
- Commit: docs: round 6 evaluation and selection

Phase 5: Rounds 7-8 - Deep Refinement of Winner

Goal: Refine the winning approach to production quality Target Runs: 10-30 additional runs

Now that the strategic direction is clear, focus exclusively on making it excellent:

Expand Test Verse Coverage
- Test winning approach on 10-15 additional diverse verses
- Ensure coverage across:
  - Old Testament and New Testament
  - Narrative, poetry, prophecy, epistles, wisdom literature
  - Well-known and obscure passages
  - Theologically complex and straightforward verses
Structural Refinements (if needed)
- Experiment with different schema organizations
- Test different levels of detail or groupings
- Example: Flat vs. nested structures, different field names, optional vs. required fields
- Run 3-5 verses with each structural variant
- Select structure that best balances completeness and usability
Methodological Refinements
- Refine research techniques based on agent feedback
- Optimize source priorities (which to check first)
- Test different search strategies
- Example: Different website access patterns, query formulations
- Run 3-5 verses with each methodology variant
Prompt and Instruction Refinements (most granular)
- Experiment with different instruction phrasings in README
- Test different examples or constraints
- Example: Different ways to request citations, emphasis on thoroughness vs. conciseness
- Run 3-5 verses with each prompt variant
- Look for subtle quality improvements (5-10% gains)
Quality Consistency Check
- Final check: run 5-10 random verses to ensure:
  - Quality scores consistently 8.5+/10
  - Agent feedback is consistently positive
  - Outputs are immediately usable
  - No surprising failures or edge cases

Review Committee Optimization (Rounds 7-8)

Based on learnings from Rounds 2-5, optimize the review committee:

Analysis:

Review all "Review Committee Results" tables from previous rounds

Calculate effectiveness metrics:

Reviewer Effectiveness = (Issues Found / Questions Asked) × Average Impact
Question Effectiveness = (Times It Found Issues / Times It Was Asked)

Optimization:

Remove low-value reviewers: Those who found 0 issues across multiple rounds
Remove redundant questions: Questions that never caught issues
Keep high-value reviewers: Those who consistently find medium/high impact issues
Refine questions: Make questions more targeted based on what actually caught problems

Example Optimization:

markdown

## Optimized Review Committee - Rounds 7-8

**Removed (0 issues found across 4 rounds):**
- Practical Application Reviewer (all outputs were practical)
- Readability Reviewer (never found issues)

**Kept and Refined:**
- Source Reliability Reviewer → Focus on 2 key questions that caught 80% of issues:
  1. "Can all URLs be accessed programmatically?"
  2. "Are all factual claims traceable to sources?"
- Scholarly Accuracy Reviewer → Focus on 3 questions that found real problems:
  1. "Are all historical dates verified?"
  2. "Are theological claims balanced?"
  3. "Are language-specific claims checked against lexicons?"

**Result:** 8 reviewers → 3 reviewers, 64 questions → 12 focused questions

Document optimization rationale in LEARNINGS-round{N}.md
Goal: Least reviewers with most focused questions that catch most problems

Stopping Criteria:

Quality is consistently excellent (8.5+/10)
Agent feedback shows confidence and satisfaction
Outputs meet all project standards
Optimized review committee catches issues efficiently
Diminishing returns reached (improvements < 3% between iterations)

Expected Completion: Winner has been tested on 20-30 total verses with final schema/methodology

Phase 6: Round 9 - Optimization (Reduction While Maintaining Quality)

Goal: Remove unnecessary elements while maintaining output quality Target Runs: 5-10 runs

This phase determines what can be removed or simplified without losing quality:

Schema Optimization
- Identify optional fields that are rarely populated or low-value
- Test removing them: does quality drop?
- Example: If "historical_context.political_situation" is always empty or generic, remove it
- Run 3-5 verses without the field
- If quality maintained, keep it removed; if quality drops, restore it
Instruction Simplification
- Identify complex instructions in README that may be unnecessary
- Test simpler phrasing: does agent still produce quality output?
- Remove redundant guidance
- Run 3-5 verses with simplified README
- Aim for: shortest README that maintains 8.5+/10 quality
Source Optimization
- Identify which sources consistently provide value vs. those rarely helpful
- Test removing low-value sources from required list
- Does research become faster without quality loss?
Final Validation
- Run 5-10 random verses with optimized schema/README
- Ensure quality remains at 8.5+/10
- Confirm outputs are leaner but equally valuable
- Agent feedback should indicate process is streamlined, not stripped

Completion Criteria:

Tool is as simple as possible while maintaining quality
No unnecessary fields, instructions, or requirements
Efficient for agents to execute
Outputs are concise but complete

Phase 7: Synthesis and Production Deployment

Update Tool README.md
- Incorporate the final optimized approach
- Update schema with validated improvements
- Add stellar examples discovered during experiments to the examples section
- Ensure research methodology reflects proven effective techniques
- Include troubleshooting guidance based on challenges documented
- Target: 150-250 lines (concise but complete)
Update Tool Registry
- Register or update the tool in /bible-study-tools/tool-registry.yaml
- Based on experimentation results, determine appropriate metadata:
  - Scope: Based on final optimized file sizes
    - core: If files are consistently small (<50 KB) and essential for all queries
    - standard: If files are moderate (50-500 KB) and valuable for typical study
    - comprehensive: If files are large (>500 KB) or only needed for exhaustive analysis
  - Category: lexical|theological|practical|historical|linguistic|topical
  - Summary: Concise description (max 20 words) of what the tool provides
- Example registry entry:
  yaml
```
{tool-suffix}:
  name: {Tool Name}
  summary: {Brief description from experimentation}
  scope: {core|standard|comprehensive}
  category: {appropriate category}
```
Consolidate LEARNINGS.md
- Synthesize learnings from all experiment rounds into concise summary
- Create ./bible-study-tools/{tool-name}/LEARNINGS.md with format:
  - Round Summary: Brief overview of each round
  - What Worked Well: Successful techniques, helpful websites, effective schema fields
  - What Worked Poorly: Failed approaches, unhelpful sources, problematic structures
  - Key Insights: Breakthrough discoveries during experimentation
  - Stellar Examples: Best outputs to reference for quality standards
  - Next Steps: Future refinement ideas
- Preserve detailed round documentation in LEARNINGS-round{N}.md files for reference
- Target: LEARNINGS.md = 100-150 lines; detailed round files can be longer

Preserve Experiment History

Keep all experiment folders for future reference
Maintain audit trail of tool's evolution

Structure:

./experiments/
  ├── {experiment-name-A}/
  │   ├── README-rev1.md
  │   ├── README-rev2.md
  │   └── ...
  ├── {experiment-name-B}/
  ├── {experiment-name-C}/
  ├── output/
  │   ├── {BOOK}-{CH}-{VS}-{experiment-name}-rev{N}.yaml
  │   └── ...
  ├── LEARNINGS.md (concise summary)
  ├── LEARNINGS-round1.md (detailed)
  ├── LEARNINGS-round2.md (detailed)
  └── ...

Production Readiness Checklist

Before declaring tool production-ready, verify:
- ✅ Tested on 25+ diverse verses
- ✅ Quality consistently 8.5+/10 across verse types
- ✅ README is clear, concise (150-250 lines)
- ✅ Schema is validated and optimized
- ✅ LEARNINGS.md documents what works and what doesn't
- ✅ Stellar examples are included in README
- ✅ No unresolved critical issues
- ✅ Tool is scalable (can be applied to all 31,000 verses without modification)
Final Comprehensive Validation

Goal: Go back over ALL key files and validate EVERY requirement was met

This is the final review before production deployment. Be thorough and systematic.

Step 1: Validate Tool Registry Entry
- Tool is registered in /bible-study-tools/tool-registry.yaml
- Scope is appropriate based on file sizes:
  - core if files <50 KB and essential for all queries
  - standard if files 50-500 KB and useful for typical study
  - comprehensive if files >500 KB or only for exhaustive analysis
- Category accurately reflects tool's purpose
- Summary is concise (max 20 words) and describes value
- Tool suffix matches filename pattern used in outputs
Step 2: Validate Tool README.md
- Purpose statement is clear and compelling
- Target audience is explicitly defined (pastors, translators, students, etc.)
- Research methodology is documented with specific steps
- All required sources are listed with access methods
- Schema is complete with all required fields documented
- Source access method is optimized:
  - If WebFetch: URL pattern is predictable and templatable
  - If WebSearch: Query patterns are documented
  - If MCP/Script: Integration is documented
- 3-5 stellar examples are included from experiments
- Examples cover diverse verse types (well-known, moderate, obscure)
- Citation format follows @STANDARDIZATION.md
- Quality standards are defined (what makes output excellent?)
- Troubleshooting section based on experiment challenges
- File is 150-250 lines (concise but complete)
Step 3: Validate LEARNINGS.md
- Concise summary format (100-150 lines)
- All rounds are summarized briefly
- "What Worked Well" section lists specific techniques/sources
- "What Worked Poorly" section documents failures
- Key insights are clearly articulated
- Source/method optimization is documented:
  - Table showing access methods and predictability
  - Rationale for chosen approach
- Review committee evolution is documented:
  - Starting committee composition and question count
  - Ending committee composition and question count
  - Rationale for optimization (which reviewers/questions were effective)
- Stellar examples are highlighted
- Next steps for future improvements are listed
- References to detailed LEARNINGS-round{N}.md files
Step 4: Validate Detailed Round Files (LEARNINGS-round{N}.md)
- Each round file documents what changed and why
- Quality improvements are quantified (scores, metrics)
- Review Committee Results tables are present:
  - Documents which reviewers found which issues
  - Documents which questions caught problems
  - Tracks effectiveness over rounds
- Agent feedback is preserved (Top 3 Insights, Challenges, Quality Metrics)
- Blockers and solutions are documented
- Files are preserved for audit trail
Step 5: Validate Experiment README Files
- Each experiment's thesis is clearly stated
- README-rev1 through README-revN show evolution
- Final revision represents optimized approach
- Schema changes are documented with rationale
- Instruction changes are documented with impact
- Files demonstrate iterative improvement
Step 6: Validate Output Files
- Sample outputs from multiple rounds are preserved
- Naming convention is consistent: {BOOK}-{CH:03d}-{VS:03d}-{experiment-name}-rev{N}.yaml
- Final outputs demonstrate quality improvement over rev1
- At least 3 stellar examples are identified for README
- Outputs follow @STANDARDIZATION.md citation format
- YAML is valid and parseable
Step 7: Validate Against Original Requirements

Go back to the original tool request and verify:
- Tool addresses the original problem statement
- Scope is appropriate (not too narrow, not too broad)
- Output format meets user needs
- Data quality enables AI grounding
- Tool is practical for target audience
- Scalability to 31,000 verses is feasible
Step 8: Validate Optimization Completeness
- Source/Method: Chosen approach is most efficient (predictable URLs > search > manual)
- Review Committee: Optimized to minimum effective reviewers and questions
- Schema: No unnecessary fields (tested removal in Round 9)
- Instructions: As concise as possible while maintaining quality
- Performance: No wasted tool calls or redundant steps
Step 9: Final Quality Spot Check

Run the optimized tool on 3 random verses (not previously tested):
- Quality scores are 8.5+/10
- Optimized review committee catches any issues
- Outputs are immediately usable
- Process is efficient (no wasted steps)
- Agent reports confidence and satisfaction
If any items fail validation:
- Document the gap
- Create targeted fix
- Re-test affected areas
- Update relevant files
- Re-run validation
When all validation passes:
- Tool is production-ready
- Proceed to Step 6 (Commit and Create PR)

Commit and Create PR

Commit with comprehensive message documenting experimentation journey

Example:

feat: complete sermon-illustrations tool experimentation

- Completed 9 rounds of experimentation (68 total runs)
- Tested 28 diverse verses across OT and NT
- Cultural-artifacts approach emerged as winner (9.3/10 quality)
- Schema optimized from 45 fields to 32 essential fields
- README refined from 320 lines to 180 lines
- Ready for production deployment

Create PR with summary of experimentation process and results
Link to key LEARNINGS files for reviewers

File Naming Conventions

Follow these strict naming patterns:

Experiment Schema Revisions:

./bible-study-tools/{tool-name}/experiments/{experiment-name}/README-rev{N}.md

Experiment Data Outputs:

./bible-study-tools/{tool-name}/experiments/output/{BOOK}-{CH:03d}-{VS:03d}-{experiment-name}-rev{N}.yaml

Experiment Learnings (Concise Summary):

./bible-study-tools/{tool-name}/experiments/LEARNINGS.md

Experiment Learnings (Detailed Round Documentation):

./bible-study-tools/{tool-name}/experiments/LEARNINGS-round{N}.md

Where:

{tool-name}: The Bible study tool being experimented on
{experiment-name}: A descriptive name for the experimental approach (e.g., "cultural-artifacts", "preacher-transcripts")
{N}: Revision or round number (1, 2, 3, etc.)
{BOOK}: USFM 3.0 three-letter book code (MAT, JHN, GEN, etc.)
{CH:03d}: Zero-padded chapter number (001, 005, 038, etc.)
{VS:03d}: Zero-padded verse number (001, 016, etc.)

Test Verse Selection Strategy

Choose test verses that represent different complexity levels:

High-Context Verses (well-studied, rich theological content):

John 1:1, John 3:16, Matthew 5:3-10, Romans 3:23, Genesis 1:1, Psalm 23:1

Medium-Context Verses (moderately studied, specific contexts):

Colossians 3:1, 1 Samuel 15:3, Daniel 9:25, Psalm 119:105, Acts 17:28

Low-Context Verses (obscure, challenging, less-studied):

Job 38:36, Habakkuk 3:9, Nahum 2:1, Zephaniah 2:14, 3 John 1:4

Genre Diversity:

Narrative: Genesis 1:1, Acts 2:1
Poetry: Psalm 23:1, Psalm 119:105
Prophecy: Isaiah 53:5, Habakkuk 3:9
Epistles: Romans 3:23, Colossians 3:1
Wisdom: Proverbs 3:5, Job 38:36
Apocalyptic: Revelation 21:1

Mix categories to ensure experiments work across the full spectrum of Biblical literature.

Quality Standards

Each experiment should:

Stay true to its core thesis/hypothesis
Generate valid YAML that conforms to schema
Include proper citations following project standards (@STANDARDIZATION.md)
Provide substantial, AI-grounding content (not superficial summaries)
Work consistently across different verse types
Scale to entire Bible without modification

Success Metrics

The experimentation process is complete when:

Sufficient testing: 50+ experiment runs completed across 25+ verses
Clear winner: A winning approach or blend is validated
Consistent quality: Approach works reliably (8.5+/10) across diverse verse types
Optimization complete: Tool is as simple as possible while maintaining quality
Diminishing returns: Further improvements are < 3% per iteration
Production ready: README, LEARNINGS, and stellar examples are finalized
Documented thoroughly: Future tool usage and development is informed by empirical evidence

Expected Timeline

Rough estimates:

Phase 1 (Planning): 1-2 hours
Phase 2 (Round 1): 2-4 hours (9 parallel runs)
Phase 3 (Rounds 2-5): 6-10 hours (36-54 runs across 3 experiments)
Phase 4 (Round 6 Evaluation): 1-2 hours
Phase 5 (Rounds 7-8 Deep Refinement): 4-8 hours (10-30 runs)
Phase 6 (Round 9 Optimization): 2-3 hours (5-10 runs)
Phase 7 (Synthesis): 2-3 hours

Total: 18-32 hours of agent work for a fully validated, production-ready tool

This investment ensures tools are empirically validated rather than based on assumptions, leading to higher quality AI-grounding data for the entire Bible.

Search AI Tools

tool-experimenter

Install this agent skill to your Project

SKILL.md