Agent skills
evaluating-llms-harness

Agent skill

evaluating-llms-harness

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.

View SKILL.md on GitHub Repository

Stars 1,415

Forks 109

Install this agent skill to your Project

npx add-skill https://github.com/foryourhealth111-pixel/Vibe-Skills/tree/main/bundled/skills/evaluating-llms-harness

SKILL.md

lm-evaluation-harness - LLM Benchmarking

Quick start

lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.

Installation:

bash

pip install lm-eval

Evaluate any HuggingFace model:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag \
  --device cuda:0 \
  --batch_size 8

View available tasks:

bash

lm_eval --tasks list

Common workflows

Workflow 1: Standard benchmark evaluation

Evaluate model on core benchmarks (MMLU, GSM8K, HumanEval).

Copy this checklist:

Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model
- [ ] Step 3: Run evaluation
- [ ] Step 4: Analyze results

Step 1: Choose benchmark suite

Core reasoning benchmarks:

MMLU (Massive Multitask Language Understanding) - 57 subjects, multiple choice
GSM8K - Grade school math word problems
HellaSwag - Common sense reasoning
TruthfulQA - Truthfulness and factuality
ARC (AI2 Reasoning Challenge) - Science questions

Code benchmarks:

HumanEval - Python code generation (164 problems)
MBPP (Mostly Basic Python Problems) - Python coding

Standard suite (recommended for model releases):

bash

--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge

Step 2: Configure model

HuggingFace model:

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size auto  # Auto-detect optimal batch size

Quantized model (4-bit/8-bit):

bash

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
  --tasks mmlu \
  --device cuda:0

Custom checkpoint:

bash

lm_eval --model hf \
  --model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
  --tasks mmlu \
  --device cuda:0

Step 3: Run evaluation

bash

# Full MMLU evaluation (57 subjects)
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --num_fewshot 5 \  # 5-shot evaluation (standard)
  --batch_size 8 \
  --output_path results/ \
  --log_samples  # Save individual predictions

# Multiple benchmarks at once
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
  --num_fewshot 5 \
  --batch_size 8 \
  --output_path results/llama2-7b-eval.json

Step 4: Analyze results

Results saved to results/llama2-7b-eval.json:

json

{
  "results": {
    "mmlu": {
      "acc": 0.459,
      "acc_stderr": 0.004
    },
    "gsm8k": {
      "exact_match": 0.142,
      "exact_match_stderr": 0.006
    },
    "hellaswag": {
      "acc_norm": 0.765,
      "acc_norm_stderr": 0.004
    }
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=meta-llama/Llama-2-7b-hf",
    "num_fewshot": 5
  }
}

Workflow 2: Track training progress

Evaluate checkpoints during training.

Training Progress Tracking:
- [ ] Step 1: Set up periodic evaluation
- [ ] Step 2: Choose quick benchmarks
- [ ] Step 3: Automate evaluation
- [ ] Step 4: Plot learning curves

Step 1: Set up periodic evaluation

Evaluate every N training steps:

bash

#!/bin/bash
# eval_checkpoint.sh

CHECKPOINT_DIR=$1
STEP=$2

lm_eval --model hf \
  --model_args pretrained=$CHECKPOINT_DIR/checkpoint-$STEP \
  --tasks gsm8k,hellaswag \
  --num_fewshot 0 \  # 0-shot for speed
  --batch_size 16 \
  --output_path results/step-$STEP.json

Step 2: Choose quick benchmarks

Fast benchmarks for frequent evaluation:

HellaSwag: ~10 minutes on 1 GPU
GSM8K: ~5 minutes
PIQA: ~2 minutes

Avoid for frequent eval (too slow):

MMLU: ~2 hours (57 subjects)
HumanEval: Requires code execution

Step 3: Automate evaluation

Integrate with training script:

python

# In training loop
if step % eval_interval == 0:
    model.save_pretrained(f"checkpoints/step-{step}")

    # Run evaluation
    os.system(f"./eval_checkpoint.sh checkpoints step-{step}")

Or use PyTorch Lightning callbacks:

python

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        step = trainer.global_step
        checkpoint_path = f"checkpoints/step-{step}"

        # Save checkpoint
        trainer.save_checkpoint(checkpoint_path)

        # Run lm-eval
        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")

Step 4: Plot learning curves

python

import json
import matplotlib.pyplot as plt

# Load all results
steps = []
mmlu_scores = []

for file in sorted(glob.glob("results/step-*.json")):
    with open(file) as f:
        data = json.load(f)
        step = int(file.split("-")[1].split(".")[0])
        steps.append(step)
        mmlu_scores.append(data["results"]["mmlu"]["acc"])

# Plot
plt.plot(steps, mmlu_scores)
plt.xlabel("Training Step")
plt.ylabel("MMLU Accuracy")
plt.title("Training Progress")
plt.savefig("training_curve.png")

Workflow 3: Compare multiple models

Benchmark suite for model comparison.

Model Comparison:
- [ ] Step 1: Define model list
- [ ] Step 2: Run evaluations
- [ ] Step 3: Generate comparison table

Step 1: Define model list

bash

# models.txt
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
mistralai/Mistral-7B-v0.1
microsoft/phi-2

Step 2: Run evaluations

bash

#!/bin/bash
# eval_all_models.sh

TASKS="mmlu,gsm8k,hellaswag,truthfulqa"

while read model; do
    echo "Evaluating $model"

    # Extract model name for output file
    model_name=$(echo $model | sed 's/\//-/g')

    lm_eval --model hf \
      --model_args pretrained=$model,dtype=bfloat16 \
      --tasks $TASKS \
      --num_fewshot 5 \
      --batch_size auto \
      --output_path results/$model_name.json

done < models.txt

Step 3: Generate comparison table

python

import json
import pandas as pd

models = [
    "meta-llama-Llama-2-7b-hf",
    "meta-llama-Llama-2-13b-hf",
    "mistralai-Mistral-7B-v0.1",
    "microsoft-phi-2"
]

tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]

results = []
for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
        row = {"Model": model.replace("-", "/")}
        for task in tasks:
            # Get primary metric for each task
            metrics = data["results"][task]
            if "acc" in metrics:
                row[task.upper()] = f"{metrics['acc']:.3f}"
            elif "exact_match" in metrics:
                row[task.upper()] = f"{metrics['exact_match']:.3f}"
        results.append(row)

df = pd.DataFrame(results)
print(df.to_markdown(index=False))

Output:

| Model                  | MMLU  | GSM8K | HELLASWAG | TRUTHFULQA |
|------------------------|-------|-------|-----------|------------|
| meta-llama/Llama-2-7b  | 0.459 | 0.142 | 0.765     | 0.391      |
| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801     | 0.430      |
| mistralai/Mistral-7B   | 0.626 | 0.395 | 0.812     | 0.428      |
| microsoft/phi-2        | 0.560 | 0.613 | 0.682     | 0.447      |

Workflow 4: Evaluate with vLLM (faster inference)

Use vLLM backend for 5-10x faster evaluation.

vLLM Evaluation:
- [ ] Step 1: Install vLLM
- [ ] Step 2: Configure vLLM backend
- [ ] Step 3: Run evaluation

Step 1: Install vLLM

bash

pip install vllm

Step 2: Configure vLLM backend

bash

lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
  --tasks mmlu \
  --batch_size auto

Step 3: Run evaluation

vLLM is 5-10× faster than standard HuggingFace:

bash

# Standard HF: ~2 hours for MMLU on 7B model
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --batch_size 8

# vLLM: ~15-20 minutes for MMLU on 7B model
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
  --tasks mmlu \
  --batch_size auto

When to use vs alternatives

Use lm-evaluation-harness when:

Benchmarking models for academic papers
Comparing model quality across standard tasks
Tracking training progress
Reporting standardized metrics (everyone uses same prompts)
Need reproducible evaluation

Use alternatives instead:

HELM (Stanford): Broader evaluation (fairness, efficiency, calibration)
AlpacaEval: Instruction-following evaluation with LLM judges
MT-Bench: Conversational multi-turn evaluation
Custom scripts: Domain-specific evaluation

Common issues

Issue: Evaluation too slow

Use vLLM backend:

bash

lm_eval --model vllm \
  --model_args pretrained=model-name,tensor_parallel_size=2

Or reduce fewshot examples:

bash

--num_fewshot 0  # Instead of 5

Or evaluate subset of MMLU:

bash

--tasks mmlu_stem  # Only STEM subjects

Issue: Out of memory

Reduce batch size:

bash

--batch_size 1  # Or --batch_size auto

Use quantization:

bash

--model_args pretrained=model-name,load_in_8bit=True

Enable CPU offloading:

bash

--model_args pretrained=model-name,device_map=auto,offload_folder=offload

Issue: Different results than reported

Check fewshot count:

bash

--num_fewshot 5  # Most papers use 5-shot

Check exact task name:

bash

--tasks mmlu  # Not mmlu_direct or mmlu_fewshot

Verify model and tokenizer match:

bash

--model_args pretrained=model-name,tokenizer=same-model-name

Issue: HumanEval not executing code

Install execution dependencies:

bash

pip install human-eval

Enable code execution:

bash

lm_eval --model hf \
  --model_args pretrained=model-name \
  --tasks humaneval \
  --allow_code_execution  # Required for HumanEval

Advanced topics

Benchmark descriptions: See references/benchmark-guide.md for detailed description of all 60+ tasks, what they measure, and interpretation.

Custom tasks: See references/custom-tasks.md for creating domain-specific evaluation tasks.

API evaluation: See references/api-evaluation.md for evaluating OpenAI, Anthropic, and other API models.

Multi-GPU strategies: See references/distributed-eval.md for data parallel and tensor parallel evaluation.

Hardware requirements

GPU: NVIDIA (CUDA 11.8+), works on CPU (very slow)
VRAM:
- 7B model: 16GB (bf16) or 8GB (8-bit)
- 13B model: 28GB (bf16) or 14GB (8-bit)
- 70B model: Requires multi-GPU or quantization
Time (7B model, single A100):
- HellaSwag: 10 minutes
- GSM8K: 5 minutes
- MMLU (full): 2 hours
- HumanEval: 20 minutes

Resources

GitHub: https://github.com/EleutherAI/lm-evaluation-harness
Docs: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs
Task library: 60+ tasks including MMLU, GSM8K, HumanEval, TruthfulQA, HellaSwag, ARC, WinoGrande, etc.
Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (uses this harness)

Maintainer

foryourhealth111-pixel Core maintainer

Source details

Full Name: foryourhealth111-pixel/Vibe-Skills
Branch: main
Path in repo: bundled/skills/evaluating-llms-harness
License: Apache License 2.0
Topics: claude-code anthropic claude agent-skills automation mcp ai-agents cursor developer-tools agentic-coding skills llm codex claude-skills vibe-coding vibecoding opencode ai-skills ai-workflow windsurf

Featured Tools

Join Our Newsletter

Access ZINC (230M+ purchasable compounds). Search by ZINC ID/SMILES, similarity searches, 3D-ready structures for docking, analog discovery, for virtual screening and drug discovery.

1,415 109

Explore

Didn't find tool you were looking for?

Search AI Tools

Install this agent skill to your Project

SKILL.md

lm-evaluation-harness - LLM Benchmarking

Quick start

Common workflows

Workflow 1: Standard benchmark evaluation

Workflow 2: Track training progress

Workflow 3: Compare multiple models

Workflow 4: Evaluate with vLLM (faster inference)

When to use vs alternatives

Common issues

Advanced topics

Hardware requirements

Resources

Recommended Agent Skills

pufferlib

fluidsim

metabolomics-workbench-database

build-error-resolver

geniml

zinc-database