Agent skill

openrlhf-training

High-performance RLHF framework with Ray and vLLM acceleration. Use it for PPO, GRPO, RLOO, and DPO training of large models (7B-70B+). Built on Ray, vLLM, and DeepSpeed ZeRO-3. Reported as 2× faster than DeepSpeedChat thanks to its distributed architecture and GPU resource sharing.

Install this agent skill to your Project

npx add-skill https://github.com/davila7/claude-code-templates/tree/main/cli-tool/components/skills/ai-research/post-training-openrlhf

SKILL.md

OpenRLHF - High-Performance RLHF Training

Quick start

OpenRLHF is a Ray-based RLHF framework optimized for distributed training with vLLM inference acceleration.

Installation:

bash
# Launch Docker container
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN \
  -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:25.02-py3 bash

# Uninstall conflicts
sudo pip uninstall xgboost transformer_engine flash_attn pynvml -y

# Install OpenRLHF with vLLM
pip install "openrlhf[vllm]"

PPO Training (Hybrid Engine):

bash
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8

ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"working_dir": "/openrlhf"}' \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
  --critic_num_nodes 1 --critic_num_gpus_per_node 8 \
  --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
  --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
  --colocate_all_models \
  --vllm_gpu_memory_utilization 0.5 \
  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
  --save_path ./output/llama3-8b-rlhf \
  --micro_train_batch_size 8 --train_batch_size 128 \
  --micro_rollout_batch_size 16 --rollout_batch_size 1024 \
  --max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
  --zero_stage 3 --bf16 \
  --actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
  --init_kl_coef 0.01 --normalize_reward \
  --gradient_checkpointing --packing_samples \
  --vllm_enable_sleep --deepspeed_enable_sleep
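
Before submitting the job, it can help to sanity-check the arithmetic behind these flags. The sketch below is plain shell and not part of OpenRLHF; the helper names `grad_accum_steps` and `engines_match_gpus` are hypothetical. It assumes DeepSpeed's rule that the global batch equals micro batch × GPU count × accumulation steps, and it assumes that with `--colocate_all_models` the vLLM engines should cover exactly the actor GPUs.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking flag arithmetic; these are not OpenRLHF APIs.

# Print the gradient accumulation steps implied by a global batch size,
# a micro batch size, and a GPU count. Returns 1 if the split is uneven.
grad_accum_steps() (
  train_batch=$1; micro_batch=$2; num_gpus=$3
  per_step=$(( micro_batch * num_gpus ))
  if [ $(( train_batch % per_step )) -ne 0 ]; then
    echo "train_batch_size $train_batch is not a multiple of $per_step" >&2
    exit 1
  fi
  echo $(( train_batch / per_step ))
)

# Succeed when vllm_num_engines * vllm_tensor_parallel_size equals the
# number of actor GPUs (assumed requirement when all models are colocated).
engines_match_gpus() (
  engines=$1; tp_size=$2; actor_gpus=$3
  [ $(( engines * tp_size )) -eq "$actor_gpus" ]
)

# Values from the PPO command above: 128 / (8 * 8) = 2 accumulation steps.
grad_accum_steps 128 8 8
engines_match_gpus 4 2 8 && echo "vLLM engines cover all 8 GPUs"
```

If either check fails for your own values, adjust the batch sizes or the engine layout before launching.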

GRPO Training (Group Relative Policy Optimization):

bash
# Same command as PPO, but add:
--advantage_estimator group_norm

Common workflows

Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)

Step 1: Train reward model:

bash
deepspeed --module openrlhf.cli.train_rm \
  --save_path ./output/llama3-8b-rm \
  --save_steps -1 --logging_steps 1 \
  --eval_steps -1 --train_batch_size 256 \
  --micro_train_batch_size 1 --pretrain meta-llama/Meta-Llama-3-8B \
  --bf16 --max_epochs 1 --max_len 8192 \
  --zero_stage 3 --learning_rate 9e-6 \
  --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
  --apply_chat_template --chosen_key chosen \
  --rejected_key rejected --flash_attn --gradient_checkpointing
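
For orientation, a reward model trained on chosen/rejected pairs is usually fit with a pairwise ranking loss. The formula below is the standard Bradley-Terry form, given here as an assumption about what this step optimizes rather than a statement about OpenRLHF's exact implementation. The symbols $r_\phi$, $y_w$, and $y_l$ are introduced for illustration; $y_w$ corresponds to `--chosen_key` and $y_l$ to `--rejected_key`.

```latex
% Pairwise (Bradley-Terry) ranking loss.
% y_w: chosen response, y_l: rejected response, r_\phi: scalar reward model.
\[
\mathcal{L}_{\mathrm{RM}}(\phi)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]
\]
```

The loss falls as the model scores the chosen response above the rejected one for the same prompt.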

Step 2: PPO training, starting from an SFT checkpoint and the reward model from Step 1:

bash
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8

ray job submit --address="http://127.0.0.1:8265" \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
  --critic_num_nodes 1 --critic_num_gpus_per_node 8 \
  --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
  --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
  --colocate_all_models \
  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
  --reward_pretrain ./output/llama3-8b-rm \
  --save_path ./output/llama3-8b-ppo \
  --micro_train_batch_size 8 --train_batch_size 128 \
  --micro_rollout_batch_size 16 --rollout_batch_size 1024 \
  --max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
  --zero_stage 3 --bf16 \
  --actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
  --init_kl_coef 0.01 --normalize_reward \
  --vllm_enable_sleep --deepspeed_enable_sleep

Workflow 2: GRPO training (no critic model needed)

Memory-efficient alternative to PPO:

bash
ray job submit --address="http://127.0.0.1:8265" \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --advantage_estimator group_norm \
  --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
  --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
  --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
  --colocate_all_models \
  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
  --save_path ./output/llama3-8b-grpo \
  --micro_train_batch_size 8 --train_batch_size 128 \
  --micro_rollout_batch_size 16 --rollout_batch_size 1024 \
  --max_epochs 1 --bf16 \
  --actor_learning_rate 5e-7 \
  --init_kl_coef 0.01 --use_kl_loss --kl_estimator k3 \
  --normalize_reward --no_advantage_std_norm

Key GRPO parameters:

  • --advantage_estimator group_norm - Enables GRPO
  • --use_kl_loss - KL loss from GRPO paper
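  • --kl_estimator k3 - Selects the k3 KL estimator for the KL loss (the k2 loss is approximately equivalent to k1)
  • --no_advantage_std_norm - Disables standard-deviation normalization of the advantages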
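
For reference, the quantities these flags control can be written out as follows. This is a sketch in the notation of the GRPO paper and the usual k1/k2/k3 KL estimators, not a transcription of OpenRLHF's source. The symbols $r_i$, $G$, and $\rho$ are introduced here for illustration.

```latex
% Group-normalized advantage for G responses sampled from the same prompt,
% with scalar rewards r_1, ..., r_G:
\[
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}
           {\operatorname{std}(r_1,\dots,r_G)}
\]
% Assumed effect of --no_advantage_std_norm: the division by std is dropped.
\[
A_i = r_i - \operatorname{mean}(r_1,\dots,r_G)
\]
% Per-token KL estimators, with
% \rho = \pi_{\mathrm{ref}}(y_t \mid x, y_{<t}) / \pi_\theta(y_t \mid x, y_{<t})
% and y_t sampled from \pi_\theta:
\[
k_1 = -\log\rho, \qquad
k_2 = \tfrac{1}{2}(\log\rho)^2, \qquad
k_3 = \rho - 1 - \log\rho
\]
```

Because the baseline is the group mean, no learned value function is involved, which is why GRPO runs without a critic.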

Workflow 3: DPO training (preference optimization)

Simpler alternative without reward model:

bash
deepspeed --module openrlhf.cli.train_dpo \
  --save_path ./output/llama3-8b-dpo \
  --save_steps -1 --logging_steps 1 \
  --eval_steps -1 --train_batch_size 256 \
  --micro_train_batch_size 2 --pretrain meta-llama/Meta-Llama-3-8B \
  --bf16 --max_epochs 1 --max_len 8192 \
  --zero_stage 3 --learning_rate 5e-7 --beta 0.1 \
  --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
  --apply_chat_template --chosen_key chosen \
  --rejected_key rejected --flash_attn --gradient_checkpointing
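
The objective behind this command can be stated compactly. The formula below is the standard DPO loss from the original DPO paper, shown as a reference sketch rather than a description of OpenRLHF internals. Here $\beta$ corresponds to `--beta 0.1`, $y_w$ and $y_l$ to the chosen and rejected responses, and $\pi_{\mathrm{ref}}$ is assumed to be the frozen starting model.

```latex
% Standard DPO loss over preference triples (x, y_w, y_l).
\[
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[\log \sigma\!\left(
      \beta\left[
        \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right]
    \right)\right]
\]
```

The policy is trained directly on preference pairs, so no separate reward model or rollout stage is needed.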

When to use vs alternatives

Use OpenRLHF when:

  • Training large models (7B-70B+) with RL
  • Need vLLM inference acceleration
  • Want distributed architecture with Ray
  • Have multi-node GPU cluster
  • Need PPO/GRPO/RLOO/DPO in one framework

Algorithm selection:

  • PPO: Maximum control, best for complex rewards
  • GRPO: Memory-efficient, no critic needed
  • RLOO: Leave-one-out baseline, modified to use per-token KL reward and the PPO-clip loss
  • REINFORCE++: More stable than GRPO, faster than PPO
  • DPO: Simplest, no reward model needed

Use alternatives instead:

  • TRL: Single-node training, simpler API
  • veRL: ByteDance's framework for 671B models
  • DeepSpeedChat: Integrated with DeepSpeed ecosystem

Common issues

Issue: GPU OOM with large models

Disable model colocation:

bash
# Remove --colocate_all_models flag
# Allocate separate GPUs for each model
--actor_num_gpus_per_node 8 \
--critic_num_gpus_per_node 8 \
--reward_num_gpus_per_node 8 \
--ref_num_gpus_per_node 8
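
Dropping colocation raises the total GPU count, so it is worth adding up the request first. The sketch below is plain shell with a hypothetical helper, `total_gpus_without_colocation`. It assumes that without colocation each model group and each vLLM engine reserves its own GPUs, which is an assumption about scheduling, not something stated above.

```bash
#!/usr/bin/env bash
# Hypothetical helper; not an OpenRLHF API.
# Sum the GPUs requested when no models share devices (assumed behavior).
total_gpus_without_colocation() (
  actor=$1; critic=$2; reward=$3; ref=$4; engines=$5; tp_size=$6
  echo $(( actor + critic + reward + ref + engines * tp_size ))
)

# Flags above, plus 4 vLLM engines with tensor parallel size 2:
# 8 + 8 + 8 + 8 + 4 * 2 = 40 GPUs.
total_gpus_without_colocation 8 8 8 8 4 2
```

On a single 8-GPU node this layout will not fit, so either lower the per-model GPU counts or move to a multi-node cluster.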

Issue: DeepSpeed GPU index out of range

Set environment variable:

bash
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1

Issue: Training instability

Use Hybrid Engine instead of asynchronous training:

bash
--colocate_all_models \
--vllm_enable_sleep \
--deepspeed_enable_sleep

Adjust KL coefficient:

bash
--init_kl_coef 0.05  # Increase from 0.01
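
To see why a larger coefficient damps instability, recall the usual KL-shaped reward in PPO-style RLHF. The formula below is the textbook form, assumed here rather than taken from OpenRLHF's code. $\beta$ stands for `--init_kl_coef`, $r_{\mathrm{RM}}$ for the reward model score, and $T$ for the final token of the response.

```latex
% Per-token reward: KL penalty at every token, reward model score at the last one.
\[
\tilde r_t
  = -\,\beta \,\log\frac{\pi_\theta(y_t \mid x, y_{<t})}
                        {\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}
    + \mathbb{1}[t = T]\; r_{\mathrm{RM}}(x, y)
\]
```

Raising $\beta$ from 0.01 to 0.05 makes each token's drift from the reference policy five times more costly, which keeps updates closer to the starting model.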

Issue: Slow generation during PPO

Enable vLLM acceleration:

bash
--vllm_num_engines 4 \
--vllm_tensor_parallel_size 2 \
--vllm_gpu_memory_utilization 0.5

Advanced topics

Hybrid Engine GPU sharing: See references/hybrid-engine.md for vLLM sleep mode, DeepSpeed sleep mode, and optimal node allocation.

Algorithm comparison: See references/algorithm-comparison.md for PPO vs GRPO vs RLOO vs REINFORCE++ benchmarks and hyperparameters.

Multi-node setup: See references/multi-node-training.md for Ray cluster configuration and fault tolerance.

Custom reward functions: See references/custom-rewards.md for reinforced fine-tuning and agent RLHF.

Hardware requirements

  • GPU: NVIDIA A100/H100 recommended
  • VRAM:
    • 7B model: 8× A100 40GB (Hybrid Engine)
    • 70B model: 48× A100 80GB (vLLM:Actor:Critic = 1:1:1)
  • Multi-node: Ray cluster with InfiniBand recommended
  • Docker: NVIDIA PyTorch container 25.02+

Performance:

  • 2× faster than DeepSpeedChat
  • vLLM inference acceleration
  • Hybrid Engine minimizes GPU idle time
