Agent skill
qlora
Memory-efficient fine-tuning with 4-bit quantization and LoRA adapters. Use when fine-tuning large models (7B+) on consumer GPUs, when VRAM is limited, or when standard LoRA still exceeds memory. Builds on the lora skill.
Install this agent skill to your Project
npx add-skill https://github.com/itsmostafa/llm-engineering-skills/tree/main/skills/qlora
QLoRA: Quantized Low-Rank Adaptation
QLoRA enables fine-tuning of large language models on consumer GPUs by combining 4-bit quantization with LoRA adapters. A 65B model can be fine-tuned on a single 48GB GPU while matching 16-bit fine-tuning performance.
Prerequisites: This skill assumes familiarity with LoRA. See the lora skill for LoRA fundamentals (LoraConfig, target_modules, training patterns).
Table of Contents
- Core Innovations
- BitsAndBytesConfig Deep Dive
- Memory Requirements
- Complete Training Example
- Inference and Merging
- Troubleshooting
- Best Practices
Core Innovations
QLoRA introduces three techniques that reduce memory usage without sacrificing performance:
4-bit NormalFloat (NF4)
NF4 is an information-theoretically optimal quantization data type for normally distributed weights. Neural network weights are typically normally distributed, making NF4 more efficient than standard 4-bit floats.
Storage: 4-bit NF4 (quantized weights)
Compute: 16-bit BF16 (dequantized for forward/backward pass)
The key insight: weights are stored in 4-bit but dequantized to bf16 for computation. Only the frozen base model is quantized; the LoRA adapters stay unquantized (16- or 32-bit) and are the only weights that receive gradient updates.
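The store-in-4-bit, compute-in-higher-precision idea can be sketched in plain Python. This is a minimal illustration only: the codebook below is built from evenly spaced standard-normal quantiles, which is an assumption made for teaching purposes and is not the exact NF4 table used by bitsandbytes. The helper names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_codebook(levels=16):
    """Build an illustrative 4-bit codebook from standard-normal quantiles.

    NOTE: an approximation for illustration, not the exact NF4 table.
    """
    nd = NormalDist()
    # Midpoints of `levels` equal-probability bins under N(0, 1).
    quantiles = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    top = max(abs(q) for q in quantiles)
    return [q / top for q in quantiles]  # rescale into [-1, 1]


def quantize_block(block, codebook):
    """Store each weight as a 4-bit index plus one absmax scale per block."""
    absmax = max(abs(w) for w in block) or 1.0
    indices = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - w / absmax))
        for w in block
    ]
    return indices, absmax


def dequantize_block(indices, absmax, codebook):
    """Recover approximate weights for compute (bf16 in real QLoRA)."""
    return [codebook[i] * absmax for i in indices]


codebook = normal_quantile_codebook()
weights = [0.5, -0.25, 0.1, -1.2]
indices, scale = quantize_block(weights, codebook)
restored = dequantize_block(indices, scale, codebook)
print(indices, [round(w, 3) for w in restored])
```

Because the codebook levels are denser near zero, small weights (the common case under a normal distribution) are reconstructed with less error than they would be on a uniform 16-level grid.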
NF4 vs FP4:
| Quantization | Description | Use Case |
|---|---|---|
| nf4 | NormalFloat 4-bit, optimal for normal distributions | Default, recommended |
| fp4 | Standard 4-bit float | Legacy, rarely needed |
Double Quantization
Standard quantization requires storing scaling constants (typically fp32) for each quantization block. Double quantization quantizes these constants too:
First quantization: weights → 4-bit + fp32 scaling constants
Double quantization: scaling constants → 8-bit + fp32 second-level constants
This saves approximately 0.37 bits per parameter—significant for billion-parameter models:
- 7B model: ~325 MB savings
- 70B model: ~3.2 GB savings
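The savings figures above can be reproduced with a few lines of arithmetic. The sketch assumes a weight block size of 64 and a second-level block size of 256, the values reported in the QLoRA paper; neither is stated in this guide, and the function names are hypothetical.

```python
def constant_overhead_bits(block_size=64, second_block_size=256, double_quant=True):
    """Bits of scaling-constant overhead per model parameter."""
    if not double_quant:
        return 32 / block_size  # one fp32 constant per block
    # 8-bit constant per block, plus one fp32 second-level constant
    # shared by `second_block_size` first-level constants.
    return 8 / block_size + 32 / (block_size * second_block_size)


saved_bits = constant_overhead_bits(double_quant=False) - constant_overhead_bits()


def savings_mb(n_params):
    """Memory saved by double quantization, in MB (1e6 bytes)."""
    return n_params * saved_bits / 8 / 1e6


print(f"{saved_bits:.3f} bits per parameter saved")
print(f"7B: {savings_mb(7e9):.0f} MB, 70B: {savings_mb(70e9) / 1e3:.2f} GB")
```

Under these assumptions the overhead drops from 0.5 bits to about 0.127 bits per parameter, which lines up with the approximate savings listed above.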
Paged Optimizers
During training, gradient checkpointing can cause memory spikes when processing long sequences. Paged optimizers use NVIDIA unified memory to automatically transfer optimizer states between GPU and CPU:
Normal training: OOM on memory spike
Paged optimizers: GPU ↔ CPU transfer handles spikes gracefully
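Paging is not switched on by 4-bit loading itself: select a paged optimizer explicitly (for example optim="paged_adamw_8bit" in the training arguments), after which bitsandbytes manages the GPU ↔ CPU transfers automatically.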
BitsAndBytesConfig Deep Dive
All Parameters Explained
from transformers import BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    # Core 4-bit settings
    load_in_4bit=True,                      # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",              # "nf4" (recommended) or "fp4"
    # Double quantization
    bnb_4bit_use_double_quant=True,         # Quantize the quantization constants
    # Compute precision
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize to this dtype for compute
    # Optional: storage dtype for the packed 4-bit weights (defaults to torch.uint8)
    bnb_4bit_quant_storage=torch.uint8,
)
Compute Dtype Selection
| Dtype | Hardware | Notes |
|---|---|---|
| torch.bfloat16 | Ampere+ (RTX 30xx, A100) | Recommended, faster |
| torch.float16 | Older GPUs (V100, RTX 20xx) | Use if bf16 not supported |
| torch.float32 | Any | Slower, only for debugging |
Check bf16 support:
import torch
print(torch.cuda.is_bf16_supported()) # True on Ampere+
Comparison: Quantization Options
# Recommended: NF4 + double quant + bf16
optimal_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# GPUs without bf16 support: same memory footprint, fp16 compute
fp16_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # fp16 and bf16 are both 16-bit; this saves no memory
)
# 8-bit alternative (less compression, sometimes more stable)
eight_bit_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
Memory Requirements
| Model Size | Full Fine-tuning | LoRA (16-bit) | QLoRA (4-bit) |
|---|---|---|---|
| 7B | ~60 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 34B | ~272 GB | ~75 GB | ~20 GB |
| 70B | ~560 GB | ~160 GB | ~48 GB |
Notes:
- QLoRA memory includes model + optimizer states + activations
- Actual usage varies with batch size, sequence length, and gradient checkpointing
- Add ~20% buffer for safe operation
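To see how much of the QLoRA column is the frozen base model itself, the quantized weight size can be estimated directly. This is a rough sketch that assumes 4 bits per weight plus the scaling-constant overhead for a block size of 64 and a second-level block size of 256 (values from the QLoRA paper, not from this guide). It deliberately ignores adapters, optimizer states, and activations, which account for the rest of the table's totals. The function names are hypothetical.

```python
def quantized_weight_gb(n_params, double_quant=True):
    """Approximate size of the frozen 4-bit base weights, in GB (1e9 bytes)."""
    if double_quant:
        overhead_bits = 8 / 64 + 32 / (64 * 256)
    else:
        overhead_bits = 32 / 64
    return n_params * (4 + overhead_bits) / 8 / 1e9


def with_buffer(gb, buffer=0.20):
    """Apply the ~20% safety margin suggested in the notes above."""
    return gb * (1 + buffer)


for name, n_params in [("7B", 7e9), ("13B", 13e9), ("34B", 34e9), ("70B", 70e9)]:
    weights_gb = quantized_weight_gb(n_params)
    print(f"{name}: ~{weights_gb:.1f} GB of quantized weights")
```

The gap between these weight-only numbers and the table's QLoRA column is what batch size, sequence length, and gradient checkpointing settings end up controlling.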
GPU Recommendations
| GPU VRAM | Max Model Size (QLoRA) |
|---|---|
| 8 GB | 7B (tight) |
| 16 GB | 7-13B |
| 24 GB | 13-34B |
| 48 GB | 34-70B |
| 80 GB | 70B+ comfortably |
Complete Training Example
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch
# 1. Quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# 2. Load quantized model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    # Optional: faster attention. Requires the flash-attn package;
    # remove this line if it is not installed.
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# 3. Prepare for k-bit training (critical step!)
model = prepare_model_for_kbit_training(model)
# 4. LoRA config (see lora skill for parameter details)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# 5. Dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")
def format_example(example):
    # Alpaca rows have an optional "input" field; include it only when present
    if example["input"]:
        return {"text": f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"}
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(format_example)
# 6. Training
sft_config = SFTConfig(
    output_dir="./qlora-output",
    max_seq_length=512,  # Renamed to max_length in newer TRL releases
    dataset_text_field="text",  # Set on SFTConfig, not on SFTTrainer
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    bf16=True,  # Matches bnb_4bit_compute_dtype
    logging_steps=10,
    save_steps=100,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="paged_adamw_8bit",  # Paged optimizer for memory efficiency
)
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
# 7. Save adapter
model.save_pretrained("./qlora-adapter")
tokenizer.save_pretrained("./qlora-adapter")
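As a sanity check on what model.print_trainable_parameters() reports in step 4, the adapter size can be worked out by hand. The sketch below assumes the published Llama-3.1-8B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); these values are not stated in this guide, and the helper name is hypothetical.

```python
# Assumed Llama-3.1-8B dimensions (not stated in this guide).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024  # 8 key/value heads x 128 head dim
NUM_LAYERS = 32

# (in_features, out_features) for each projection targeted in step 4
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, target_modules):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * NUM_LAYERS


all_linear = lora_param_count(16, list(SHAPES))
attention_only = lora_param_count(16, ["q_proj", "v_proj"])
print(f"all linear layers: {all_linear:,} trainable parameters")
print(f"q_proj + v_proj only: {attention_only:,} trainable parameters")
```

Under these assumptions the full target list trains roughly 42M parameters, well under 1% of the base model, while the minimal q_proj/v_proj set trains about a sixth of that.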
Inference and Merging
Inference with Quantized Model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch
model_name = "meta-llama/Llama-3.1-8B"
# Load quantized base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
model.eval()
# Generate
inputs = tokenizer("### Instruction:\nExplain quantum computing.\n\n### Response:\n", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
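The prompt passed to generate should follow the same template that format_example used during training, stopping right where the response begins. One way to keep the two in sync is a small prompt builder; build_prompt is a hypothetical helper, not part of any library.

```python
def build_prompt(instruction, input_text=""):
    """Hypothetical helper: same layout as format_example, minus the answer."""
    if input_text:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


prompt = build_prompt("Explain quantum computing.")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the hard-coded prompt in the generation step.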
Merging to Full Precision
To merge QLoRA adapters into a full-precision model (for deployment without bitsandbytes):
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch
# Load the unquantized base model in bf16 (on CPU to avoid GPU OOM)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
# Load adapter
model = PeftModel.from_pretrained(base_model, "./qlora-adapter")
# Merge and unload
merged_model = model.merge_and_unload()
# Save merged model
merged_model.save_pretrained("./merged-model")
Note: Merging requires enough RAM to hold the unquantized 16-bit model, at roughly 2 bytes per parameter. For 70B models, this means ~140 GB of RAM.
Troubleshooting
CUDA Version Issues
# Check CUDA version
nvcc --version
python -c "import torch; print(torch.version.cuda)"
# Supported CUDA versions depend on the bitsandbytes release
# (this guide assumes CUDA 11.7+). If versions mismatch, reinstall:
pip uninstall -y bitsandbytes
pip install --upgrade bitsandbytes
"cannot find libcudart" or Missing Library Errors
# Find CUDA installation
find /usr -name "libcudart*" 2>/dev/null
# Set environment variable
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Or for conda:
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
Slow Training
Common cause: compute dtype mismatch
# Check if model is using expected dtype
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"{name}: {param.dtype}")
        break  # All LoRA params should match
# Ensure bf16 is used in training args if BitsAndBytesConfig uses bf16
# Mismatch causes constant dtype conversions
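One way to avoid the mismatch is to derive the trainer's precision flags from the compute dtype instead of setting them by hand. The sketch below is a hypothetical helper; it accepts either a torch dtype or its name, so it runs without torch installed.

```python
def precision_flags(compute_dtype):
    """Map a compute dtype (torch dtype or its name) to trainer precision flags."""
    name = str(compute_dtype).replace("torch.", "")
    if name == "bfloat16":
        return {"bf16": True, "fp16": False}
    if name == "float16":
        return {"bf16": False, "fp16": True}
    if name == "float32":
        return {"bf16": False, "fp16": False}
    raise ValueError(f"unsupported compute dtype: {compute_dtype!r}")


print(precision_flags("bfloat16"))
```

In the training example, the returned dictionary could be unpacked into SFTConfig in place of the hard-coded bf16=True, using the same dtype that was given to bnb_4bit_compute_dtype.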
Out of Memory
# 1. Enable gradient checkpointing
model.gradient_checkpointing_enable()
# 2. Reduce batch size, increase accumulation
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
# 3. Use paged optimizer
optim = "paged_adamw_8bit"
# 4. Reduce sequence length
max_seq_length = 256
# 5. Target fewer modules
target_modules = ["q_proj", "v_proj"] # Minimal set
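Step 2 trades per-device batch size for accumulation steps without changing the optimization itself: the effective batch size stays the same. A minimal check, using a hypothetical helper and assuming a single GPU by default:

```python
def effective_batch_size(per_device, accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device * accumulation_steps * num_devices


before = effective_batch_size(4, 4)   # settings from the training example
after = effective_batch_size(1, 16)   # settings from the checklist above
print(before, after)
```

Both settings give 16 examples per optimizer step, so the learning rate from the training example can usually stay as it is; only peak activation memory and wall-clock time per step change.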
Model Loads But Training Fails
# Ensure prepare_model_for_kbit_training is called
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model) # Don't skip this!
# Enable input gradients if needed
model.enable_input_require_grads()
Best Practices
- Always use prepare_model_for_kbit_training: This enables gradient computation through the frozen quantized layers
- Match compute dtype with training precision: If bnb_4bit_compute_dtype=torch.bfloat16, use bf16=True in training args
- Use paged optimizers for large models: optim="paged_adamw_8bit" or "paged_adamw_32bit" handles memory spikes
- Start with NF4 + double quantization: This is the recommended default; only change if debugging
- Gradient checkpointing is essential: Always enable for QLoRA training to fit larger batch sizes
- Test inference before long training runs: Load the model and generate a few tokens to catch configuration issues early
- Monitor GPU memory: Use nvidia-smi or torch.cuda.memory_summary() to track actual usage
- Consider 8-bit for unstable training: If 4-bit training shows instability, try load_in_8bit=True as a middle ground