Agent skill

bio-de-edger-basics

Perform differential expression analysis using edgeR in R/Bioconductor. Use for analyzing RNA-seq count data with the quasi-likelihood F-test framework, creating DGEList objects, normalization, dispersion estimation, and statistical testing. Use when performing DE analysis with edgeR.

Stars 2,009
Forks 275

Install this agent skill to your Project

npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/bio-de-edger-basics

SKILL.md

Version Compatibility

Reference examples tested with: DESeq2 1.42+, edgeR 4.0+, limma 3.58+, scanpy 1.10+

Before using code patterns, verify installed versions match. If versions differ:

  • R: packageVersion('<pkg>') then ?function_name to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

edgeR Basics

Differential expression analysis using edgeR's quasi-likelihood framework for RNA-seq count data.

Required Libraries

r
library(edgeR)
library(limma)  # For design matrices and voom

Installation

r
if (!require('BiocManager', quietly = TRUE))
    install.packages('BiocManager')
BiocManager::install('edgeR')

Creating DGEList Object

Goal: Construct an edgeR container from a count matrix with sample group information.

Approach: Wrap raw counts and group labels into a DGEList object for normalization and testing.

"Load my RNA-seq counts into edgeR" → Create a DGEList from a count matrix with sample group assignments and optional gene annotations.

r
# From count matrix
# counts: matrix with genes as rows, samples as columns
# group: factor indicating sample groups
y <- DGEList(counts = counts, group = group)

# With gene annotation
y <- DGEList(counts = counts, group = group, genes = gene_info)

# Check structure
y

Standard edgeR Workflow (Quasi-Likelihood)

Goal: Run the complete edgeR QL pipeline from raw counts to differentially expressed gene lists.

Approach: Filter, normalize (TMM), estimate dispersions, fit quasi-likelihood GLM, and test coefficients with the QL F-test.

"Find differentially expressed genes between my groups" → Test for significant expression differences using negative binomial models with quasi-likelihood F-tests.

r
# Create DGEList
y <- DGEList(counts = counts, group = group)

# Filter low-expression genes
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]

# Normalize (TMM by default)
y <- calcNormFactors(y)

# Create design matrix
design <- model.matrix(~ group)

# Estimate dispersion (optional in edgeR v4+ but improves BCV plots)
y <- estimateDisp(y, design)

# Fit quasi-likelihood model
fit <- glmQLFit(y, design)

# Perform quasi-likelihood F-test
qlf <- glmQLFTest(fit, coef = 2)

# View top genes
topTags(qlf)

Filtering Low-Expression Genes

Goal: Remove genes with insufficient expression to reduce noise and multiple testing burden.

Approach: Apply automatic or manual CPM/count thresholds requiring expression in a minimum number of samples.

r
# Automatic filtering (recommended)
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]

# Manual filtering: CPM threshold
keep <- rowSums(cpm(y) > 1) >= 2  # At least 2 samples with CPM > 1
y <- y[keep, , keep.lib.sizes = FALSE]

# Filter by minimum counts
keep <- rowSums(y$counts >= 10) >= 3  # At least 3 samples with 10+ counts
y <- y[keep, , keep.lib.sizes = FALSE]

Normalization Methods

Goal: Correct for differences in library composition between samples.

Approach: Compute TMM (or alternative) normalization factors that adjust effective library sizes.

r
# TMM normalization (default, recommended)
y <- calcNormFactors(y, method = 'TMM')

# Alternative methods
y <- calcNormFactors(y, method = 'RLE')      # Relative Log Expression
y <- calcNormFactors(y, method = 'upperquartile')
y <- calcNormFactors(y, method = 'none')     # No normalization

# View normalization factors
y$samples$norm.factors

Design Matrices

Goal: Define the linear model structure for the experimental design.

Approach: Build model matrices encoding group, batch, and interaction terms for the GLM.

r
# Simple two-group comparison
design <- model.matrix(~ group)

# With batch correction
design <- model.matrix(~ batch + group)

# Interaction model
design <- model.matrix(~ genotype + treatment + genotype:treatment)

# No intercept (for direct group comparisons)
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

Dispersion Estimation

Goal: Estimate biological variability (dispersion) to parameterize the negative binomial model.

Approach: Compute common, trended, and gene-wise dispersions using empirical Bayes moderation.

r
# Estimate all dispersions
y <- estimateDisp(y, design)

# Or estimate separately
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

# View dispersions
y$common.dispersion
y$trended.dispersion
y$tagwise.dispersion

# Plot BCV (biological coefficient of variation)
plotBCV(y)

Quasi-Likelihood Testing

Goal: Test for differential expression using the quasi-likelihood framework for robust inference.

Approach: Fit a QL GLM and test individual coefficients, contrasts, or multiple coefficients simultaneously.

r
# Fit QL model
fit <- glmQLFit(y, design)

# Test specific coefficient
qlf <- glmQLFTest(fit, coef = 2)

# Test with contrast
contrast <- makeContrasts(groupB - groupA, levels = design)
qlf <- glmQLFTest(fit, contrast = contrast)

# Test multiple coefficients (ANOVA-like)
qlf <- glmQLFTest(fit, coef = 2:3)

Making Contrasts

Goal: Define specific pairwise or complex group comparisons for testing.

Approach: Use makeContrasts with a no-intercept design to specify arbitrary between-group differences.

r
# Design without intercept
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Pairwise comparisons
contrast <- makeContrasts(
    TreatedVsControl = treated - control,
    DrugAVsControl = drugA - control,
    DrugBVsControl = drugB - control,
    DrugAVsDrugB = drugA - drugB,
    levels = design
)

# Test each contrast
qlf_treated <- glmQLFTest(fit, contrast = contrast[, 'TreatedVsControl'])
qlf_drugA <- glmQLFTest(fit, contrast = contrast[, 'DrugAVsControl'])

Accessing Results

Goal: Retrieve and filter DE results from the fitted model.

Approach: Use topTags to extract ranked gene lists with FDR-corrected p-values.

r
# Top differentially expressed genes
topTags(qlf, n = 20)

# All results as data frame
results <- topTags(qlf, n = Inf)$table

# Summary of DE genes at different thresholds
summary(decideTests(qlf))

# Get DE genes with specific cutoffs
de_genes <- topTags(qlf, n = Inf, p.value = 0.05)$table

Result Columns

Column Description
logFC Log2 fold change
logCPM Average log2 counts per million
F Quasi-likelihood F-statistic
PValue Raw p-value
FDR False discovery rate (adjusted p-value)

Alternative: Exact Test (Classic edgeR)

Goal: Perform a simple two-group DE test without a design matrix.

Approach: Use the classic edgeR exact test based on the negative binomial distribution.

r
# For simple two-group comparison only
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
y <- estimateDisp(y)

# Exact test
et <- exactTest(y)
topTags(et)

Alternative: glmLRT (Likelihood Ratio Test)

Goal: Test for DE using likelihood ratio tests as an alternative to the QL F-test.

Approach: Fit a standard GLM and compare nested models via the likelihood ratio statistic.

r
# Fit GLM
fit <- glmFit(y, design)

# Likelihood ratio test
lrt <- glmLRT(fit, coef = 2)
topTags(lrt)

Treat Test (Log Fold Change Threshold)

Goal: Test whether genes exceed a minimum fold change threshold, not just differ from zero.

Approach: Use glmTreat to apply a fold change threshold directly in the statistical test.

r
# Test for |logFC| > threshold
tr <- glmTreat(fit, coef = 2, lfc = log2(1.5))  # |FC| > 1.5
topTags(tr)

Multi-Factor Designs

Goal: Test for condition effects while adjusting for batch or other covariates.

Approach: Include nuisance variables in the design matrix so the QL test controls for them.

r
# Design with batch correction
design <- model.matrix(~ batch + condition, data = sample_info)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Test condition effect (controlling for batch)
# Condition coefficient is typically the last
qlf <- glmQLFTest(fit, coef = ncol(design))

Getting Normalized Counts

Goal: Obtain normalized expression values for visualization and downstream analysis.

Approach: Compute CPM or log-CPM values using TMM-adjusted library sizes.

r
# Counts per million (CPM)
cpm_values <- cpm(y)

# Log2 CPM
log_cpm <- cpm(y, log = TRUE)

# RPM (reads per million, same as CPM)
rpm_values <- cpm(y)

# With prior count for log transformation
log_cpm <- cpm(y, log = TRUE, prior.count = 2)

Exporting Results

Goal: Save DE results and normalized counts to files for sharing or downstream tools.

Approach: Extract all results via topTags and write as CSV alongside CPM values.

r
# Get all results
all_results <- topTags(qlf, n = Inf)$table

# Add gene IDs as column
all_results$gene_id <- rownames(all_results)

# Write to file
write.csv(all_results, file = 'edger_results.csv', row.names = FALSE)

# Export normalized counts
write.csv(cpm(y), file = 'cpm_values.csv')

Common Errors

Error Cause Solution
"design matrix not full rank" Confounded variables Check sample metadata
"No residual df" Too few samples Need more replicates
"NA/NaN/Inf" Zero counts in all samples Filter more stringently

Deprecated/Changed Functions

Old Status New
decidetestsDGE() Removed (v4.4) decideTests()
glmFit() + glmLRT() Still works Prefer glmQLFit() + glmQLFTest()
estimateDisp() Optional (v4+) glmQLFit() estimates internally
mglmLS(), mglmSimple() Retired mglmLevenberg() or mglmOneWay()

Note: calcNormFactors() and normLibSizes() are synonyms - both work.

Quick Reference: Workflow Steps

r
# 1. Create DGEList
y <- DGEList(counts = counts, group = group)

# 2. Filter low-expression genes
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]

# 3. Normalize
y <- calcNormFactors(y)

# 4. Create design matrix
design <- model.matrix(~ group)

# 5. Estimate dispersion (optional in v4+)
y <- estimateDisp(y, design)

# 6. Fit quasi-likelihood model
fit <- glmQLFit(y, design)

# 7. Test for DE
qlf <- glmQLFTest(fit, coef = 2)

# 8. Get results
topTags(qlf, n = 20)

Choosing edgeR vs DESeq2

Aspect edgeR DESeq2
Model Negative binomial + QL Negative binomial
Shrinkage Empirical Bayes on dispersions Shrinkage estimators for LFC
Small samples Robust with QL framework Good with shrinkage
Speed Generally faster Slower for large datasets
Output F-statistic, FDR Wald statistic, padj

Related Skills

  • deseq2-basics - Alternative DE analysis with DESeq2
  • de-visualization - MA plots, volcano plots, heatmaps
  • de-results - Extract and export significant genes

Expand your agent's capabilities with these related and highly-rated skills.

FreedomIntelligence/OpenClaw-Medical-Skills

vcf-annotator

Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

chemist-analyst

Analyzes events through chemistry lens using molecular structure, reaction mechanisms, thermodynamics, kinetics, and analytical techniques (spectroscopy, chromatography, mass spectrometry). Provides insights on chemical processes, material properties, reaction pathways, synthesis, and analytical methods. Use when: Chemical reactions, material analysis, synthesis planning, process optimization, environmental chemistry. Evaluates: Molecular structure, reaction mechanisms, yield, selectivity, safety, environmental impact.

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

bio-alignment-io

Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

sleep-analyzer

分析睡眠数据、识别睡眠模式、评估睡眠质量,并提供个性化睡眠改善建议。支持与其他健康数据的关联分析。

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

metabolomics-workbench-database

Access NIH Metabolomics Workbench via REST API (4,200+ studies). Query metabolites, RefMet nomenclature, MS/NMR data, m/z searches, study metadata, for metabolomics and biomarker discovery.

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

bio-hi-c-analysis-matrix-operations

Balance, normalize, and transform Hi-C contact matrices using cooler and cooltools. Apply iterative correction (ICE), compute expected values, and generate observed/expected matrices. Use when normalizing or transforming Hi-C matrices.

2,009 275
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results