Agent skill
bio-workflows-rnaseq-to-de
Install this agent skill to your Project
npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/bio-workflows-rnaseq-to-de
SKILL.md
name: bio-workflows-rnaseq-to-de description: End-to-end RNA-seq workflow from FASTQ files to differential expression results. Covers QC, quantification (Salmon or STAR+featureCounts), and DESeq2 analysis with visualization. Use when running RNA-seq from FASTQ to DE results. tool_type: mixed primary_tool: DESeq2 workflow: true depends_on:
- read-qc/fastp-workflow
- rna-quantification/alignment-free-quant
- rna-quantification/tximport-workflow
- differential-expression/deseq2-basics
- differential-expression/de-visualization qc_checkpoints:
- after_qc: "Q30 >80%, adapter content <5%"
- after_quant: "Mapping rate >70%, >10M reads mapped"
- after_de: "Dispersion fit reasonable, no sample outliers" measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools:
- read_file
- run_shell_command
RNA-seq to Differential Expression Workflow
Complete pipeline from raw FASTQ files to differential expression results.
Workflow Overview
FASTQ files
|
v
[1. QC & Trimming] -----> fastp
|
v
[2. Quantification] ----> Salmon (recommended) or STAR + featureCounts
|
v
[3. Import to R] -------> tximport (for Salmon) or direct counts
|
v
[4. DE Analysis] -------> DESeq2
|
v
[5. Visualization] -----> Volcano, MA, heatmaps
|
v
Significant gene list
Primary Path: Salmon + DESeq2
Step 1: Quality Control with fastp
# Single sample
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
-o sample_R1.trimmed.fq.gz -O sample_R2.trimmed.fq.gz \
--detect_adapter_for_pe \
--qualified_quality_phred 20 \
--length_required 35 \
--html sample_fastp.html
# Batch processing
for sample in sample1 sample2 sample3; do
fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
-o trimmed/${sample}_R1.fq.gz -O trimmed/${sample}_R2.fq.gz \
--detect_adapter_for_pe \
--html qc/${sample}_fastp.html
done
QC Checkpoint 1: Check fastp reports
- Q30 bases >80%
- Adapter content <5%
- Duplication rate reasonable for library type
Step 2: Salmon Quantification
# Build index (once per transcriptome)
salmon index -t transcriptome.fa -i salmon_index -k 31
# Quantify each sample
for sample in sample1 sample2 sample3; do
salmon quant -i salmon_index \
-l A \
-1 trimmed/${sample}_R1.fq.gz \
-2 trimmed/${sample}_R2.fq.gz \
-o quants/${sample} \
--validateMappings \
--gcBias \
--seqBias \
-p 8
done
QC Checkpoint 2: Check Salmon logs
- Mapping rate >70%
-
10 million reads mapped
Step 3: Import with tximport
library(tximport)
library(DESeq2)
# Create tx2gene mapping (Ensembl example)
tx2gene <- read.csv('tx2gene.csv') # columns: TXNAME, GENEID
# List quantification files
samples <- c('sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6')
files <- file.path('quants', samples, 'quant.sf')
names(files) <- samples
# Import transcript-level estimates
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene)
# Create sample metadata
coldata <- data.frame(
condition = factor(c('control', 'control', 'control', 'treated', 'treated', 'treated')),
row.names = samples
)
Step 4: DESeq2 Analysis
# Create DESeqDataSet from tximport
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)
# Pre-filter low count genes
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
# Set reference level
dds$condition <- relevel(dds$condition, ref = 'control')
# Run DESeq2
dds <- DESeq(dds)
# Get results with shrinkage
res <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')
# Summary
summary(res)
QC Checkpoint 3: Check DESeq2 diagnostics
- Dispersion plot shows expected trend
- PCA separates conditions
- No severe outliers in sample distances
Step 5: Visualization and Export
library(ggplot2)
library(pheatmap)
library(ggrepel)
# Volcano plot
res_df <- as.data.frame(res)
res_df$gene <- rownames(res_df)
res_df$significant <- res_df$padj < 0.05 & abs(res_df$log2FoldChange) > 1
ggplot(res_df, aes(x = log2FoldChange, y = -log10(pvalue), color = significant)) +
geom_point(alpha = 0.5) +
scale_color_manual(values = c('grey', 'red')) +
theme_minimal() +
labs(title = 'Volcano Plot', x = 'Log2 Fold Change', y = '-Log10 P-value')
# Heatmap of top genes
vsd <- vst(dds, blind = FALSE)
top_genes <- head(order(res$padj), 50)
pheatmap(assay(vsd)[top_genes,], scale = 'row', show_rownames = FALSE)
# Export significant genes
sig_genes <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
write.csv(as.data.frame(sig_genes), 'significant_genes.csv')
Alternative Path: STAR + featureCounts + DESeq2
Step 2 Alternative: STAR Alignment
# Build STAR index (once)
STAR --runMode genomeGenerate \
--genomeDir star_index \
--genomeFastaFiles genome.fa \
--sjdbGTFfile genes.gtf \
--sjdbOverhang 100 \
--runThreadN 8
# Align each sample
for sample in sample1 sample2 sample3; do
STAR --genomeDir star_index \
--readFilesIn trimmed/${sample}_R1.fq.gz trimmed/${sample}_R2.fq.gz \
--readFilesCommand zcat \
--outFileNamePrefix aligned/${sample}_ \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--runThreadN 8
done
Step 3 Alternative: featureCounts
# Count reads per gene
featureCounts -T 8 -p --countReadPairs \
-a genes.gtf \
-o counts.txt \
aligned/*_Aligned.sortedByCoord.out.bam
Step 4 Alternative: Load Counts Directly
# Load featureCounts output
counts <- read.table('counts.txt', header = TRUE, row.names = 1, skip = 1)
counts <- counts[, 6:ncol(counts)] # Remove annotation columns
colnames(counts) <- gsub('_Aligned.sortedByCoord.out.bam', '', colnames(counts))
# Create DESeqDataSet directly
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition)
Parameter Recommendations
| Step | Parameter | Recommendation |
|---|---|---|
| fastp | --qualified_quality_phred | 20 (standard) |
| fastp | --length_required | 35 for 2x100, 50 for 2x150 |
| Salmon | -l | A (auto-detect library type) |
| Salmon | --gcBias | Enable for better accuracy |
| STAR | --sjdbOverhang | read_length - 1 |
| featureCounts | -s | 0=unstranded, 1=stranded, 2=reversely stranded |
| DESeq2 | lfcShrink type | apeglm (recommended) |
| DESeq2 | alpha | 0.05 (standard significance) |
Troubleshooting
| Issue | Likely Cause | Solution |
|---|---|---|
| Low mapping rate (<50%) | Wrong reference, contamination | Check species, run FastQ Screen |
| High duplication | Low complexity library, over-sequencing | Check library prep, may be normal for low-input |
| No DE genes | Low power, batch effects | Add replicates, include batch in design |
| All genes DE | Normalization issue, sample swap | Check sample metadata, rerun normalization |
| Outlier samples | Technical failure, sample swap | Remove or investigate, check PCA |
Complete Bash Pipeline Script
#!/bin/bash
set -e
THREADS=8
SAMPLES="sample1 sample2 sample3 sample4 sample5 sample6"
SALMON_INDEX="salmon_index"
OUTDIR="results"
mkdir -p ${OUTDIR}/{trimmed,quants,qc}
# Step 1: QC and trim
for sample in $SAMPLES; do
fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
-o ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
-O ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
--detect_adapter_for_pe \
--html ${OUTDIR}/qc/${sample}_fastp.html \
-w ${THREADS}
done
# Step 2: Quantify
for sample in $SAMPLES; do
salmon quant -i ${SALMON_INDEX} -l A \
-1 ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
-2 ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
-o ${OUTDIR}/quants/${sample} \
--validateMappings --gcBias -p ${THREADS}
done
echo "Quantification complete. Run R script for DE analysis."
Complete R Analysis Script
library(tximport)
library(DESeq2)
library(apeglm)
library(ggplot2)
library(pheatmap)
# Configuration
samples <- c('sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6')
conditions <- c('control', 'control', 'control', 'treated', 'treated', 'treated')
quant_dir <- 'results/quants'
# Import
tx2gene <- read.csv('tx2gene.csv')
files <- file.path(quant_dir, samples, 'quant.sf')
names(files) <- samples
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene)
# DESeq2
coldata <- data.frame(condition = factor(conditions), row.names = samples)
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)
dds <- dds[rowSums(counts(dds)) >= 10,]
dds$condition <- relevel(dds$condition, ref = 'control')
dds <- DESeq(dds)
# Results
res <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')
sig <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
cat('Significant genes:', nrow(sig), '\n')
write.csv(as.data.frame(sig), 'significant_genes.csv')
Related Skills
- read-qc/fastp-workflow - Detailed QC options and parameters
- rna-quantification/alignment-free-quant - Salmon and kallisto details
- rna-quantification/tximport-workflow - tximport options and tx2gene creation
- differential-expression/deseq2-basics - Complete DESeq2 reference
- differential-expression/de-visualization - Advanced visualization options
- pathway-analysis/go-enrichment - Next step: functional enrichment
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
vcf-annotator
Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.
chemist-analyst
Analyzes events through chemistry lens using molecular structure, reaction mechanisms, thermodynamics, kinetics, and analytical techniques (spectroscopy, chromatography, mass spectrometry). Provides insights on chemical processes, material properties, reaction pathways, synthesis, and analytical methods. Use when: Chemical reactions, material analysis, synthesis planning, process optimization, environmental chemistry. Evaluates: Molecular structure, reaction mechanisms, yield, selectivity, safety, environmental impact.
bio-alignment-io
Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
sleep-analyzer
分析睡眠数据、识别睡眠模式、评估睡眠质量,并提供个性化睡眠改善建议。支持与其他健康数据的关联分析。
metabolomics-workbench-database
Access NIH Metabolomics Workbench via REST API (4,200+ studies). Query metabolites, RefMet nomenclature, MS/NMR data, m/z searches, study metadata, for metabolomics and biomarker discovery.
bio-hi-c-analysis-matrix-operations
Balance, normalize, and transform Hi-C contact matrices using cooler and cooltools. Apply iterative correction (ICE), compute expected values, and generate observed/expected matrices. Use when normalizing or transforming Hi-C matrices.
Didn't find tool you were looking for?