Agent skill
bio-workflows-metagenomics-pipeline
Install this agent skill to your Project
npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/bio-workflows-metagenomics-pipeline
SKILL.md
name: bio-workflows-metagenomics-pipeline
description: End-to-end metagenomics workflow from FASTQ to taxonomic and functional profiles. Covers Kraken2 classification, Bracken abundance estimation, and HUMAnN functional profiling. Use when profiling metagenomic samples.
tool_type: cli
primary_tool: Kraken2
workflow: true
depends_on:
  - read-qc/fastp-workflow
  - metagenomics/kraken-classification
  - metagenomics/metaphlan-profiling
  - metagenomics/abundance-estimation
  - metagenomics/functional-profiling
  - metagenomics/metagenome-visualization
qc_checkpoints:
  - after_qc: "Q30 >80%, host reads removed"
  - after_classification: "Classification rate >60%, known taxa dominant"
  - after_functional: "Pathway coverage reasonable, unmapped <50%"
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
  - read_file
  - run_shell_command
Metagenomics Pipeline
Complete workflow from metagenomic FASTQ to taxonomic and functional profiles.
Workflow Overview
FASTQ files
    |
    v
[1. QC & Host Removal] --> fastp + Bowtie2
    |
    v
[2. Taxonomic Classification]
    |
    +---> Kraken2 + Bracken (fast, database-dependent)
    |
    +---> MetaPhlAn (marker-based, standardized)
    |
    v
[3. Functional Profiling] --> HUMAnN
    |
    v
Taxonomic profiles + Pathway abundances
Primary Path: Kraken2 + Bracken + HUMAnN
Step 1: Quality Control and Host Removal
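# Create the output directories used by the steps in this workflow
mkdir -p trimmed host_removed qc kraken bracken metaphlan humann
# QC with fastp
for sample in sample1 sample2 sample3; do
    fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
        -o trimmed/${sample}_R1.fq.gz -O trimmed/${sample}_R2.fq.gz \
        --detect_adapter_for_pe \
        --qualified_quality_phred 20 \
        --length_required 50 \
        --html qc/${sample}_fastp.html \
        --json qc/${sample}_fastp.json
done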
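# Remove host reads (human example)
# --un-conc-gz keeps read pairs that did not align concordantly to the host;
# Bowtie2 replaces % with 1 and 2 for the two mates. The SAM output is discarded.
for sample in sample1 sample2 sample3; do
    bowtie2 -p 8 -x human_index \
        -1 trimmed/${sample}_R1.fq.gz \
        -2 trimmed/${sample}_R2.fq.gz \
        --un-conc-gz host_removed/${sample}_R%.fq.gz \
        > /dev/null 2> qc/${sample}_host_removal.log
done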
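The `after_qc` checkpoint in the skill metadata asks for Q30 >80% and host reads removed. The following is a minimal sketch of how both numbers could be read back from the tool outputs. It assumes a fastp JSON report is available (fastp's `--json` option) and that the Bowtie2 log ends with its usual "overall alignment rate" summary line. The function names and the inline example data are hypothetical, not part of the skill.

```python
import json
import re


def q30_rate_after_filtering(fastp_json_text):
    """Return the post-filtering Q30 fraction (0-1) from a fastp JSON report."""
    report = json.loads(fastp_json_text)
    return report["summary"]["after_filtering"]["q30_rate"]


def host_alignment_rate(bowtie2_log_text):
    """Return the percentage of reads Bowtie2 aligned to the host index."""
    match = re.search(r"([\d.]+)% overall alignment rate", bowtie2_log_text)
    if match is None:
        raise ValueError("no 'overall alignment rate' line found in log")
    return float(match.group(1))


# Tiny inline examples standing in for real report files (made-up values).
example_json = '{"summary": {"after_filtering": {"q30_rate": 0.93}}}'
example_log = "1000 reads; of these:\n4.20% overall alignment rate\n"

q30 = q30_rate_after_filtering(example_json)
host_pct = host_alignment_rate(example_log)
print(f"Q30 after filtering: {q30:.1%} (checkpoint: >80%)")
print(f"Reads aligned to host: {host_pct:.2f}%")
```

In practice the two strings would be read from `qc/${sample}_fastp.json` and `qc/${sample}_host_removal.log`.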
Step 2A: Kraken2 Classification
# Classify reads
for sample in sample1 sample2 sample3; do
    kraken2 --db kraken2_db \
        --threads 8 \
        --paired \
        --report kraken/${sample}.report \
        --output kraken/${sample}.output \
        host_removed/${sample}_R1.fq.gz \
        host_removed/${sample}_R2.fq.gz
done
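The `after_classification` checkpoint expects a classification rate above 60%. As a rough sketch, that rate can be computed from the Kraken2 report, assuming the standard six-column tab-separated layout (percent, clade reads, direct reads, rank code, taxid, name) with an `unclassified` row (rank `U`) and a `root` row (rank `R`). The function name and the example report are illustrative only.

```python
def classification_rate(report_text):
    """Return the percentage of reads Kraken2 classified, from a --report file.

    Assumes the standard columns: percent, clade reads, direct reads,
    rank code, NCBI taxid, indented scientific name.
    """
    unclassified = 0
    classified = 0
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clade_reads = int(fields[1])
        rank_code = fields[3].strip()
        name = fields[5].strip()
        if rank_code == "U":
            unclassified = clade_reads
        elif rank_code == "R" and name == "root":
            classified = clade_reads
    total = classified + unclassified
    if total == 0:
        raise ValueError("report contains no reads")
    return 100.0 * classified / total


# Made-up three-line report: 300 unclassified reads, 700 under root.
example_report = (
    " 30.00\t300\t300\tU\t0\tunclassified\n"
    " 70.00\t700\t5\tR\t1\troot\n"
    " 69.50\t695\t0\tD\t2\t    Bacteria\n"
)
rate = classification_rate(example_report)
print(f"Classified: {rate:.1f}% (checkpoint: >60%)")
```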
Step 2B: Bracken Abundance Estimation
# Estimate species abundance
# -r must match a read length the Bracken database files were built for
# (the database directory needs database150mers.kmer_distrib for -r 150)
for sample in sample1 sample2 sample3; do
    bracken -d kraken2_db \
        -i kraken/${sample}.report \
        -o bracken/${sample}.species.txt \
        -r 150 \
        -l S \
        -t 10
done
# Combine samples into abundance matrix
combine_bracken_outputs.py \
    --files bracken/*.species.txt \
    -o bracken/combined_species.txt
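To make the combine step concrete, here is a small standard-library sketch of the same idea: read each per-sample Bracken table, take the `new_est_reads` column, and fill species missing from a sample with zero. It is an illustration under those assumptions, not the implementation of `combine_bracken_outputs.py`. The helper names and the counts in the example are made up.

```python
import csv
import io


def read_bracken(handle):
    """Map species name -> estimated read count from one Bracken output table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["name"]: int(row["new_est_reads"]) for row in reader}


def build_matrix(per_sample):
    """Combine {sample: {species: reads}} into (samples, [(species, counts)])."""
    samples = sorted(per_sample)
    species = sorted({name for counts in per_sample.values() for name in counts})
    # Species absent from a sample get a count of 0.
    rows = [(sp, [per_sample[s].get(sp, 0) for s in samples]) for sp in species]
    return samples, rows


# Made-up Bracken tables for two samples.
header = ("name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\t"
          "added_reads\tnew_est_reads\tfraction_total_reads\n")
sample1 = header + ("Escherichia coli\t562\tS\t900\t100\t1000\t0.80\n"
                    "Bacteroides fragilis\t817\tS\t200\t50\t250\t0.20\n")
sample2 = header + "Escherichia coli\t562\tS\t400\t100\t500\t1.00\n"

per_sample = {
    "sample1": read_bracken(io.StringIO(sample1)),
    "sample2": read_bracken(io.StringIO(sample2)),
}
samples, rows = build_matrix(per_sample)
print("species\t" + "\t".join(samples))
for name, counts in rows:
    print(name + "\t" + "\t".join(str(c) for c in counts))
```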
Step 2C: Alternative - MetaPhlAn Profiling
# Profile with MetaPhlAn 4
for sample in sample1 sample2 sample3; do
    metaphlan host_removed/${sample}_R1.fq.gz,host_removed/${sample}_R2.fq.gz \
        --bowtie2out metaphlan/${sample}.bowtie2.bz2 \
        --input_type fastq \
        --nproc 8 \
        -o metaphlan/${sample}_profile.txt
done
# Merge profiles
merge_metaphlan_tables.py metaphlan/*_profile.txt > metaphlan/merged_abundance.txt
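The merged MetaPhlAn table lists every taxonomic level in one file, so downstream comparisons usually start by picking a single rank. A minimal sketch of extracting species-level rows follows, assuming the usual layout: `#` comment lines, a `clade_name` header row, and pipe-delimited lineages whose last element carries a rank prefix such as `s__`. The function name and the example table, including the `t__` label, are made up for illustration.

```python
def species_rows(merged_table_text):
    """Return (sample_names, {species: [abundances]}) for species-level clades."""
    samples = None
    result = {}
    for line in merged_table_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and database/version comments
        fields = line.split("\t")
        if fields[0] == "clade_name":
            samples = fields[1:]
            continue
        last_level = fields[0].split("|")[-1]
        # Keep rows whose deepest level is a species; t__ rows go one level deeper.
        if last_level.startswith("s__"):
            result[last_level[3:]] = [float(x) for x in fields[1:]]
    return samples, result


lineage = ("k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|"
           "f__Streptococcaceae|g__Streptococcus|s__Streptococcus_mitis")
example_table = (
    "#example_database_version\n"
    "clade_name\tsample1_profile\tsample2_profile\n"
    "k__Bacteria\t100.0\t100.0\n"
    "k__Bacteria|p__Firmicutes\t60.0\t30.0\n"
    + lineage + "\t60.0\t30.0\n"
    + lineage + "|t__SGB_example\t60.0\t30.0\n"
)
samples, species = species_rows(example_table)
print(samples)
print(species)
```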
Step 3: Functional Profiling with HUMAnN
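# Run HUMAnN
for sample in sample1 sample2 sample3; do
    # Concatenate paired reads (HUMAnN takes a single input file)
    cat host_removed/${sample}_R1.fq.gz host_removed/${sample}_R2.fq.gz > \
        host_removed/${sample}_concat.fq.gz
    # --output-basename names outputs ${sample}_*.tsv instead of ${sample}_concat_*.tsv
    humann --input host_removed/${sample}_concat.fq.gz \
        --output humann/${sample} \
        --output-basename ${sample} \
        --threads 8 \
        --metaphlan-options "--bowtie2db metaphlan_db"
done
# Normalize each sample to copies per million (CPM)
for sample in sample1 sample2 sample3; do
    humann_renorm_table --input humann/${sample}/${sample}_pathabundance.tsv \
        --output humann/${sample}/${sample}_pathabundance_cpm.tsv \
        --units cpm
done
# Join the normalized tables; --search-subdirectories is needed because
# each sample's output sits in its own directory under humann/
humann_join_tables --input humann \
    --output humann/merged_pathabundance.tsv \
    --file_name pathabundance_cpm \
    --search-subdirectories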
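The `after_functional` checkpoint asks for unmapped <50%. One way to approximate that figure, sketched below, is to take the `UNMAPPED` row of a HUMAnN table as a share of all community-level (unstratified) rows. This assumes the usual table layout: a `#` header line, `UNMAPPED` and `UNINTEGRATED` rows, and stratified rows marked with `|`. The function name, pathway label, and values are invented for the example.

```python
def unmapped_fraction(table_text):
    """Return UNMAPPED as a fraction of the community-level total in a HUMAnN table."""
    unmapped = 0.0
    total = 0.0
    for line in table_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip the header line
        feature, value = line.split("\t")[:2]
        if "|" in feature:
            continue  # stratified rows repeat what the community rows already count
        total += float(value)
        if feature == "UNMAPPED":
            unmapped = float(value)
    if total == 0:
        raise ValueError("table has no community-level abundance")
    return unmapped / total


# Made-up single-sample table.
example_table = (
    "# Pathway\tsample1_Abundance\n"
    "UNMAPPED\t200.0\n"
    "UNINTEGRATED\t500.0\n"
    "UNINTEGRATED|unclassified\t500.0\n"
    "PWY-EXAMPLE: made-up pathway\t300.0\n"
    "PWY-EXAMPLE: made-up pathway|unclassified\t300.0\n"
)
fraction = unmapped_fraction(example_table)
print(f"Unmapped: {fraction:.1%} (checkpoint: <50%)")
```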
Visualization
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
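# Load Bracken species table
species = pd.read_csv('bracken/combined_species.txt', sep='\t', index_col=0)
# Keep the per-sample read-count columns (*_num); the combined table also
# carries taxonomy_id, taxonomy_lvl and *_frac columns that must not be summed
species = species.filter(regex='_num$')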
# Top 20 species heatmap
top20 = species.sum(axis=1).nlargest(20).index
plt.figure(figsize=(12, 8))
sns.heatmap(species.loc[top20], cmap='viridis', annot=False)
plt.title('Top 20 Species Abundance')
plt.tight_layout()
plt.savefig('top20_species_heatmap.pdf')
# Stacked bar plot
species_norm = species.div(species.sum()) * 100
top10 = species_norm.sum(axis=1).nlargest(10).index
other = species_norm.loc[~species_norm.index.isin(top10)].sum()
plot_data = species_norm.loc[top10].T
plot_data['Other'] = other
plot_data.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.ylabel('Relative Abundance (%)')
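plt.legend(bbox_to_anchor=(1.05, 1))
plt.tight_layout()
# bbox_inches='tight' keeps the legend, which sits outside the axes, in the saved file
plt.savefig('species_barplot.pdf', bbox_inches='tight')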
Parameter Recommendations
| Step | Parameter | Value |
|---|---|---|
| fastp | --length_required | 50 (metagenomic reads) |
| Kraken2 | --confidence | 0.0 (default) or 0.1 |
| Bracken | -r | Read length (e.g., 150) |
| Bracken | -l | S (species) or G (genus) |
| Bracken | -t | 10 (min reads threshold) |
| MetaPhlAn | --min_cu_len | 2000 (default) |
| HUMAnN | --threads | 8+ |
Troubleshooting
| Issue | Likely Cause | Solution |
|---|---|---|
| Low classification rate | Database mismatch, novel organisms | Try different database, check sample type |
| High unclassified | Novel microbes, host contamination | Remove host, use larger database |
| High host reads | Incomplete host removal | Use multiple host reference genomes |
| HUMAnN slow | Large files | Increase threads, pre-filter reads |
Complete Pipeline Script
#!/bin/bash
# Stop on the first failed command, on unset variables, and on pipeline errors
set -euo pipefail
THREADS=8
KRAKEN_DB="kraken2_standard_db"
HOST_INDEX="human_bt2_index"
SAMPLES="sample1 sample2 sample3"
OUTDIR="metagenomics_results"
mkdir -p ${OUTDIR}/{trimmed,host_removed,kraken,bracken,metaphlan,humann,qc}
# Step 1: QC
echo "=== QC ==="
for sample in $SAMPLES; do
    fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
        -o ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
        -O ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
        --length_required 50 \
        --html ${OUTDIR}/qc/${sample}_fastp.html \
        --json ${OUTDIR}/qc/${sample}_fastp.json -w ${THREADS}
done
# Host removal
echo "=== Host Removal ==="
for sample in $SAMPLES; do
    bowtie2 -p ${THREADS} -x ${HOST_INDEX} \
        -1 ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
        -2 ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
        --un-conc-gz ${OUTDIR}/host_removed/${sample}_R%.fq.gz \
        > /dev/null 2> ${OUTDIR}/qc/${sample}_host.log
done
# Step 2: Kraken2
echo "=== Kraken2 ==="
for sample in $SAMPLES; do
    kraken2 --db ${KRAKEN_DB} --threads ${THREADS} --paired \
        --report ${OUTDIR}/kraken/${sample}.report \
        --output ${OUTDIR}/kraken/${sample}.output \
        ${OUTDIR}/host_removed/${sample}_R1.fq.gz \
        ${OUTDIR}/host_removed/${sample}_R2.fq.gz
done
# Bracken
echo "=== Bracken ==="
for sample in $SAMPLES; do
    bracken -d ${KRAKEN_DB} \
        -i ${OUTDIR}/kraken/${sample}.report \
        -o ${OUTDIR}/bracken/${sample}.species.txt \
        -r 150 -l S -t 10
done
# MetaPhlAn (Step 2C) and HUMAnN (Step 3) are not run by this script;
# their output directories are created above so those steps can be added here.
echo "=== Pipeline Complete ==="
echo "Kraken reports: ${OUTDIR}/kraken/"
echo "Bracken abundances: ${OUTDIR}/bracken/"
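Because the script stops at the first failing command, a missing FASTQ file for a later sample only surfaces after earlier samples have already been processed. A pre-flight check along the following lines could catch that up front. It is a sketch that assumes the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming used in the script; the helper name is hypothetical.

```python
import os
import tempfile


def missing_inputs(samples, directory="."):
    """Return expected paired FASTQ paths that do not exist for the given samples."""
    missing = []
    for sample in samples:
        for mate in ("R1", "R2"):
            path = os.path.join(directory, f"{sample}_{mate}.fastq.gz")
            if not os.path.isfile(path):
                missing.append(path)
    return missing


# Demonstration in a throwaway directory: sample1 is complete, sample2 lacks R2.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample1_R1.fastq.gz", "sample1_R2.fastq.gz", "sample2_R1.fastq.gz"):
        open(os.path.join(tmp, name), "w").close()
    missing = missing_inputs(["sample1", "sample2"], tmp)
    missing_names = [os.path.basename(p) for p in missing]
    print("Missing inputs:", missing_names)
```

Running such a check before the pipeline and exiting when the list is non-empty avoids partial result directories.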
Related Skills
- metagenomics/kraken-classification - Kraken2 details
- metagenomics/metaphlan-profiling - MetaPhlAn parameters
- metagenomics/abundance-estimation - Bracken options
- metagenomics/functional-profiling - HUMAnN workflow
- metagenomics/metagenome-visualization - Plotting functions