bio-genome-assembly-contamination-detection
name: bio-genome-assembly-contamination-detection
description: Detect contamination and assess genome quality using CheckM, CheckM2, GTDB-Tk, and GUNC for metagenome-assembled genomes and isolate assemblies. Use when checking assemblies for contamination.
tool_type: cli
primary_tool: CheckM2
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
Contamination Detection
CheckM2 (Recommended)
# Run CheckM2 on a single genome
checkm2 predict --input assembly.fa --output-directory checkm2_output --threads 16
# Run on multiple genomes (directory of FASTAs)
checkm2 predict --input genomes/ --output-directory checkm2_output \
--threads 16 --extension fa
# Output: quality_report.tsv with Completeness, Contamination, Coding_Density
Interpret CheckM2 Results
# quality_report.tsv columns:
# Name, Completeness, Contamination, Completeness_Model_Used,
# Translation_Table_Used, Coding_Density, Contig_N50, Average_Gene_Length,
# Genome_Size, GC_Content, Total_Coding_Sequences
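# Filter high-quality genomes (MIMAG completeness/contamination cut-offs;
# the full MIMAG standard additionally requires rRNA and tRNA genes)
awk -F'\t' 'NR==1 || ($2 > 90 && $3 < 5)' checkm2_output/quality_report.tsv > high_quality_mags.tsv
# Medium quality or better (this set also contains the high-quality genomes)
awk -F'\t' 'NR==1 || ($2 >= 50 && $3 < 10)' checkm2_output/quality_report.tsv > medium_quality_mags.tsv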
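The awk filters above address columns by position. As a minimal sketch of the same cut-offs applied by column name, the following uses only the Python standard library; the function names `mimag_tier` and `tier_report` and the example rows are made up for illustration and are not part of CheckM2.

```python
import csv
import io


def mimag_tier(completeness: float, contamination: float) -> str:
    """Return 'high', 'medium', or 'low' using the cut-offs from the awk filters."""
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"


def tier_report(tsv_text: str) -> dict:
    """Map each genome name to its tier, reading columns by header name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["Name"]: mimag_tier(float(row["Completeness"]), float(row["Contamination"]))
        for row in reader
    }


# Made-up example rows, not real CheckM2 output.
example = (
    "Name\tCompleteness\tContamination\n"
    "bin_1\t98.2\t1.1\n"
    "bin_2\t72.5\t6.3\n"
    "bin_3\t41.0\t0.5\n"
)
print(tier_report(example))
```

To use it on a real report, pass the contents of `quality_report.tsv` as a string, for example via `open(path).read()`.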
CheckM (Original)
# Run CheckM lineage workflow
checkm lineage_wf -t 16 -x fa genomes/ checkm_output/
# Generate summary (output format 1: completeness, contamination, strain heterogeneity)
checkm qa checkm_output/lineage.ms checkm_output/ -o 1 -f checkm_summary.tsv --tab_table
# Extended report (output format 2: adds genome statistics and marker counts)
checkm qa checkm_output/lineage.ms checkm_output/ -o 2 --tab_table \
-f checkm_extended.tsv
CheckM Plots
# Completeness vs Contamination plot
checkm bin_qa_plot -x fa checkm_output/ genomes/ plots/
# Coding density plot (the trailing 95 is the required dist_value, the reference distribution percentile to draw)
checkm coding_plot -x fa checkm_output/ genomes/ plots/ 95
# Marker gene positions
checkm marker_plot -x fa checkm_output/ genomes/ plots/
GTDB-Tk Taxonomic Classification
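# Classify genomes with the ANI pre-screen
# (GTDB-Tk >= 2.2 requires either --mash_db or --skip_ani_screen; the .msh path is an example)
gtdbtk classify_wf --genome_dir genomes/ --out_dir gtdbtk_output \
--extension fa --cpus 16 --mash_db gtdbtk_mash.msh
# Skip the ANI pre-screen and classify every genome by tree placement
gtdbtk classify_wf --genome_dir genomes/ --out_dir gtdbtk_output \
--extension fa --cpus 16 --skip_ani_screen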
# Output files:
# gtdbtk.bac120.summary.tsv - bacterial classifications
# gtdbtk.ar53.summary.tsv - archaeal classifications
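The `classification` column of the summary files holds a semicolon-separated lineage with rank prefixes (`d__`, `p__`, and so on down to `s__`). A minimal sketch for splitting such a string into named ranks follows; the helper name `parse_gtdb_lineage` and the example lineage are illustrative assumptions, not part of GTDB-Tk.

```python
RANK_PREFIXES = {
    "d": "domain",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def parse_gtdb_lineage(classification: str) -> dict:
    """Split 'd__X;p__Y;...' into {'domain': 'X', 'phylum': 'Y', ...}.

    A rank with an empty name (for example 's__') is kept as an empty string,
    which usually means the genome was not assigned at that rank.
    """
    ranks = {}
    for field in classification.split(";"):
        prefix, sep, name = field.strip().partition("__")
        if sep and prefix in RANK_PREFIXES:
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks


# Illustrative lineage string with no species assignment.
lineage = (
    "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
    "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__"
)
parsed = parse_gtdb_lineage(lineage)
print(parsed["genus"], repr(parsed["species"]))
```

An empty `species` value is a quick way to list genomes that may represent novel species and deserve a closer look.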
GTDB-Tk De Novo Workflow
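# When genomes may include novel taxa
# de_novo_wf also requires --outgroup_taxon; p__Patescibacteria is only an example, choose one suited to your data
gtdbtk de_novo_wf --genome_dir genomes/ --out_dir gtdbtk_denovo \
--bacteria --outgroup_taxon p__Patescibacteria --extension fa --cpus 16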
GUNC Chimerism Detection
# Run GUNC
gunc run -d genomes/ -o gunc_output -t 16 -e .fa
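# Output: gunc_output/GUNC.progenomes_2.1.maxCSS_level.tsv
# Key columns: pass.GUNC (True/False), contamination_portion, clade_separation_score
# Filter chimeric genomes (locate the pass.GUNC column by name instead of a fixed position)
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i == "pass.GUNC") c=i; print; next} $c == "False"' \
gunc_output/GUNC.progenomes_2.1.maxCSS_level.tsv > chimeric_genomes.tsv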
GUNC Interpretation
# GUNC sets pass.GUNC to False (likely chimeric) when:
# - clade_separation_score (CSS) > 0.45
# Treat the call with more confidence when also:
# - contamination_portion > 0.05
# - reference_representation_score > 0.5
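# Combine with CheckM2 for full QC
# Strip the header rows and sort both tables on the genome name (column 1) before joining;
# combined_qc.tsv therefore has no header row
LC_ALL=C join -t$'\t' -1 1 -2 1 \
<(tail -n +2 checkm2_output/quality_report.tsv | LC_ALL=C sort -t$'\t' -k1,1) \
<(tail -n +2 gunc_output/GUNC.progenomes_2.1.maxCSS_level.tsv | LC_ALL=C sort -t$'\t' -k1,1) \
> combined_qc.tsv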
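The three cut-offs listed above can be checked per genome. The sketch below reports which of them a GUNC row exceeds and leaves the decision about how to combine them to the reader; `THRESHOLDS`, `exceeded_thresholds`, and the example row are hypothetical names and values, not part of GUNC.

```python
# Cut-offs copied from the interpretation notes above.
THRESHOLDS = {
    "clade_separation_score": 0.45,
    "contamination_portion": 0.05,
    "reference_representation_score": 0.5,
}


def exceeded_thresholds(row: dict) -> list:
    """Return the names of the GUNC columns whose value is above its cut-off.

    Missing or non-numeric values are skipped rather than treated as failures.
    """
    exceeded = []
    for column, cutoff in THRESHOLDS.items():
        raw = row.get(column) or ""
        try:
            value = float(raw)
        except ValueError:
            continue
        if value > cutoff:
            exceeded.append(column)
    return exceeded


# Made-up row in the shape produced by csv.DictReader on the GUNC table.
row = {
    "genome": "bin_7",
    "clade_separation_score": "0.62",
    "contamination_portion": "0.02",
    "reference_representation_score": "0.81",
}
print(exceeded_thresholds(row))
```

Rows can be obtained with `csv.DictReader(handle, delimiter="\t")` on the GUNC output file.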
Comprehensive QC Pipeline
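#!/bin/bash
# Stop on the first failing command, unset variable, or broken pipe
set -euo pipefail
GENOMES_DIR=${1:?usage: $0 <genomes_dir> <output_dir> [threads]}
OUTPUT_DIR=${2:?usage: $0 <genomes_dir> <output_dir> [threads]}
THREADS=${3:-16}
# Create output directories up front (harmless if a tool would create its own)
mkdir -p "$OUTPUT_DIR" "$OUTPUT_DIR/gunc"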
# Run CheckM2
echo "Running CheckM2..."
checkm2 predict --input "$GENOMES_DIR" --output-directory "$OUTPUT_DIR/checkm2" \
--threads "$THREADS" --extension fa
# Run GUNC
echo "Running GUNC..."
gunc run -d "$GENOMES_DIR" -o "$OUTPUT_DIR/gunc" -t "$THREADS" -e .fa
# Run GTDB-Tk (GTDB-Tk >= 2.2 requires either --mash_db or --skip_ani_screen)
echo "Running GTDB-Tk..."
gtdbtk classify_wf --genome_dir "$GENOMES_DIR" --out_dir "$OUTPUT_DIR/gtdbtk" \
--extension fa --cpus "$THREADS" --skip_ani_screen
echo "QC complete!"
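The pipeline script calls three executables and only discovers a missing one when it reaches that step. A small pre-flight check, sketched here with the standard library, can report missing tools before any long-running job starts; `REQUIRED_TOOLS` and `missing_tools` are illustrative names.

```python
import shutil

# Executables invoked by the pipeline script above.
REQUIRED_TOOLS = ("checkm2", "gunc", "gtdbtk")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which) -> list:
    """Return the tools that cannot be found on PATH.

    The `which` parameter exists only so the lookup can be swapped out in tests.
    """
    return [tool for tool in tools if which(tool) is None]


absent = missing_tools()
if absent:
    print("Missing tools:", ", ".join(absent))
else:
    print("All QC tools found on PATH")
```

This only confirms that the executables exist; it does not verify that their reference databases are installed.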
Filter by Quality Standards
import pandas as pd
checkm = pd.read_csv('checkm2_output/quality_report.tsv', sep='\t')
gunc = pd.read_csv('gunc_output/GUNC.progenomes_2.1.maxCSS_level.tsv', sep='\t')
merged = checkm.merge(gunc, left_on='Name', right_on='genome', how='left')
# MIMAG high-quality thresholds (>90% complete, <5% contamination) plus a GUNC pass (not chimeric)
hq = merged[(merged['Completeness'] > 90) &
(merged['Contamination'] < 5) &
(merged['pass.GUNC'] == True)]
# MIMAG medium quality or better: >=50% complete, <10% contamination (includes the high-quality set)
mq = merged[(merged['Completeness'] >= 50) &
(merged['Contamination'] < 10)]
hq.to_csv('high_quality_genomes.tsv', sep='\t', index=False)
mq.to_csv('medium_quality_genomes.tsv', sep='\t', index=False)
Remove Contamination
# Use MAGpurify to remove contaminating contigs
magpurify phylo-markers genome.fa magpurify_output
magpurify clade-markers genome.fa magpurify_output
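# conspecific takes a Mash sketch of reference genomes as a third argument (ref_genomes.msh is a placeholder)
magpurify conspecific genome.fa magpurify_output ref_genomes.msh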
magpurify tetra-freq genome.fa magpurify_output
magpurify gc-content genome.fa magpurify_output
magpurify known-contam genome.fa magpurify_output
magpurify clean-bin genome.fa magpurify_output cleaned_genome.fa
Detect Foreign Contigs
# Contig-level taxonomy with CAT
CAT contigs -c assembly.fa -d CAT_database -t CAT_taxonomy \
-o cat_output -n 16
# Parse results
CAT add_names -i cat_output.contig2classification.txt \
-o cat_output.contig2classification.named.txt \
-t CAT_taxonomy --only_official
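# Count contigs per phylum to spot minority (potentially foreign) taxa
# (column 7 is phylum in the --only_official layout; check the header line of your CAT version)
awk -F'\t' 'NR > 1 {sub(/: *[0-9.]+$/, "", $7); print $7}' cat_output.contig2classification.named.txt | \
sort | uniq -c | sort -rn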
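Once each contig has a taxon label, flagging contigs that disagree with the majority is a small counting exercise. The sketch below assumes the labels have already been extracted into a dictionary; `flag_minority_contigs` and the example data are hypothetical. It weights every contig equally, so a length-weighted vote may be preferable for fragmented assemblies, and unclassified contigs are flagged like any other minority label.

```python
from collections import Counter


def flag_minority_contigs(contig_to_taxon: dict) -> tuple:
    """Return (majority_taxon, sorted contigs whose taxon differs from it).

    Returns (None, []) for an empty input.
    """
    if not contig_to_taxon:
        return None, []
    counts = Counter(contig_to_taxon.values())
    majority, _ = counts.most_common(1)[0]
    flagged = sorted(
        contig for contig, taxon in contig_to_taxon.items() if taxon != majority
    )
    return majority, flagged


# Made-up phylum assignments for five contigs.
assignments = {
    "contig_1": "Pseudomonadota",
    "contig_2": "Pseudomonadota",
    "contig_3": "Bacillota",
    "contig_4": "Pseudomonadota",
    "contig_5": "no support",
}
majority, flagged = flag_minority_contigs(assignments)
print(majority, flagged)
```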
Decontaminate with BlobTools
# Create BlobDB
blobtools create -i assembly.fa -b aligned.bam -t blast_hits.txt \
-o blobtools_output
# Generate plots
blobtools plot -i blobtools_output.blobDB.json
# Export a per-contig table with taxonomy at all ranks; filter this table to choose contigs to remove
blobtools view -i blobtools_output.blobDB.json -r all -o filtered
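After contaminant contigs have been identified from the table, they still need to be removed from the assembly. A minimal sketch of that last step, using only the standard library, is shown here; `drop_contigs` is a hypothetical helper, and it matches contigs on the first whitespace-delimited token of each FASTA header.

```python
def drop_contigs(fasta_text: str, contig_ids: set) -> str:
    """Return the FASTA text without the records whose ID is in contig_ids."""
    kept = []
    keep = True
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            # Records with an empty header are kept untouched.
            keep = not header or header[0] not in contig_ids
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Tiny made-up assembly with three contigs.
fasta = ">c1 len=4\nACGT\n>c2\nGGCC\nTTAA\n>c3\nAAAA\n"
cleaned = drop_contigs(fasta, {"c2"})
print(cleaned, end="")
```

For large assemblies, reading and writing line by line from file handles avoids holding the whole FASTA in memory.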
Related Skills
- genome-assembly/assembly-qc - BUSCO and other QC
- genome-assembly/long-read-assembly - Assembly methods
- metagenomics/taxonomic-profiling - Metagenome analysis