Agent skill
gatk-variant-calling
Install this agent skill to your Project
npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/variant-interpretation-acmg/bioSkills/gatk-variant-calling
SKILL.md
name: bio-gatk-variant-calling
description: Variant calling with GATK HaplotypeCaller following best practices. Covers germline SNP/indel calling, GVCF workflow for cohorts, joint genotyping, and variant quality score recalibration (VQSR). Use when calling variants with GATK HaplotypeCaller.
tool_type: cli
primary_tool: gatk
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
GATK Variant Calling
GATK HaplotypeCaller is a widely used germline SNP and indel caller that works by local reassembly of haplotypes in active regions. This skill covers the GATK Best Practices workflow.
Prerequisites
BAM files should be preprocessed:
- Mark duplicates
- Base quality score recalibration (BQSR) - optional but recommended
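Beyond the preprocessing steps listed above, GATK tools also expect a few companion files: a FASTA index (`.fai`) and sequence dictionary (`.dict`) next to the reference, and an index for the BAM. The following is a minimal pre-flight sketch, not part of the original skill. The function name `check_gatk_inputs` is hypothetical, and the file names it checks (`reference.fa.fai`, `reference.dict`, `sample.bam.bai` or `sample.bai`) are the common default conventions, assumed here.

```shell
# Hypothetical helper: verify that the companion files GATK expects exist.
# Usage: check_gatk_inputs reference.fa sample.bam
check_gatk_inputs() {
    ref="$1"
    bam="$2"
    missing=0
    # Reference FASTA, its .fai index, its .dict (extension replaced), and the BAM.
    for f in "$ref" "$ref.fai" "${ref%.*}.dict" "$bam"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    # Accept either sample.bam.bai or sample.bai as the BAM index.
    if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
        echo "missing: index for $bam" >&2
        missing=1
    fi
    # Returns 0 when everything is present, 1 otherwise.
    return "$missing"
}
```

If a file is reported missing, `samtools faidx reference.fa`, `gatk CreateSequenceDictionary -R reference.fa`, and `samtools index sample.bam` are the usual ways to create the respective companions.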
Single-Sample Calling
Basic HaplotypeCaller
gatk HaplotypeCaller \
-R reference.fa \
-I sample.bam \
-O sample.vcf.gz
With Standard Annotations
gatk HaplotypeCaller \
-R reference.fa \
-I sample.bam \
-O sample.vcf.gz \
-A Coverage \
-A QualByDepth \
-A FisherStrand \
-A StrandOddsRatio \
-A MappingQualityRankSumTest \
-A ReadPosRankSumTest
Target Intervals (Exome/Panel)
gatk HaplotypeCaller \
-R reference.fa \
-I sample.bam \
-L targets.interval_list \
-O sample.vcf.gz
Adjust Calling Confidence
gatk HaplotypeCaller \
-R reference.fa \
-I sample.bam \
-O sample.vcf.gz \
--standard-min-confidence-threshold-for-calling 20
GVCF Workflow (Recommended for Cohorts)
The GVCF workflow enables joint genotyping across samples for better variant calls.
Step 1: Generate GVCFs per Sample
gatk HaplotypeCaller \
-R reference.fa \
-I sample.bam \
-O sample.g.vcf.gz \
-ERC GVCF
Step 2: Combine GVCFs (GenomicsDBImport)
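# Create a sample map file: one line per sample, with the sample name
# and the GVCF path separated by a single TAB character
# sample_map.txt:
# sample1<TAB>/path/to/sample1.g.vcf.gz
# sample2<TAB>/path/to/sample2.g.vcf.gz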
gatk GenomicsDBImport \
--genomicsdb-workspace-path genomicsdb \
--sample-name-map sample_map.txt \
-L intervals.interval_list
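The sample map can be generated rather than written by hand. The sketch below assumes every GVCF in a directory is named `<sample>.g.vcf.gz` and that the file-name prefix is the sample name you want in the map. Both are assumptions, and `make_sample_map` is a hypothetical helper name.

```shell
# Hypothetical helper: print one "sample<TAB>path" line per GVCF in a directory.
# Assumes files are named <sample>.g.vcf.gz.
# Usage: make_sample_map /path/to/gvcfs > sample_map.txt
make_sample_map() {
    dir="$1"
    for f in "$dir"/*.g.vcf.gz; do
        # Skip the unexpanded pattern when the directory has no GVCFs.
        [ -e "$f" ] || continue
        name="$(basename "$f" .g.vcf.gz)"
        printf '%s\t%s\n' "$name" "$f"
    done
}
```

The output is sorted by file name because the shell expands the glob in sorted order.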
Alternative: CombineGVCFs (smaller cohorts)
gatk CombineGVCFs \
-R reference.fa \
-V sample1.g.vcf.gz \
-V sample2.g.vcf.gz \
-V sample3.g.vcf.gz \
-O cohort.g.vcf.gz
Step 3: Joint Genotyping
# From GenomicsDB
gatk GenotypeGVCFs \
-R reference.fa \
-V gendb://genomicsdb \
-O cohort.vcf.gz
# From combined GVCF
gatk GenotypeGVCFs \
-R reference.fa \
-V cohort.g.vcf.gz \
-O cohort.vcf.gz
Variant Quality Score Recalibration (VQSR)
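VQSR filters variants with a machine-learning model trained on known variant sites. It needs a large callset to train reliably; GATK guidance commonly cites a minimum of about 30 exomes, with whole-genome data preferred.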
SNP Recalibration
# Build SNP model
gatk VariantRecalibrator \
-R reference.fa \
-V cohort.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
--resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode SNP \
-O snp.recal \
--tranches-file snp.tranches
# Apply SNP filter
gatk ApplyVQSR \
-R reference.fa \
-V cohort.vcf.gz \
-O cohort.snp_recal.vcf.gz \
--recal-file snp.recal \
--tranches-file snp.tranches \
--truth-sensitivity-filter-level 99.5 \
-mode SNP
Indel Recalibration
# Build Indel model
gatk VariantRecalibrator \
-R reference.fa \
-V cohort.snp_recal.vcf.gz \
--resource:mills,known=false,training=true,truth=true,prior=12.0 Mills.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode INDEL \
--max-gaussians 4 \
-O indel.recal \
--tranches-file indel.tranches
# Apply Indel filter
gatk ApplyVQSR \
-R reference.fa \
-V cohort.snp_recal.vcf.gz \
-O cohort.vqsr.vcf.gz \
--recal-file indel.recal \
--tranches-file indel.tranches \
--truth-sensitivity-filter-level 99.0 \
-mode INDEL
Hard Filtering (When VQSR Not Suitable)
Use hard filters for small datasets, exomes, or single samples, where VQSR does not have enough variants to train a reliable model and fails.
Extract SNPs and Indels
gatk SelectVariants \
-R reference.fa \
-V cohort.vcf.gz \
--select-type-to-include SNP \
-O snps.vcf.gz
gatk SelectVariants \
-R reference.fa \
-V cohort.vcf.gz \
--select-type-to-include INDEL \
-O indels.vcf.gz
Apply Hard Filters
# Filter SNPs
gatk VariantFiltration \
-R reference.fa \
-V snps.vcf.gz \
-O snps.filtered.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
--filter-expression "SOR > 3.0" --filter-name "SOR3"
# Filter Indels
gatk VariantFiltration \
-R reference.fa \
-V indels.vcf.gz \
-O indels.filtered.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 200.0" --filter-name "FS200" \
--filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
--filter-expression "SOR > 10.0" --filter-name "SOR10"
Merge Filtered Variants
gatk MergeVcfs \
-I snps.filtered.vcf.gz \
-I indels.filtered.vcf.gz \
-O cohort.filtered.vcf.gz
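After hard filtering, it helps to see how many records each filter flagged. `VariantFiltration` writes the names of failed filters into the VCF `FILTER` column (column 7), separated by semicolons, and `PASS` for records that fail none. The sketch below tallies that column with `awk`; `count_filters` is a hypothetical helper name, and it reads uncompressed VCF text on standard input.

```shell
# Hypothetical helper: tally FILTER values from VCF text on stdin.
# A record carrying several filters (e.g. "QD2;FS60") is counted once per filter.
# Usage: gzip -dc cohort.filtered.vcf.gz | count_filters
count_filters() {
    awk -F'\t' '
        /^#/ { next }                       # skip header lines
        {
            n = split($7, names, ";")       # FILTER is column 7
            for (i = 1; i <= n; i++) count[names[i]]++
        }
        END {
            for (name in count) printf "%s\t%d\n", name, count[name]
        }
    ' | sort
}
```

A filter that removes a large share of records is usually worth a second look before the thresholds are accepted as-is.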
Base Quality Score Recalibration (BQSR)
Preprocessing step to correct systematic errors in base quality scores.
Step 1: BaseRecalibrator
gatk BaseRecalibrator \
-R reference.fa \
-I sample.bam \
--known-sites dbsnp.vcf.gz \
--known-sites known_indels.vcf.gz \
-O recal_data.table
Step 2: ApplyBQSR
gatk ApplyBQSR \
-R reference.fa \
-I sample.bam \
--bqsr-recal-file recal_data.table \
-O sample.recal.bam
Parallel Processing
Scatter by Interval
# Split calling across intervals (bash brace expansion; starts one background job per contig)
for interval in chr{1..22} chrX chrY; do
    gatk HaplotypeCaller \
        -R reference.fa \
        -I sample.bam \
        -L "$interval" \
        -O "sample.${interval}.g.vcf.gz" \
        -ERC GVCF &
done
wait
# Gather GVCFs (inputs must be listed in reference order)
gatk GatherVcfs \
-I sample.chr1.g.vcf.gz \
-I sample.chr2.g.vcf.gz \
... \
-O sample.g.vcf.gz
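The scatter loop above starts every contig at once, and the gather step needs its inputs in reference order. A sketch of one way to cap concurrency and build the ordered input list is shown below. It assumes the same contig names as the loop (`chr1` to `chr22`, `chrX`, `chrY`) and relies on `xargs -P`, which is a GNU/BSD extension rather than POSIX. The function names and the `GATK` override variable are hypothetical.

```shell
# GATK launcher; set GATK=echo for a dry run that only prints the commands.
GATK="${GATK:-gatk}"

# Print contig names in reference order (assumed: chr1..chr22, chrX, chrY).
list_contigs() {
    i=1
    while [ "$i" -le 22 ]; do
        echo "chr$i"
        i=$((i + 1))
    done
    echo chrX
    echo chrY
}

# Run HaplotypeCaller once per contig, at most $4 jobs at a time (default 4).
# Usage: scatter_call reference.fa sample.bam sample 4
scatter_call() {
    ref="$1"
    bam="$2"
    sample="$3"
    jobs="${4:-4}"
    list_contigs | xargs -P "$jobs" -I '{}' \
        "$GATK" HaplotypeCaller -R "$ref" -I "$bam" -L '{}' \
        -O "$sample.{}.g.vcf.gz" -ERC GVCF
}

# Print "-I <shard>" arguments, one per line, in reference order.
# Usage: gatk GatherVcfs $(gather_inputs sample) -O sample.g.vcf.gz
gather_inputs() {
    sample="$1"
    list_contigs | while read -r contig; do
        printf '%s %s.%s.g.vcf.gz\n' '-I' "$sample" "$contig"
    done
}
```

The unquoted `$(gather_inputs sample)` in the usage line is deliberate: word splitting turns the output into separate arguments, which is safe only while the sample name contains no whitespace.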
Native PairHMM Threads
gatk HaplotypeCaller \
-R reference.fa \
-I sample.bam \
-O sample.vcf.gz \
--native-pair-hmm-threads 4
CNN Score Variant Filter (Deep Learning)
An alternative to VQSR that scores variants with a convolutional neural network. GATK positions it mainly for single-sample callsets.
Score Variants
gatk CNNScoreVariants \
-R reference.fa \
-V cohort.vcf.gz \
-O cohort.cnn_scored.vcf.gz \
--tensor-type reference
Filter by CNN Score
gatk FilterVariantTranches \
-V cohort.cnn_scored.vcf.gz \
-O cohort.cnn_filtered.vcf.gz \
--resource hapmap.vcf.gz \
--resource mills.vcf.gz \
--info-key CNN_1D \
--snp-tranche 99.95 \
--indel-tranche 99.4
Complete Single-Sample Pipeline
#!/bin/bash
# Pass the sample name as the first argument; expects <sample>.bam,
# duplicate-marked and indexed, in the working directory.
set -euo pipefail
SAMPLE="${1:?usage: $0 SAMPLE}"
REF=reference.fa
DBSNP=dbsnp.vcf.gz
KNOWN_INDELS=known_indels.vcf.gz
# BQSR: build the recalibration table, then apply it
gatk BaseRecalibrator -R "$REF" -I "${SAMPLE}.bam" \
    --known-sites "$DBSNP" --known-sites "$KNOWN_INDELS" \
    -O "${SAMPLE}.recal.table"
gatk ApplyBQSR -R "$REF" -I "${SAMPLE}.bam" \
    --bqsr-recal-file "${SAMPLE}.recal.table" \
    -O "${SAMPLE}.recal.bam"
# Call variants in GVCF mode
gatk HaplotypeCaller -R "$REF" -I "${SAMPLE}.recal.bam" \
    -O "${SAMPLE}.g.vcf.gz" -ERC GVCF
# Single-sample genotyping
gatk GenotypeGVCFs -R "$REF" -V "${SAMPLE}.g.vcf.gz" \
    -O "${SAMPLE}.vcf.gz"
# Hard filter (SNP thresholds applied to all records)
gatk VariantFiltration -R "$REF" -V "${SAMPLE}.vcf.gz" \
    -O "${SAMPLE}.filtered.vcf.gz" \
    --filter-expression "QD < 2.0" --filter-name "LowQD" \
    --filter-expression "FS > 60.0" --filter-name "HighFS" \
    --filter-expression "MQ < 40.0" --filter-name "LowMQ"
Key Annotations
| Annotation | Description | Typical Pass Range (Hard Filters) |
|---|---|---|
| QD | Quality by Depth | > 2.0 |
| FS | Fisher Strand | < 60 (SNP), < 200 (Indel) |
| SOR | Strand Odds Ratio | < 3 (SNP), < 10 (Indel) |
| MQ | Mapping Quality | > 40 |
| MQRankSum | MQ Rank Sum Test | > -12.5 |
| ReadPosRankSum | Read Position Rank Sum | > -8.0 (SNP), > -20.0 (Indel) |
Resource Files
| Resource | Use |
|---|---|
| dbSNP | Known variants (prior=2.0) |
| HapMap | Training/truth SNPs (prior=15.0) |
| Omni | Training SNPs (prior=12.0) |
| 1000G SNPs | Training SNPs (prior=10.0) |
| Mills Indels | Training/truth indels (prior=12.0) |
Related Skills
- variant-calling - bcftools alternative
- alignment-files - BAM preprocessing
- filtering-best-practices - Post-calling filtering
- variant-normalization - Normalize before annotation
- vep-snpeff-annotation - Annotate final calls