Agent skill

fastq-analysis-pipeline

Guide through omicverse's alignment module for SRA downloading, FASTQ quality control, STAR alignment, gene quantification, and single-cell kallisto/bustools pipelines covering both bulk and single-cell RNA-seq workflows.

Stars 2,009
Forks 275

Install this agent skill to your Project

npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/fastq-analysis

SKILL.md

Overview

OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the ov.alignment module. This skill covers:

  • SRA data acquisition: prefetch and fqdump (fasterq-dump wrapper)
  • Quality control: fastp for adapter trimming and QC reports
  • RNA-seq alignment: STAR aligner with auto-index building
  • Gene quantification: featureCount (subread featureCounts wrapper)
  • Single-cell path: ref and count via kb-python (kallisto/bustools)
  • Parallel SRA download: parallel_fastq_dump

All functions share a common CLI infrastructure (_cli_utils.py) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.

Instructions

  1. Environment setup

    • Bioinformatics tools are resolved automatically from PATH or the active conda environment.
    • If auto_install=True (default), missing tools are installed via mamba/conda on demand.
    • Supported tools: prefetch, vdb-validate, fasterq-dump, fastp, STAR, samtools, featureCounts, pigz, gzip.
    • For the single-cell path, ensure kb-python is installed: pip install kb-python.
  2. SRA data download (ov.alignment.prefetch + ov.alignment.fqdump)

    • Use prefetch first for reliable downloads with integrity validation (vdb-validate).
    • Then convert to FASTQ with fqdump. It auto-detects single-end vs paired-end.
    • fqdump can also work directly from SRR accessions without prefetch.
    • Both support retry with exponential backoff for network errors.
    python
    import omicverse as ov
    
    # Step 1: Prefetch SRA files (optional but recommended)
    pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4)
    
    # Step 2: Convert to FASTQ
    fq = ov.alignment.fqdump(['SRR1234567', 'SRR1234568'],
                              output_dir='fastq', sra_dir='prefetch',
                              gzip=True, threads=8, jobs=4)
    
  3. FASTQ quality control (ov.alignment.fastp)

    • Runs fastp for adapter trimming, quality filtering, and QC reporting.
    • Supports single-end and paired-end reads.
    • Produces per-sample JSON and HTML QC reports.
    • Sample format: tuple of (sample_name, fq1_path, fq2_path_or_None).
    python
    samples = [
        ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'),
        ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'),
    ]
    clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2)
    
  4. STAR alignment (ov.alignment.STAR)

    • Aligns FASTQ reads using the STAR aligner.
    • Auto-index building: set auto_index=True (default) with genome_fasta_files and gtf to build index automatically if missing.
    • Produces coordinate-sorted BAM files.
    • Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).
    • Use strict=False (default) for graceful error handling per sample.
    python
    # Prepare samples from fastp output
    star_samples = [
        ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'),
        ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'),
    ]
    bams = ov.alignment.STAR(
        star_samples,
        genome_dir='star_index',
        output_dir='star_out',
        gtf='genes.gtf',
        genome_fasta_files=['genome.fa'],
        threads=8,
        memory='50G',
    )
    
  5. Gene quantification (ov.alignment.featureCount)

    • Counts aligned reads per gene using featureCounts (subread).
    • Auto-detects paired-end from BAM headers (via pysam or samtools).
    • auto_fix=True (default) retries with corrected paired-end flag on error.
    • gene_mapping=True maps gene_id to gene_name from the GTF.
    • merge_matrix=True produces a combined count matrix across all samples.
    python
    bam_items = [
        ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'),
        ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'),
    ]
    counts = ov.alignment.featureCount(
        bam_items,
        gtf='genes.gtf',
        output_dir='counts',
        gene_mapping=True,
        merge_matrix=True,
        threads=8,
    )
    # counts is a pandas DataFrame (gene_id x samples)
    
  6. Single-cell path (ov.alignment.ref + ov.alignment.count)

    • Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.
    • ref() builds a kallisto index and transcript-to-gene mapping.
    • count() quantifies single-cell data with barcode/UMI handling.
    • Supports technologies: 10XV2, 10XV3, BULK, and custom.
    • Output formats: h5ad, loom, cellranger MTX.
    python
    # Build reference index
    ref_result = ov.alignment.ref(
        index_path='kb_ref/index.idx',
        t2g_path='kb_ref/t2g.txt',
        fasta_paths=['genome.fa'],
        gtf_paths=['genes.gtf'],
        threads=8,
    )
    
    # Quantify 10x v3 data
    count_result = ov.alignment.count(
        index_path='kb_ref/index.idx',
        t2g_path='kb_ref/t2g.txt',
        technology='10XV3',
        fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'],
        output_path='kb_out',
        h5ad=True,
        filter_barcodes=True,
        threads=8,
    )
    
  7. Wiring fastp output into STAR input

    • fastp output is a list of dicts with keys: sample, clean1, clean2, json, html.
    • Convert to STAR sample tuples:
    python
    star_samples = [
        (r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None)
        for r in (clean if isinstance(clean, list) else [clean])
    ]
    
  8. Wiring STAR output into featureCount input

    • STAR output is a list of dicts with keys: sample, bam (or error).
    • Convert to featureCount items:
    python
    bam_items = [
        (r['sample'], r['bam'])
        for r in (bams if isinstance(bams, list) else [bams])
        if 'bam' in r
    ]
    
  9. Skipping completed steps

    • All functions check for existing outputs and skip if overwrite=False (default).
    • Set overwrite=True to force re-execution.
  10. Troubleshooting

    • If a tool is not found, check auto_install=True and that conda/mamba is accessible.
    • For STAR index errors, ensure genome_fasta_files points to uncompressed or gzip FASTA files.
    • For featureCounts paired-end detection errors, auto_fix=True handles most cases automatically.
    • GTF files can be gzip-compressed; they are auto-decompressed as needed.

Critical API Reference

Sample Format Convention

All alignment functions use a consistent sample tuple format:

  • FASTQ samples: (sample_name, fq1_path, fq2_path_or_None)
  • BAM items: (sample_name, bam_path) or (sample_name, bam_path, is_paired_bool)
  • Single samples can be passed as a single tuple; multiple as a list of tuples.
  • When a single tuple is passed, the return value is a single dict; for a list, a list of dicts.

Auto-installation

python
# All functions support these parameters:
auto_install=True   # Auto-install missing tools via conda/mamba
overwrite=False     # Skip if outputs already exist
threads=8           # Per-tool thread count
jobs=None           # Concurrent job count (auto-detected from CPU count)

Examples

  • Bulk RNA-seq from SRA: prefetch -> fqdump -> fastp -> STAR -> featureCount -> pandas DataFrame
  • Single-cell 10x v3: ref -> count with technology='10XV3' -> h5ad AnnData
  • Local FASTQ files: Skip download steps, start directly with fastp -> STAR -> featureCount

References

  • See reference.md for copy-paste-ready code templates.

Expand your agent's capabilities with these related and highly-rated skills.

FreedomIntelligence/OpenClaw-Medical-Skills

vcf-annotator

Annotate VCF variants with VEP, ClinVar, gnomAD frequencies, and ancestry-aware context. Generates prioritised variant reports.

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

chemist-analyst

Analyzes events through chemistry lens using molecular structure, reaction mechanisms, thermodynamics, kinetics, and analytical techniques (spectroscopy, chromatography, mass spectrometry). Provides insights on chemical processes, material properties, reaction pathways, synthesis, and analytical methods. Use when: Chemical reactions, material analysis, synthesis planning, process optimization, environmental chemistry. Evaluates: Molecular structure, reaction mechanisms, yield, selectivity, safety, environmental impact.

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

bio-alignment-io

Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

sleep-analyzer

分析睡眠数据、识别睡眠模式、评估睡眠质量,并提供个性化睡眠改善建议。支持与其他健康数据的关联分析。

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

metabolomics-workbench-database

Access NIH Metabolomics Workbench via REST API (4,200+ studies). Query metabolites, RefMet nomenclature, MS/NMR data, m/z searches, study metadata, for metabolomics and biomarker discovery.

2,009 275
Explore
FreedomIntelligence/OpenClaw-Medical-Skills

bio-hi-c-analysis-matrix-operations

Balance, normalize, and transform Hi-C contact matrices using cooler and cooltools. Apply iterative correction (ICE), compute expected values, and generate observed/expected matrices. Use when normalizing or transforming Hi-C matrices.

2,009 275
Explore

Didn't find tool you were looking for?

Be as detailed as possible for better results