bio-genome-assembly-hifi-assembly
name: bio-genome-assembly-hifi-assembly
description: High-quality genome assembly from PacBio HiFi reads using hifiasm with phasing support. Use when building reference-quality diploid assemblies from HiFi data, especially with trio or Hi-C phasing for fully resolved haplotypes.
tool_type: cli
primary_tool: hifiasm
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
HiFi Assembly
Basic Assembly
# HiFi-only assembly (primary contigs plus partially phased haplotypes)
hifiasm -o output_prefix -t 32 reads.hifi.fastq.gz
# Output files (HiFi-only mode):
# output_prefix.bp.p_ctg.gfa - Primary contigs
# output_prefix.bp.hap1.p_ctg.gfa - Haplotype 1 (partially phased)
# output_prefix.bp.hap2.p_ctg.gfa - Haplotype 2 (partially phased)
# With --primary, hifiasm writes these instead:
# output_prefix.p_ctg.gfa - Primary contigs
# output_prefix.a_ctg.gfa - Alternate contigs
# Convert GFA to FASTA
awk '/^S/{print ">"$2;print $3}' output_prefix.bp.p_ctg.gfa > assembly.fasta
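The same GFA-to-FASTA conversion can be sketched in Python when awk is not convenient. This is a minimal illustration, not part of hifiasm: the function name `gfa_to_fasta` and the `width` parameter are hypothetical, and it assumes tab-separated `S` lines with the sequence in the third column, as the awk one-liner does.

```python
import io


def gfa_to_fasta(gfa_handle, fasta_handle, width=80):
    """Write each GFA segment (S line) as a FASTA record; return the record count."""
    count = 0
    for line in gfa_handle:
        if not line.startswith("S\t"):
            continue  # skip header, link, and other non-segment lines
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue  # GFA permits '*' when no sequence is stored on the line
        fasta_handle.write(f">{name}\n")
        # Wrap the sequence so the FASTA stays readable and tool-friendly
        for start in range(0, len(seq), width):
            fasta_handle.write(seq[start:start + width] + "\n")
        count += 1
    return count


# Small in-memory demonstration with made-up contig names
demo_gfa = io.StringIO(
    "H\tVN:Z:1.0\n"
    "S\tptg000001l\tACGTACGTAC\tLN:i:10\n"
    "S\tptg000002l\t*\tLN:i:5\n"
)
demo_fasta = io.StringIO()
written = gfa_to_fasta(demo_gfa, demo_fasta, width=4)
print(written)
print(demo_fasta.getvalue(), end="")
```

For real files, open the `.gfa` for reading and the `.fasta` for writing and pass both handles in. Unlike the awk command, this version wraps sequence lines and skips segments without a stored sequence.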
Trio-Binned Phasing
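# Build yak k-mer databases from parental Illumina reads first.
# For paired-end input, pass both mates as one stream, given twice.
yak count -b37 -t16 -o paternal.yak <(cat paternal_R1.fq.gz paternal_R2.fq.gz) <(cat paternal_R1.fq.gz paternal_R2.fq.gz)
yak count -b37 -t16 -o maternal.yak <(cat maternal_R1.fq.gz maternal_R2.fq.gz) <(cat maternal_R1.fq.gz maternal_R2.fq.gz)
# Then assemble the child reads with trio binning
hifiasm -o trio_asm -t 32 \
  -1 paternal.yak \
  -2 maternal.yak \
  child.hifi.fastq.gz
# Output: trio_asm.dip.hap1.p_ctg.gfa (paternal) and trio_asm.dip.hap2.p_ctg.gfa (maternal)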
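The trio workflow has an ordering constraint: both yak databases must exist before hifiasm runs. The sketch below builds the three commands in run order. The helper names `yak_count_cmd` and `trio_cmds` are hypothetical, and the yak invocation follows yak's documented pattern for paired-end reads, where both mates are concatenated into one stream and that stream is supplied twice. The strings use process substitution, so they are meant for bash.

```python
import shlex


def yak_count_cmd(out_yak, r1, r2, threads=16):
    """Bash command that counts k-mers from paired-end reads into one yak database."""
    # Both mates go into a single stream, and the same stream is passed twice
    stream = f"<(cat {shlex.quote(r1)} {shlex.quote(r2)})"
    return f"yak count -b37 -t{threads} -o {shlex.quote(out_yak)} {stream} {stream}"


def trio_cmds(child_hifi, paternal, maternal, prefix="trio_asm", threads=32):
    """Commands in run order: two yak databases, then the trio-binned assembly."""
    return [
        yak_count_cmd("paternal.yak", *paternal),
        yak_count_cmd("maternal.yak", *maternal),
        f"hifiasm -o {shlex.quote(prefix)} -t {threads} "
        f"-1 paternal.yak -2 maternal.yak {shlex.quote(child_hifi)}",
    ]


cmds = trio_cmds(
    "child.hifi.fastq.gz",
    ("paternal_R1.fq.gz", "paternal_R2.fq.gz"),
    ("maternal_R1.fq.gz", "maternal_R2.fq.gz"),
)
for cmd in cmds:
    print(cmd)
```

Each string could be run with `subprocess.run(cmd, shell=True, executable="/bin/bash", check=True)`. Plain `sh` may not support `<(...)`.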
Hi-C Phasing
# Use Hi-C reads for phasing (no parents needed)
hifiasm -o hic_asm -t 32 \
--h1 hic_R1.fastq.gz \
--h2 hic_R2.fastq.gz \
reads.hifi.fastq.gz
# Produces fully phased hic_asm.hic.hap1.p_ctg.gfa and hic_asm.hic.hap2.p_ctg.gfa
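Output file names differ by mode, which is easy to trip over when scripting downstream steps. The sketch below assumes the naming used by recent hifiasm releases: a `bp` tag for HiFi-only runs, `hic` for Hi-C runs, and `dip` for trio runs, with no separate primary graph in trio mode. The function names are hypothetical, and the mapping should be checked against the files your hifiasm version actually writes.

```python
from pathlib import Path


def expected_gfas(prefix, mode="hifi"):
    """Contig graphs expected from a hifiasm run in the given mode."""
    tags = {"hifi": "bp", "hic": "hic", "trio": "dip"}  # assumed naming scheme
    if mode not in tags:
        raise ValueError(f"unknown mode: {mode!r}")
    tag = tags[mode]
    names = [f"{prefix}.{tag}.hap1.p_ctg.gfa", f"{prefix}.{tag}.hap2.p_ctg.gfa"]
    if mode != "trio":
        # HiFi-only and Hi-C runs also write a primary contig graph
        names.insert(0, f"{prefix}.{tag}.p_ctg.gfa")
    return names


def missing_outputs(prefix, mode="hifi", directory="."):
    """Expected graphs that are not present in the directory."""
    return [n for n in expected_gfas(prefix, mode) if not (Path(directory) / n).is_file()]


print(expected_gfas("hic_asm", "hic"))
print(expected_gfas("trio_asm", "trio"))
```

Calling `missing_outputs("hic_asm", "hic")` after a run gives a quick check that the assembly finished and wrote what the next step expects.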
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| -t | 1 | Threads |
| -l | 3 (0 for trio) | Purge level (0=none, 1=light, 2=aggressive, 3=most aggressive) |
| -s | 0.55 | Similarity threshold for duplicate haplotigs (0.55 at -l 3; 0.75 at -l 1 or -l 2) |
| --primary | - | Output primary and alternate assemblies instead of haplotype-resolved ones |
| --n-hap | 2 | Expected number of haplotypes |
| -D | 5.0 | Drop k-mers occurring more than D*coverage times |
| -N | 100 | Consider up to max(D*coverage, N) overlaps for each oriented read |
Purge Duplicates
# Aggressive purging for high heterozygosity (the non-trio default, -l 3, is more aggressive still)
hifiasm -o asm -t 32 -l 2 reads.hifi.fastq.gz
# No purging, for inbred or homozygous samples
hifiasm -o asm -t 32 -l 0 reads.hifi.fastq.gz
Ultra-Long ONT Integration
# Combine HiFi accuracy with ONT length
hifiasm -o hybrid_asm -t 32 \
--ul ont_ultralong.fastq.gz \
hifi_reads.fastq.gz
# UL reads help span complex repeats
Assembly Stats
# Quick stats with seqkit
seqkit stats assembly.fasta
# Detailed with assembly-stats
assembly-stats assembly.fasta
# QUAST assessment
quast.py -o quast_output assembly.fasta
# BUSCO completeness
busco -i assembly.fasta -l mammalia_odb10 -o busco_out -m genome
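When none of these tools is installed, contig N50 can be computed directly. The sketch below is a minimal stand-in, not a replacement for QUAST. The function names are hypothetical. N50 here is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size.

```python
import io


def fasta_lengths(handle):
    """Return the sequence length of each record in a FASTA stream."""
    lengths = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)  # start a new record
        elif line and lengths:
            lengths[-1] += len(line)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


demo = io.StringIO(">ctg1\nACGTACGTAC\n>ctg2\nACGTACGT\n>ctg3\nACGTAC\n>ctg4\nACGT\n>ctg5\nAC\n")
lengths = fasta_lengths(demo)
print(lengths, n50(lengths))
```

For a real assembly, pass an open handle such as `open("assembly.fasta")` to `fasta_lengths`.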
Memory and Runtime
| Genome Size | HiFi Coverage | RAM | Time (32 cores) |
|---|---|---|---|
| 3 Gb | 30x | ~200 GB | 12-24 hours |
| 3 Gb | 60x | ~400 GB | 24-48 hours |
| 500 Mb | 40x | ~64 GB | 2-4 hours |
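To place a dataset in the table above, it helps to know its actual coverage, which is total read bases divided by genome size. The sketch below counts bases in a FASTQ stream and derives coverage. The function names are hypothetical, and it assumes standard four-line FASTQ records.

```python
import io


def fastq_total_bases(handle):
    """Sum the sequence lengths in a four-line-per-record FASTQ stream."""
    total = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of each record is the sequence
            total += len(line.strip())
    return total


def coverage(total_bases, genome_size):
    """Average depth: sequenced bases divided by genome size in bases."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


demo = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n")
bases = fastq_total_bases(demo)
print(bases, coverage(bases, genome_size=4))
```

For compressed reads, open the file with `gzip.open("reads.hifi.fastq.gz", "rt")` and pass that handle in. With a 3 Gb genome, about 90 Gb of HiFi bases corresponds to the 30x row.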
Python Wrapper
import subprocess
from pathlib import Path


def run_hifiasm(hifi_reads, output_prefix, threads=32, purge_level=None,
                hic_r1=None, hic_r2=None, ul_reads=None):
    """Run hifiasm and convert the primary contig graph to FASTA."""
    cmd = ['hifiasm', '-o', output_prefix, '-t', str(threads)]
    if purge_level is not None:  # otherwise keep hifiasm's own default
        cmd.extend(['-l', str(purge_level)])
    use_hic = bool(hic_r1 and hic_r2)
    if use_hic:
        cmd.extend(['--h1', hic_r1, '--h2', hic_r2])
    if ul_reads:
        cmd.extend(['--ul', ul_reads])
    cmd.append(hifi_reads)
    subprocess.run(cmd, check=True)

    # Hi-C runs write <prefix>.hic.*; HiFi-only runs write <prefix>.bp.*
    tag = 'hic' if use_hic else 'bp'
    gfa = Path(f'{output_prefix}.{tag}.p_ctg.gfa')
    fasta = Path(f'{output_prefix}.fasta')
    with open(gfa) as f, open(fasta, 'w') as out:
        for line in f:
            if line.startswith('S\t'):  # segment lines hold the contig sequences
                parts = line.rstrip('\n').split('\t')
                out.write(f'>{parts[1]}\n{parts[2]}\n')
    return fasta


# Example
assembly = run_hifiasm('sample.hifi.fq.gz', 'sample_asm', threads=48,
                       hic_r1='hic_R1.fq.gz', hic_r2='hic_R2.fq.gz')
Troubleshooting
| Issue | Solution |
|---|---|
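| High duplication | Confirm purging is enabled (-l 3 is the non-trio default); if hifiasm misjudged the homozygous coverage peak, set it with --hom-cov |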
| Missing haplotypes | Add Hi-C or trio data for phasing |
| Memory errors | Reduce -D parameter or downsample reads |
| Fragmented assembly | Check read quality; consider UL ONT addition |
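For the downsampling suggestion, the fraction of reads to keep follows from the current and target coverage. The sketch below is a small helper under that assumption. The function name is hypothetical, and it presumes reads are sampled uniformly at random, so coverage scales in proportion to the fraction kept.

```python
def downsample_fraction(current_cov, target_cov):
    """Fraction of reads to keep so that coverage drops to roughly target_cov."""
    if current_cov <= 0 or target_cov <= 0:
        raise ValueError("coverage values must be positive")
    # Never ask for more reads than exist
    return min(1.0, target_cov / current_cov)


print(downsample_fraction(60, 30))
print(downsample_fraction(20, 30))
```

The result can be passed to a random subsampler, for example as the proportion argument of `seqkit sample -p`.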
Related Skills
- genome-assembly/assembly-qc - QUAST and BUSCO
- genome-assembly/scaffolding - YaHS Hi-C scaffolding
- genome-assembly/contamination-detection - CheckM2 decontamination
- long-read-sequencing/read-qc - HiFi quality control