bio-genome-intervals-gtf-gff-handling
Install this agent skill to your Project
npx add-skill https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills/tree/main/skills/bio-genome-intervals-gtf-gff-handling
name: bio-genome-intervals-gtf-gff-handling
description: Parse, query, and convert GTF and GFF3 annotation files. Extract gene, transcript, and exon coordinates using gffread, gtfparse, and gffutils. Use when extracting specific features from gene annotations or converting between annotation formats.
tool_type: mixed
primary_tool: gffread
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
GTF/GFF Handling
GTF and GFF3 are the standard gene annotation formats. Both use 1-based, inclusive coordinates, unlike BED, which is 0-based and half-open.
Format Comparison
| Feature | GTF | GFF3 |
|---|---|---|
| Coordinate system | 1-based, inclusive | 1-based, inclusive |
| Hierarchy | Implicit (gene_id, transcript_id) | Explicit (Parent attribute) |
| Attribute format | key "value"; | key=value; |
| Comments | # | # |
| Fasta sequences | Not standard | ##FASTA directive |
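The coordinate row above matters most when converting to BED, which is 0-based and half-open. A minimal sketch of the conversion in both directions follows; the helper names `gtf_to_bed_interval` and `bed_to_gtf_interval` are hypothetical and not part of any library.

```python
def gtf_to_bed_interval(start, end):
    """Convert a 1-based inclusive GTF/GFF3 interval to 0-based half-open BED."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based interval: {start}-{end}")
    # Only the start shifts; the inclusive end equals the exclusive BED end.
    return start - 1, end


def bed_to_gtf_interval(start, end):
    """Convert a 0-based half-open BED interval to 1-based inclusive GTF/GFF3."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid BED interval: {start}-{end}")
    return start + 1, end


# The DDX11L1 gene record used in the format examples of this document.
bed_start, bed_end = gtf_to_bed_interval(11869, 14409)
length = bed_end - bed_start  # feature length in bp
print(bed_start, bed_end, length)
```

In BED the length is simply `end - start`, whereas in GTF/GFF3 it is `end - start + 1`.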
GTF Format
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";
GFF3 Format
chr1 HAVANA gene 11869 14409 . + . ID=ENSG00000223972;Name=DDX11L1
chr1 HAVANA mRNA 11869 14409 . + . ID=ENST00000456328;Parent=ENSG00000223972
chr1 HAVANA exon 11869 12227 . + . ID=exon1;Parent=ENST00000456328
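The two attribute syntaxes shown above can be parsed with the standard library alone. The sketch below is illustrative, not how gtfparse or gffutils do it internally: the function names are hypothetical, repeated GTF keys keep only their first value, and GFF3 values are split on commas and percent-decoded as the GFF3 specification describes.

```python
import re
from urllib.parse import unquote

# key "value"; pairs, also tolerating unquoted values such as exon_number 1;
_GTF_ATTR = re.compile(r'(\w+)\s+(?:"([^"]*)"|([^\s";]+))\s*(?:;|$)')


def parse_gtf_attributes(field):
    """Parse a GTF attribute column into a dict of strings."""
    attrs = {}
    for match in _GTF_ATTR.finditer(field):
        key, quoted, bare = match.groups()
        # Assumption: keep the first value when a key is repeated.
        attrs.setdefault(key, quoted if quoted is not None else bare)
    return attrs


def parse_gff3_attributes(field):
    """Parse a GFF3 attribute column into a dict of value lists."""
    attrs = {}
    for pair in field.strip().strip(';').split(';'):
        if not pair:
            continue
        key, _, value = pair.partition('=')
        # Split on ',' before decoding so escaped commas stay inside a value.
        attrs[unquote(key.strip())] = [unquote(v) for v in value.split(',')]
    return attrs


print(parse_gtf_attributes('gene_id "ENSG00000223972"; gene_name "DDX11L1";'))
print(parse_gff3_attributes('ID=ENST00000456328;Parent=ENSG00000223972'))
```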
Parse GTF with gtfparse (Python)
Installation
pip install gtfparse
Basic Parsing
import gtfparse
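# Load entire GTF
# Note: gtfparse 2.x returns a polars DataFrame by default, while the pandas-style
# filtering below assumes pandas; in 2.x, pass result_type='pandas' to read_gtf.
df = gtfparse.read_gtf('annotation.gtf')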
# View columns
print(df.columns)
# ['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame',
# 'gene_id', 'transcript_id', 'gene_name', ...]
# Filter by feature type
genes = df[df['feature'] == 'gene']
transcripts = df[df['feature'] == 'transcript']
exons = df[df['feature'] == 'exon']
# Get specific gene
gene_df = df[df['gene_name'] == 'TP53']
Extract Gene Coordinates
import gtfparse
df = gtfparse.read_gtf('annotation.gtf')
# All genes
genes = df[df['feature'] == 'gene'][['seqname', 'start', 'end', 'strand', 'gene_id', 'gene_name']]
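# Convert to BED6 (0-based, half-open): chrom, start, end, name, score, strand
genes_bed = genes.copy()
genes_bed['start'] = genes_bed['start'] - 1  # GTF is 1-based, BED is 0-based
genes_bed['score'] = 0  # BED column 5 is a numeric score, so gene_id cannot go there
# gene_id is kept as an extra seventh column
genes_bed = genes_bed[['seqname', 'start', 'end', 'gene_name', 'score', 'strand', 'gene_id']]
genes_bed.to_csv('genes.bed', sep='\t', header=False, index=False)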
Get Exons for Gene
import gtfparse
df = gtfparse.read_gtf('annotation.gtf')
# Get all exons for TP53
tp53_exons = df[(df['gene_name'] == 'TP53') & (df['feature'] == 'exon')]
tp53_exons = tp53_exons[['seqname', 'start', 'end', 'transcript_id', 'exon_number']]
print(tp53_exons)
Parse GFF with gffutils (Python)
Installation
pip install gffutils
Create Database
import gffutils
# Create database (slow first time, fast for subsequent queries)
db = gffutils.create_db('annotation.gff3', 'annotation.db',
                        force=True, merge_strategy='create_unique')
# Or load existing database
db = gffutils.FeatureDB('annotation.db')
Query Features
import gffutils
db = gffutils.FeatureDB('annotation.db')
# Count features by type
for featuretype in db.featuretypes():
    count = db.count_features_of_type(featuretype)
    print(f'{featuretype}: {count}')
# Get all genes
for gene in db.features_of_type('gene'):
    print(f'{gene.id}: {gene.seqid}:{gene.start}-{gene.end}')
# Get gene by ID
gene = db['ENSG00000141510'] # TP53
print(f'{gene.attributes["Name"][0]}: {gene.seqid}:{gene.start}-{gene.end}')
# Get children (transcripts, exons)
for transcript in db.children(gene, featuretype='mRNA'):
    print(f' Transcript: {transcript.id}')
    for exon in db.children(transcript, featuretype='exon'):
        print(f' Exon: {exon.start}-{exon.end}')
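The parent/child lookups above rest on the GFF3 `Parent` attribute. As a rough illustration of the idea (not of gffutils internals), the following standard-library sketch builds the same hierarchy in memory from the three GFF3 example records shown earlier. The names `build_hierarchy` and `descendants` are hypothetical, and every feature is assumed to carry an `ID`.

```python
from collections import defaultdict

GFF3_TEXT = (
    "chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tID=ENSG00000223972;Name=DDX11L1\n"
    "chr1\tHAVANA\tmRNA\t11869\t14409\t.\t+\t.\tID=ENST00000456328;Parent=ENSG00000223972\n"
    "chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tID=exon1;Parent=ENST00000456328\n"
)


def build_hierarchy(gff3_text):
    """Return (features by ID, parent ID -> list of child IDs)."""
    features = {}
    children = defaultdict(list)
    for line in gff3_text.splitlines():
        if not line or line.startswith('#'):
            continue
        cols = line.split('\t')
        attrs = dict(pair.split('=', 1) for pair in cols[8].split(';') if pair)
        feature_id = attrs['ID']  # assumption: every record has an ID
        features[feature_id] = {
            'type': cols[2], 'start': int(cols[3]), 'end': int(cols[4]),
        }
        # A feature may list several parents, separated by commas.
        for parent in attrs.get('Parent', '').split(','):
            if parent:
                children[parent].append(feature_id)
    return features, children


def descendants(children, feature_id):
    """Yield all descendant IDs of a feature, depth-first."""
    for child in children.get(feature_id, []):
        yield child
        yield from descendants(children, child)


features, children = build_hierarchy(GFF3_TEXT)
for fid in descendants(children, 'ENSG00000223972'):
    print(fid, features[fid]['type'])
```

For real annotation files the gffutils database remains the better choice, since it indexes features on disk instead of holding them all in memory.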
Get Introns
import gffutils
db = gffutils.FeatureDB('annotation.db')
# Get introns for a transcript
# interfeatures works on consecutive features, so request the exons sorted by start
transcript = db['ENST00000269305']
exons = db.children(transcript, featuretype='exon', order_by='start')
introns = list(db.interfeatures(exons, new_featuretype='intron'))
for intron in introns:
    print(f'Intron: {intron.start}-{intron.end}')
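Conceptually, introns are the gaps between consecutive exons of one transcript. The sketch below shows that arithmetic in plain Python with 1-based inclusive coordinates; `introns_from_exons` is a hypothetical helper and the exon coordinates are made up for illustration.

```python
def introns_from_exons(exons):
    """Return 1-based inclusive intron intervals between the exons of one transcript.

    exons: iterable of (start, end) tuples in 1-based inclusive coordinates.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Abutting or overlapping exons leave no gap, so no intron is emitted.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Hypothetical exon coordinates, given out of order on purpose.
example_exons = [(501, 650), (100, 200), (301, 400)]
print(introns_from_exons(example_exons))
```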
Convert Formats with gffread (CLI)
Installation
conda install -c bioconda gffread
GTF to GFF3
gffread annotation.gtf -o annotation.gff3
GFF3 to GTF
gffread annotation.gff3 -T -o annotation.gtf
Extract Sequences
# Extract transcript sequences
gffread -w transcripts.fa -g genome.fa annotation.gtf
# Extract CDS sequences
gffread -x cds.fa -g genome.fa annotation.gtf
# Extract protein sequences
gffread -y proteins.fa -g genome.fa annotation.gtf
Filter Features
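# Keep only coding transcripts (-C discards transcripts without CDS features); -T writes GTF
gffread annotation.gtf -C -T -o coding.gtf
# Keep only records of protein-coding genes.
# gffread's --keep-genes flag takes no value (it preserves gene records in the output),
# so filter on the biotype attribute instead (gene_biotype in Ensembl, gene_type in GENCODE)
grep -E '^#|gene_(bio)?type "protein_coding"' annotation.gtf > protein_coding.gtf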
Extract Regions with bedtools
Get Promoters
# Extract TSS (transcription start sites) as 1-bp BED intervals
awk '$3 == "transcript"' annotation.gtf | \
    awk -v OFS='\t' '{
        if ($7 == "+") print $1, $4-1, $4, ".", ".", $7;
        else print $1, $5-1, $5, ".", ".", $7;
    }' > tss.bed
# Get promoter regions (2 kb upstream of TSS, strand-aware via -s)
# genome.txt is a two-column chromosome sizes file (name<TAB>length)
bedtools flank -i tss.bed -g genome.txt -l 2000 -r 0 -s > promoters.bed
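The strand logic of the two commands above can be checked by hand with a small Python sketch. It is meant to mirror the awk TSS step followed by `bedtools flank -l 2000 -r 0 -s`, under the assumption that flanks are clipped at chromosome ends; `tss_position` and `promoter_window` are hypothetical names.

```python
def tss_position(start, end, strand):
    """Return the 1-based TSS: feature start on '+', feature end on '-'."""
    return start if strand == '+' else end


def promoter_window(start, end, strand, chrom_size, upstream=2000):
    """Return the 0-based half-open interval upstream of the TSS.

    The TSS base itself is excluded, and the window is clipped to
    [0, chrom_size].
    """
    tss = tss_position(start, end, strand)
    if strand == '+':
        # Upstream lies to the left of the 1-bp TSS interval (tss-1, tss).
        bed_start, bed_end = tss - 1 - upstream, tss - 1
    else:
        # On the minus strand, upstream lies to the right of (tss-1, tss).
        bed_start, bed_end = tss, tss + upstream
    return max(bed_start, 0), min(bed_end, chrom_size)


# Transcript record from the GTF example; 248956422 is the GRCh38 chr1 length.
print(promoter_window(11869, 14409, '+', 248956422))
# Hypothetical minus-strand transcript near the end of a 6 kb contig.
print(promoter_window(1000, 5000, '-', 6000))
```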
Get Gene Bodies
# Extract gene coordinates to BED
awk '$3 == "gene"' annotation.gtf | \
    awk -v OFS='\t' '{
        split($0, a, "gene_id \""); split(a[2], b, "\"");
        print $1, $4-1, $5, b[1], ".", $7;
    }' > genes.bed
Get Exons
# Extract unique exons
awk '$3 == "exon"' annotation.gtf | \
    awk -v OFS='\t' '{print $1, $4-1, $5, ".", ".", $7}' | \
    sort -k1,1 -k2,2n | uniq > exons.bed
Python: GTF to BED Conversion
import gtfparse
import pandas as pd
def gtf_to_bed(gtf_path, feature_type='gene', output_path=None):
    '''Convert GTF features to BED format.'''
    df = gtfparse.read_gtf(gtf_path)
    features = df[df['feature'] == feature_type].copy()
    # Convert to 0-based coordinates
    bed = pd.DataFrame({
        'chrom': features['seqname'],
        'start': features['start'] - 1,
        'end': features['end'],
        'name': features.get('gene_name', features.get('gene_id', '.')),
        'score': 0,
        'strand': features['strand']
    })
    if output_path:
        bed.to_csv(output_path, sep='\t', header=False, index=False)
    return bed
# Usage
genes_bed = gtf_to_bed('annotation.gtf', 'gene', 'genes.bed')
exons_bed = gtf_to_bed('annotation.gtf', 'exon', 'exons.bed')
Validate GTF/GFF
# Check GTF format
gffread -E annotation.gtf
# Check GFF3 format
gffread -E annotation.gff3
# Detailed validation
gt gff3validator annotation.gff3 # requires genometools
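Before running the validators above, a quick structural check can catch obvious problems such as wrong column counts or inverted coordinates. The sketch below covers only a few basic rules and is no substitute for `gt gff3validator`; `check_gtf_line` is a hypothetical helper.

```python
def check_gtf_line(line):
    """Return a list of problems found in one GTF/GFF3 data line (empty if none)."""
    cols = line.rstrip('\n').split('\t')
    if len(cols) != 9:
        return [f'expected 9 tab-separated columns, found {len(cols)}']
    _seqname, _source, _feature, start, end, score, strand, frame, _attrs = cols
    problems = []
    try:
        start_i, end_i = int(start), int(end)
    except ValueError:
        problems.append('start and end must be integers')
    else:
        if start_i < 1 or start_i > end_i:
            problems.append('coordinates must satisfy 1 <= start <= end')
    if strand not in {'+', '-', '.', '?'}:  # '?' is valid in GFF3 only
        problems.append(f'invalid strand: {strand!r}')
    if frame not in {'0', '1', '2', '.'}:
        problems.append(f'invalid frame/phase: {frame!r}')
    if score != '.':
        try:
            float(score)
        except ValueError:
            problems.append(f'score is neither a number nor ".": {score!r}')
    return problems


good = 'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "ENSG00000223972";'
bad = 'chr1\tHAVANA\texon\t12227\t11869\t.\t*\t.\tgene_id "ENSG00000223972";'
print(check_gtf_line(good))
print(check_gtf_line(bad))
```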
Common Attributes
GTF Attributes
| Attribute | Description |
|---|---|
| gene_id | Ensembl gene ID |
| gene_name | Gene symbol |
| gene_biotype | protein_coding, lncRNA, etc. (named gene_type in GENCODE files) |
| transcript_id | Ensembl transcript ID |
| transcript_name | Transcript symbol |
| exon_number | Exon position in transcript |
| exon_id | Ensembl exon ID |
GFF3 Attributes
| Attribute | Description |
|---|---|
| ID | Unique feature identifier |
| Name | Display name |
| Parent | Parent feature ID |
| Dbxref | Database cross-references |
| gene_biotype | Gene type |
Memory-Efficient Processing
import gtfparse
# gtfparse loads the whole table into memory.
# For very large files, either use the gffutils database approach
# or restrict parsing to the feature types you need:
df = gtfparse.read_gtf('annotation.gtf',
                       features=['gene', 'exon'])  # Only load specific features
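When even the filtered table is too large, records can be streamed one line at a time with the standard library. The generator below is a minimal sketch of that approach, not a gtfparse feature; `iter_gtf_features` is a hypothetical name, and the demo reads the three GTF example records from this document via `io.StringIO` instead of a file.

```python
import io

GTF_TEXT = (
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; gene_name "DDX11L1";\n'
    'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; transcript_id "ENST00000456328";\n'
    'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";\n'
)


def iter_gtf_features(handle, features):
    """Yield (seqname, feature, start, end, strand) for the wanted feature types.

    Only one line is held in memory at a time.
    """
    wanted = set(features)
    for line in handle:
        if line.startswith('#') or not line.strip():
            continue
        cols = line.rstrip('\n').split('\t')
        if len(cols) == 9 and cols[2] in wanted:
            yield cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6]


# With a real file, replace io.StringIO(...) with open('annotation.gtf').
for record in iter_gtf_features(io.StringIO(GTF_TEXT), ['exon']):
    print(record)
```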
Related Skills
- bed-file-basics - BED format and conversion
- interval-arithmetic - Gene/exon overlap analysis
- proximity-operations - TSS proximity analysis
- differential-expression/de-results - Gene coordinate mapping