Agent skill
phylogenetic-methods
Comprehensive guide to phylogenetic tree building methods including distance-based (UPGMA, Neighbor-Joining), maximum likelihood (RAxML, IQ-TREE), and Bayesian inference (MrBayes). Covers multiple sequence alignment, distance matrices, bootstrap analysis, consensus methods, tree formats (Newick, Nexus), and tree comparison metrics. Includes implementation patterns for tree visualization and evaluation.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/development/phylogenetic-methods
SKILL.md
Phylogenetic Methods
Purpose
Provide comprehensive guidance for implementing phylogenetic tree building and analysis methods, covering multiple approaches from distance-based to advanced statistical methods.
When to Use
This skill activates when:
- Building phylogenetic trees
- Performing sequence alignment
- Calculating distance matrices
- Running bootstrap analysis
- Creating consensus trees
- Comparing tree topologies
- Working with tree formats (Newick, Nexus)
- Implementing tree visualization
- Evaluating tree quality and support values
Overview of Methods
Distance-Based Methods (Fast, Good for Large Datasets)
- UPGMA: Simple clustering, assumes constant evolutionary rate
- Neighbor-Joining (NJ): More accurate, handles rate variation
Maximum Likelihood (ML) Methods (Accurate, Computationally Intensive)
- RAxML-ng: Fast ML implementation
- IQ-TREE: Modern, automatic model selection
- FastTree: Approximate ML, very fast
Bayesian Methods (Most Rigorous, Very Slow)
- MrBayes: MCMC sampling of tree space
- BEAST: Time-calibrated trees
Multiple Sequence Alignment
Overview
Before building trees, sequences must be aligned to identify homologous positions.
Tools Integration
# app/services/phylo/alignment.py
import subprocess
from pathlib import Path
from typing import List, Dict
from Bio import SeqIO, AlignIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import tempfile
class MultipleSequenceAligner:
"""Align sequences using external tools."""
def __init__(
self,
method: str = "muscle",
muscle_path: str = "muscle",
mafft_path: str = "mafft"
):
self.method = method
self.muscle_path = muscle_path
self.mafft_path = mafft_path
async def align(
self,
sequences: List[Dict[str, str]],
params: Dict = None
) -> str:
"""
Align multiple sequences.
Args:
sequences: List of dicts with 'id' and 'sequence'
params: Method-specific parameters
Returns:
Aligned sequences in FASTA format
"""
if self.method == "muscle":
return await self._align_muscle(sequences, params or {})
elif self.method == "mafft":
return await self._align_mafft(sequences, params or {})
else:
raise ValueError(f"Unknown alignment method: {self.method}")
async def _align_muscle(
self,
sequences: List[Dict[str, str]],
params: Dict
) -> str:
"""Align using MUSCLE."""
with tempfile.NamedTemporaryFile(
mode='w', suffix='.fasta', delete=False
) as input_file:
# Write sequences to temp file
for seq_data in sequences:
input_file.write(f">{seq_data['id']}\n{seq_data['sequence']}\n")
input_file.flush()
with tempfile.NamedTemporaryFile(
mode='r', suffix='.fasta', delete=False
) as output_file:
# Run MUSCLE
cmd = [
self.muscle_path,
"-in", input_file.name,
"-out", output_file.name
]
# Add parameters
if params.get("max_iterations"):
cmd.extend(["-maxiters", str(params["max_iterations"])])
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=600
)
if result.returncode != 0:
raise AlignmentError(f"MUSCLE failed: {result.stderr}")
# Read aligned sequences
return Path(output_file.name).read_text()
async def _align_mafft(
self,
sequences: List[Dict[str, str]],
params: Dict
) -> str:
"""Align using MAFFT."""
with tempfile.NamedTemporaryFile(
mode='w', suffix='.fasta', delete=False
) as input_file:
for seq_data in sequences:
input_file.write(f">{seq_data['id']}\n{seq_data['sequence']}\n")
input_file.flush()
# Run MAFFT
cmd = [self.mafft_path]
# Algorithm selection
if params.get("algorithm") == "auto":
cmd.append("--auto")
elif params.get("algorithm") == "linsi":
cmd.append("--localpair")
cmd.append("--maxiterate")
cmd.append("1000")
cmd.append(input_file.name)
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=600
)
if result.returncode != 0:
raise AlignmentError(f"MAFFT failed: {result.stderr}")
return result.stdout
def validate_alignment(self, alignment_text: str) -> Dict:
"""
Validate alignment quality.
Returns:
Quality metrics
"""
alignment = AlignIO.read(
StringIO(alignment_text),
"fasta"
)
length = alignment.get_alignment_length()
num_sequences = len(alignment)
# Calculate gap percentage
gaps = sum(
1 for record in alignment
for base in str(record.seq)
if base == '-'
)
total = length * num_sequences
gap_percentage = gaps / total
# Calculate conserved positions
conserved = 0
for i in range(length):
column = alignment[:, i]
if len(set(column)) == 1:
conserved += 1
return {
"length": length,
"num_sequences": num_sequences,
"gap_percentage": gap_percentage,
"conserved_positions": conserved,
"conservation_rate": conserved / length
}
Distance-Based Methods
Distance Matrix Calculation
# app/services/phylo/distance.py
import numpy as np
from typing import List, Dict
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator
from io import StringIO
class DistanceMatrixCalculator:
"""Calculate distance matrices from alignments."""
def __init__(self, model: str = "identity"):
"""
Initialize calculator.
Args:
model: Distance model (identity, blastn, etc.)
"""
self.model = model
def calculate(self, alignment_text: str) -> np.ndarray:
"""
Calculate distance matrix.
Args:
alignment_text: Aligned sequences in FASTA format
Returns:
Distance matrix (n_sequences, n_sequences)
"""
alignment = AlignIO.read(StringIO(alignment_text), "fasta")
if self.model == "identity":
return self._identity_distance(alignment)
elif self.model == "jukes_cantor":
return self._jukes_cantor_distance(alignment)
elif self.model == "kimura":
return self._kimura_distance(alignment)
else:
# Use Biopython's calculator
calculator = DistanceCalculator(self.model)
dm = calculator.get_distance(alignment)
return np.array([dm[i] for i in range(len(dm))])
def _identity_distance(self, alignment) -> np.ndarray:
"""Simple identity-based distance."""
n = len(alignment)
matrix = np.zeros((n, n))
for i in range(n):
for j in range(i+1, n):
seq1 = str(alignment[i].seq)
seq2 = str(alignment[j].seq)
# Count differences (ignore gaps)
differences = sum(
1 for a, b in zip(seq1, seq2)
if a != '-' and b != '-' and a != b
)
valid_positions = sum(
1 for a, b in zip(seq1, seq2)
if a != '-' and b != '-'
)
distance = differences / valid_positions if valid_positions > 0 else 1.0
matrix[i, j] = distance
matrix[j, i] = distance
return matrix
def _jukes_cantor_distance(self, alignment) -> np.ndarray:
"""Jukes-Cantor correction for multiple substitutions."""
identity_matrix = self._identity_distance(alignment)
# Jukes-Cantor formula: d = -3/4 * ln(1 - 4/3 * p)
# where p is the proportion of different sites
with np.errstate(divide='ignore', invalid='ignore'):
jc_matrix = -0.75 * np.log(1 - (4.0/3.0) * identity_matrix)
jc_matrix[np.isnan(jc_matrix)] = 1.0 # Handle saturation
jc_matrix[np.isinf(jc_matrix)] = 1.0
return jc_matrix
def _kimura_distance(self, alignment) -> np.ndarray:
"""Kimura 2-parameter distance."""
n = len(alignment)
matrix = np.zeros((n, n))
for i in range(n):
for j in range(i+1, n):
seq1 = str(alignment[i].seq)
seq2 = str(alignment[j].seq)
transitions = 0 # A<->G, C<->T
transversions = 0 # Other changes
valid = 0
for a, b in zip(seq1, seq2):
if a == '-' or b == '-':
continue
valid += 1
if a != b:
if (a, b) in [('A','G'), ('G','A'), ('C','T'), ('T','C')]:
transitions += 1
else:
transversions += 1
if valid == 0:
distance = 1.0
else:
P = transitions / valid # Transition proportion
Q = transversions / valid # Transversion proportion
# Kimura formula
try:
distance = -0.5 * np.log((1 - 2*P - Q) * np.sqrt(1 - 2*Q))
except:
distance = 1.0
matrix[i, j] = distance
matrix[j, i] = distance
return matrix
UPGMA Implementation
# app/services/phylo/upgma.py
import numpy as np
from typing import List, Dict, Tuple
from ete3 import Tree
class UPGMATreeBuilder:
"""Build trees using UPGMA (Unweighted Pair Group Method with Arithmetic Mean)."""
def __init__(self):
self.taxa_names = []
def build(
self,
distance_matrix: np.ndarray,
taxa_names: List[str]
) -> Tree:
"""
Build UPGMA tree.
Args:
distance_matrix: Pairwise distance matrix
taxa_names: Names of taxa (same order as matrix)
Returns:
Phylogenetic tree (ete3 Tree)
"""
self.taxa_names = taxa_names
n = len(taxa_names)
# Initialize clusters (each taxon is a cluster)
clusters = {i: Tree(name=taxa_names[i]) for i in range(n)}
cluster_sizes = {i: 1 for i in range(n)}
# Working copy of distance matrix
dm = distance_matrix.copy()
# Track active clusters
active = set(range(n))
while len(active) > 1:
# Find minimum distance
min_dist = float('inf')
merge_i, merge_j = -1, -1
for i in active:
for j in active:
if i < j and dm[i, j] < min_dist:
min_dist = dm[i, j]
merge_i, merge_j = i, j
# Create new cluster
new_cluster = Tree()
new_cluster.add_child(clusters[merge_i], dist=min_dist/2)
new_cluster.add_child(clusters[merge_j], dist=min_dist/2)
# Update clusters
new_idx = max(clusters.keys()) + 1
clusters[new_idx] = new_cluster
cluster_sizes[new_idx] = cluster_sizes[merge_i] + cluster_sizes[merge_j]
# Update distance matrix (UPGMA averaging)
new_distances = {}
for k in active:
if k != merge_i and k != merge_j:
# Average distance weighted by cluster sizes
new_dist = (
dm[merge_i, k] * cluster_sizes[merge_i] +
dm[merge_j, k] * cluster_sizes[merge_j]
) / (cluster_sizes[merge_i] + cluster_sizes[merge_j])
new_distances[k] = new_dist
# Add new row/column to matrix
for k, dist in new_distances.items():
dm[new_idx, k] = dist
dm[k, new_idx] = dist
# Remove merged clusters
active.remove(merge_i)
active.remove(merge_j)
active.add(new_idx)
# Return root
return clusters[list(active)[0]]
Neighbor-Joining Implementation
# app/services/phylo/neighbor_joining.py
import numpy as np
from ete3 import Tree
class NeighborJoiningTreeBuilder:
"""Build trees using Neighbor-Joining algorithm."""
def build(
self,
distance_matrix: np.ndarray,
taxa_names: List[str]
) -> Tree:
"""
Build Neighbor-Joining tree.
Args:
distance_matrix: Pairwise distance matrix
taxa_names: Names of taxa
Returns:
Phylogenetic tree
"""
n = len(taxa_names)
dm = distance_matrix.copy()
# Initialize nodes
nodes = {i: Tree(name=taxa_names[i]) for i in range(n)}
active = set(range(n))
while len(active) > 2:
# Calculate net divergence (r)
r = {}
for i in active:
r[i] = sum(dm[i, j] for j in active if j != i)
# Calculate Q matrix
Q = np.full_like(dm, float('inf'))
for i in active:
for j in active:
if i < j:
Q[i, j] = (len(active) - 2) * dm[i, j] - r[i] - r[j]
# Find minimum Q
min_q = float('inf')
merge_i, merge_j = -1, -1
for i in active:
for j in active:
if i < j and Q[i, j] < min_q:
min_q = Q[i, j]
merge_i, merge_j = i, j
# Calculate branch lengths
dist_i = dm[merge_i, merge_j] / 2 + (r[merge_i] - r[merge_j]) / (2 * (len(active) - 2))
dist_j = dm[merge_i, merge_j] - dist_i
# Create new node
new_node = Tree()
new_node.add_child(nodes[merge_i], dist=max(0, dist_i))
new_node.add_child(nodes[merge_j], dist=max(0, dist_j))
# Update nodes
new_idx = max(nodes.keys()) + 1
nodes[new_idx] = new_node
# Update distance matrix
for k in active:
if k != merge_i and k != merge_j:
new_dist = (dm[merge_i, k] + dm[merge_j, k] - dm[merge_i, merge_j]) / 2
dm[new_idx, k] = new_dist
dm[k, new_idx] = new_dist
# Remove merged nodes
active.remove(merge_i)
active.remove(merge_j)
active.add(new_idx)
# Connect last two nodes
remaining = list(active)
if len(remaining) == 2:
i, j = remaining
root = Tree()
root.add_child(nodes[i], dist=dm[i, j]/2)
root.add_child(nodes[j], dist=dm[i, j]/2)
return root
return nodes[remaining[0]]
Maximum Likelihood Methods
RAxML-ng Integration
# app/services/phylo/raxml.py
import subprocess
from pathlib import Path
import tempfile
from ete3 import Tree
class RAxMLTreeBuilder:
"""Build ML trees using RAxML-ng."""
def __init__(self, raxml_path: str = "raxml-ng"):
self.raxml_path = raxml_path
async def build(
self,
alignment_text: str,
model: str = "GTR+G",
num_searches: int = 10,
num_bootstrap: int = 100
) -> Dict:
"""
Build ML tree with bootstrap support.
Args:
alignment_text: Aligned sequences (FASTA)
model: Substitution model
num_searches: Number of ML searches
num_bootstrap: Number of bootstrap replicates
Returns:
Dict with best tree and bootstrap tree
"""
with tempfile.TemporaryDirectory() as tmpdir:
tmpdir = Path(tmpdir)
# Write alignment
aln_file = tmpdir / "alignment.fasta"
aln_file.write_text(alignment_text)
# Run ML search
cmd = [
self.raxml_path,
"--msa", str(aln_file),
"--model", model,
"--search",
"--tree", f"pars{{{num_searches}}}",
"--prefix", str(tmpdir / "ml")
]
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=3600
)
if result.returncode != 0:
raise TreeBuildingError(f"RAxML failed: {result.stderr}")
# Run bootstrap
cmd_bs = [
self.raxml_path,
"--msa", str(aln_file),
"--model", model,
"--bootstrap",
"--bs-trees", str(num_bootstrap),
"--prefix", str(tmpdir / "bootstrap")
]
subprocess.run(cmd_bs, capture_output=True, timeout=3600)
# Map bootstrap support to best tree
cmd_support = [
self.raxml_path,
"--support",
"--tree", str(tmpdir / "ml.raxml.bestTree"),
"--bs-trees", str(tmpdir / "bootstrap.raxml.bootstraps"),
"--prefix", str(tmpdir / "support")
]
subprocess.run(cmd_support, capture_output=True, timeout=300)
# Read results
best_tree = Tree(str(tmpdir / "support.raxml.support"))
return {
"tree": best_tree,
"log_likelihood": self._parse_likelihood(
(tmpdir / "ml.raxml.log").read_text()
)
}
def _parse_likelihood(self, log_text: str) -> float:
"""Parse log likelihood from RAxML log."""
for line in log_text.split('\n'):
if "Final LogLikelihood:" in line:
return float(line.split()[-1])
return None
Bootstrap Analysis
# app/services/phylo/bootstrap.py
import numpy as np
from typing import List
from Bio import AlignIO
from io import StringIO
class BootstrapAnalyzer:
"""Perform bootstrap analysis for tree support."""
def __init__(self, tree_builder):
self.tree_builder = tree_builder
async def bootstrap(
self,
alignment_text: str,
num_replicates: int = 100,
method: str = "nj"
) -> List[Tree]:
"""
Generate bootstrap replicates.
Args:
alignment_text: Original alignment
num_replicates: Number of bootstrap samples
method: Tree building method
Returns:
List of bootstrap trees
"""
alignment = AlignIO.read(StringIO(alignment_text), "fasta")
length = alignment.get_alignment_length()
trees = []
for _ in range(num_replicates):
# Resample columns with replacement
indices = np.random.choice(length, size=length, replace=True)
# Create bootstrap alignment
bootstrap_aln = self._resample_alignment(alignment, indices)
# Build tree
tree = await self.tree_builder.build(bootstrap_aln)
trees.append(tree)
return trees
def _resample_alignment(self, alignment, indices):
"""Create resampled alignment."""
from Bio.Align import MultipleSeqAlignment
new_records = []
for record in alignment:
seq = str(record.seq)
new_seq = ''.join(seq[i] for i in indices)
new_records.append(
SeqRecord(Seq(new_seq), id=record.id, description="")
)
return MultipleSeqAlignment(new_records)
def calculate_support_values(
self,
best_tree: Tree,
bootstrap_trees: List[Tree]
) -> Tree:
"""
Map bootstrap support to best tree.
Args:
best_tree: Best tree from original data
bootstrap_trees: Trees from bootstrap replicates
Returns:
Tree with support values
"""
# Get bipartitions from best tree
best_bipartitions = self._get_bipartitions(best_tree)
# Count occurrences in bootstrap trees
support_counts = {bp: 0 for bp in best_bipartitions}
for bs_tree in bootstrap_trees:
bs_bipartitions = self._get_bipartitions(bs_tree)
for bp in best_bipartitions:
if bp in bs_bipartitions:
support_counts[bp] += 1
# Add support values to tree
for node in best_tree.traverse():
if not node.is_leaf():
leaves = frozenset(l.name for l in node.get_leaves())
if leaves in support_counts:
node.support = support_counts[leaves] / len(bootstrap_trees)
return best_tree
def _get_bipartitions(self, tree: Tree) -> List[frozenset]:
"""Extract bipartitions from tree."""
bipartitions = []
for node in tree.traverse():
if not node.is_leaf():
leaves = frozenset(l.name for l in node.get_leaves())
bipartitions.append(leaves)
return bipartitions
Tree Formats
Newick Format
# app/services/phylo/formats.py
from ete3 import Tree
class TreeFormatter:
"""Convert between tree formats."""
@staticmethod
def to_newick(tree: Tree, include_support: bool = True) -> str:
"""
Convert tree to Newick format.
Args:
tree: ete3 Tree object
include_support: Include support values
Returns:
Newick string
"""
if include_support:
return tree.write(format=0) # format 0: branch length + support
else:
return tree.write(format=5) # format 5: branch length only
@staticmethod
def from_newick(newick_str: str) -> Tree:
"""Parse Newick string."""
return Tree(newick_str, format=1)
@staticmethod
def to_nexus(tree: Tree, taxa_names: List[str]) -> str:
"""Convert to NEXUS format."""
nexus = "#NEXUS\n\n"
nexus += "BEGIN TAXA;\n"
nexus += f" DIMENSIONS NTAX={len(taxa_names)};\n"
nexus += " TAXLABELS\n"
for name in taxa_names:
nexus += f" {name}\n"
nexus += " ;\n"
nexus += "END;\n\n"
nexus += "BEGIN TREES;\n"
nexus += f" TREE tree1 = {tree.write(format=0)}\n"
nexus += "END;\n"
return nexus
Best Practices
✅ DO
- Use appropriate substitution models for your data
- Perform bootstrap analysis for support values
- Compare multiple tree-building methods
- Validate alignment quality before tree building
- Root trees appropriately (outgroup or midpoint)
- Calculate branch support values
- Document method parameters
- Test with known phylogenies
- Use consensus methods for multiple trees
- Consider computational cost vs accuracy trade-offs
❌ DON'T
- Skip sequence alignment
- Ignore alignment gaps (handle properly)
- Use wrong distance model
- Over-interpret low bootstrap values (<70%)
- Compare trees without proper metrics
- Ignore unrooted vs rooted trees
- Skip model selection
- Trust single-method trees for critical work
- Ignore branch lengths
- Use distance methods for deep phylogenies
Related Skills: rRNA-prediction-patterns, ml-integration-patterns
Line Count: < 500 lines ✅
Didn't find tool you were looking for?