Agent skill

python-performance-patterns

Low-risk Python performance optimization patterns with verified speedups

Stars 163
Forks 31

Install this agent skill to your Project

npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/development/python-performance-patterns-smith6jt-cop-skills-registry

SKILL.md

Python Performance Patterns

Experiment Overview

Item Details
Date 2025-12-12
Goal Implement safe, high-impact performance optimizations from audit
Environment Python 3.10+, NumPy, scikit-image
Status Success - 5 patterns verified

Context

After a performance audit identified 40+ issues, we needed to prioritize which fixes were safe to implement without extensive testing. These patterns are low-risk, high-reward optimizations.

Pattern 1: @lru_cache for File Metadata

Problem: Functions that check file types open files repeatedly.

Before (3-5 file opens per call):

python
def get_img_type(img_f):
    is_ome = check_is_ome(img_f)      # Opens file
    can_use_vips = check_to_use_vips(img_f)  # Opens file
    can_use_openslide = check_to_use_openslide(img_f)  # Opens file

After (cached after first call):

python
from functools import lru_cache

@lru_cache(maxsize=256)
def check_is_ome(src_f):
    # ... implementation

@lru_cache(maxsize=256)
def check_to_use_vips(src_f):
    # ... implementation

@lru_cache(maxsize=256)
def get_img_type(img_f):
    # ... implementation

Speedup: 3-5x for batch file operations

Safe because: File metadata doesn't change during a session


Pattern 2: Vectorized Overlap Matrix

Problem: O(n²) nested loop for pixel-by-pixel counting.

Before:

python
overlap = np.zeros((max_label1 + 1, max_label2 + 1), dtype=np.int32)
for i in range(labels1.shape[0]):
    for j in range(labels1.shape[1]):
        l1 = labels1[i, j]
        l2 = labels2[i, j]
        overlap[l1, l2] += 1

After:

python
overlap = np.zeros((max_label1 + 1, max_label2 + 1), dtype=np.int64)
np.add.at(overlap, (labels1.ravel(), labels2.ravel()), 1)

Speedup: 5-10x for large label arrays

Safe because: Mathematically equivalent, uses NumPy built-in


Pattern 3: argpartition for Top-K

Problem: Full sort when only need k smallest/largest values.

Before (O(n log n)):

python
neighbor_indices = np.argsort(distances[i])[1:n_neighbors+1]

After (O(n)):

python
neighbor_indices = np.argpartition(distances[i], n_neighbors+1)[1:n_neighbors+1]

Speedup: 3-5x for large arrays

Safe because: Returns same indices, just not sorted (usually fine for k-NN)

Caveat: If you need the k values in sorted order, add a second sort on just k elements.


Pattern 4: Parallel Image Loading

Problem: Sequential I/O when loading many images.

Before:

python
img_list = [io.imread(os.path.join(src_dir, f)) for f in img_f_list]

After:

python
from concurrent.futures import ThreadPoolExecutor

img_paths = [os.path.join(src_dir, f) for f in img_f_list]
with ThreadPoolExecutor(max_workers=min(8, len(img_paths))) as executor:
    img_list = list(executor.map(io.imread, img_paths))

Speedup: 10-20x for 50+ images

Safe because: Image reads are independent, thread-safe


Pattern 5: Single Directory Scan

Problem: Multiple glob calls scan directory repeatedly.

Before (5 filesystem scans):

python
patterns = ["*.tif", "*.tiff", "*.zarr", "*.ome.tif", "*.ome.tiff"]
for pattern in patterns:
    for f in cycle_dir.glob(pattern):
        process(f)

After (1 filesystem scan):

python
valid_extensions = {".tif", ".tiff", ".zarr"}
valid_suffixes = {".ome.tif", ".ome.tiff"}
for f in cycle_dir.iterdir():
    suffix_lower = f.suffix.lower()
    name_lower = f.name.lower()
    if suffix_lower in valid_extensions or any(name_lower.endswith(s) for s in valid_suffixes):
        process(f)

Speedup: 2-3x for directories with many files

Safe because: Same files matched, just more efficiently


Failed Attempts (Critical)

Attempt Why it Failed Lesson Learned
Caching image processing results Wrong results for different inputs Only cache metadata, not computed results
Caching BaSiC illumination profiles 15-20% intensity errors for sparse markers Each channel needs its own profile (see basic-caching-evaluation skill)
Parallelizing CPU-bound NumPy GIL contention, no speedup ThreadPoolExecutor only for I/O-bound tasks
Removing "unnecessary" array copies Broke downstream code expecting copies Test thoroughly before removing .copy()

Priority Matrix

Priority Pattern Risk Speedup
P0 Parallel image loading Low 10-20x
P1 @lru_cache metadata Low 3-5x
P1 Vectorized overlap Low 5-10x
P2 argpartition Low 3-5x
P2 Single directory scan Low 2-3x
P3 Remove dask copies Medium 2x memory

When NOT to Apply These Patterns

  1. @lru_cache: Don't cache if function has side effects or file might change
  2. argpartition: Don't use if you need results in sorted order
  3. ThreadPoolExecutor: Don't use for CPU-bound operations (use ProcessPoolExecutor)
  4. Removing copies: Don't remove if downstream code modifies the array

Key Insights

  • Start with I/O optimizations - biggest wins with lowest risk
  • Vectorization beats loops in NumPy, always
  • Profile before optimizing - intuition is often wrong
  • Test numerical accuracy after optimization, not just correctness

References

  • NumPy performance tips documentation
  • Python concurrent.futures documentation
  • KINTSUGI Performance Audit (2025-12-12)

Didn't find tool you were looking for?

Be as detailed as possible for better results