Python Performance Patterns

Experiment Overview

Item	Details
Date	2025-12-12
Goal	Implement safe, high-impact performance optimizations from audit
Environment	Python 3.10+, NumPy, scikit-image
Status	Success - 5 patterns verified

Context

After a performance audit identified 40+ issues, we needed to prioritize which fixes were safe to implement without extensive testing. These patterns are low-risk, high-reward optimizations.

Pattern 1: @lru_cache for File Metadata

Problem: Functions that check file types open files repeatedly.

Before (3-5 file opens per call):

python

def get_img_type(img_f):
    is_ome = check_is_ome(img_f)      # Opens file
    can_use_vips = check_to_use_vips(img_f)  # Opens file
    can_use_openslide = check_to_use_openslide(img_f)  # Opens file

After (cached after first call):

python

from functools import lru_cache

@lru_cache(maxsize=256)
def check_is_ome(src_f):
    # ... implementation

@lru_cache(maxsize=256)
def check_to_use_vips(src_f):
    # ... implementation

@lru_cache(maxsize=256)
def get_img_type(img_f):
    # ... implementation

Speedup: 3-5x for batch file operations

Safe because: File metadata doesn't change during a session

Pattern 2: Vectorized Overlap Matrix

Problem: O(n²) nested loop for pixel-by-pixel counting.

Before:

python

overlap = np.zeros((max_label1 + 1, max_label2 + 1), dtype=np.int32)
for i in range(labels1.shape[0]):
    for j in range(labels1.shape[1]):
        l1 = labels1[i, j]
        l2 = labels2[i, j]
        overlap[l1, l2] += 1

After:

python

overlap = np.zeros((max_label1 + 1, max_label2 + 1), dtype=np.int64)
np.add.at(overlap, (labels1.ravel(), labels2.ravel()), 1)

Speedup: 5-10x for large label arrays

Safe because: Mathematically equivalent, uses NumPy built-in

Pattern 3: argpartition for Top-K

Problem: Full sort when only need k smallest/largest values.

Before (O(n log n)):

python

neighbor_indices = np.argsort(distances[i])[1:n_neighbors+1]

After (O(n)):

python

neighbor_indices = np.argpartition(distances[i], n_neighbors+1)[1:n_neighbors+1]

Speedup: 3-5x for large arrays

Safe because: Returns same indices, just not sorted (usually fine for k-NN)

Caveat: If you need the k values in sorted order, add a second sort on just k elements.

Pattern 4: Parallel Image Loading

Problem: Sequential I/O when loading many images.

Before:

python

img_list = [io.imread(os.path.join(src_dir, f)) for f in img_f_list]

After:

python

from concurrent.futures import ThreadPoolExecutor

img_paths = [os.path.join(src_dir, f) for f in img_f_list]
with ThreadPoolExecutor(max_workers=min(8, len(img_paths))) as executor:
    img_list = list(executor.map(io.imread, img_paths))

Speedup: 10-20x for 50+ images

Safe because: Image reads are independent, thread-safe

Pattern 5: Single Directory Scan

Problem: Multiple glob calls scan directory repeatedly.

Before (5 filesystem scans):

python

patterns = ["*.tif", "*.tiff", "*.zarr", "*.ome.tif", "*.ome.tiff"]
for pattern in patterns:
    for f in cycle_dir.glob(pattern):
        process(f)

After (1 filesystem scan):

python

valid_extensions = {".tif", ".tiff", ".zarr"}
valid_suffixes = {".ome.tif", ".ome.tiff"}
for f in cycle_dir.iterdir():
    suffix_lower = f.suffix.lower()
    name_lower = f.name.lower()
    if suffix_lower in valid_extensions or any(name_lower.endswith(s) for s in valid_suffixes):
        process(f)

Speedup: 2-3x for directories with many files

Safe because: Same files matched, just more efficiently

Failed Attempts (Critical)

Attempt	Why it Failed	Lesson Learned
Caching image processing results	Wrong results for different inputs	Only cache metadata, not computed results
Caching BaSiC illumination profiles	15-20% intensity errors for sparse markers	Each channel needs its own profile (see basic-caching-evaluation skill)
Parallelizing CPU-bound NumPy	GIL contention, no speedup	ThreadPoolExecutor only for I/O-bound tasks
Removing "unnecessary" array copies	Broke downstream code expecting copies	Test thoroughly before removing .copy()

Priority Matrix

Priority	Pattern	Risk	Speedup
P0	Parallel image loading	Low	10-20x
P1	@lru_cache metadata	Low	3-5x
P1	Vectorized overlap	Low	5-10x
P2	argpartition	Low	3-5x
P2	Single directory scan	Low	2-3x
P3	Remove dask copies	Medium	2x memory

When NOT to Apply These Patterns

@lru_cache: Don't cache if function has side effects or file might change
argpartition: Don't use if you need results in sorted order
ThreadPoolExecutor: Don't use for CPU-bound operations (use ProcessPoolExecutor)
Removing copies: Don't remove if downstream code modifies the array

Key Insights

Start with I/O optimizations - biggest wins with lowest risk
Vectorization beats loops in NumPy, always
Profile before optimizing - intuition is often wrong
Test numerical accuracy after optimization, not just correctness

References

NumPy performance tips documentation
Python concurrent.futures documentation
KINTSUGI Performance Audit (2025-12-12)

Search AI Tools

python-performance-patterns

Install this agent skill to your Project

SKILL.md

Python Performance Patterns

Experiment Overview

Context

Pattern 1: @lru_cache for File Metadata

Pattern 2: Vectorized Overlap Matrix

Pattern 3: argpartition for Top-K

Pattern 4: Parallel Image Loading

Pattern 5: Single Directory Scan

Failed Attempts (Critical)

Priority Matrix

When NOT to Apply These Patterns

Key Insights

References