Performance Guide#
This guide covers performance optimization strategies for large-scale single-cell trial analysis.
Performance Overview#
Typical Performance#
On a standard laptop (16GB RAM):
10K cells, 50 features: < 5 seconds
100K cells, 500 features: ~30-60 seconds
500K cells, 2000 features: ~5-10 minutes
Memory requirements:
100K cells: ~2-4 GB
500K cells: ~8-12 GB
1M+ cells: Requires optimization strategies below
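These figures can be sanity-checked with a back-of-envelope estimate. The sketch below is not part of the package; it simply multiplies matrix dimensions by bytes per value, assuming a float32 expression matrix:

```python
def estimate_gb(n_cells, n_genes, dtype_bytes=4, density=1.0):
    """Rough memory estimate: cells x genes x bytes-per-value x fraction non-zero."""
    return n_cells * n_genes * dtype_bytes * density / 1e9

# Dense float32 matrix for 100K cells x 2,000 features
print(f"{estimate_gb(100_000, 2_000):.1f} GB dense")
# Same matrix stored sparse at ~10% non-zero entries
print(f"{estimate_gb(100_000, 2_000, density=0.1):.2f} GB sparse (approx.)")
```

Actual usage is higher than the raw matrix because layers, `obs`/`var` metadata, and intermediate results add overhead.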
Optimization Strategies#
1. Use Pseudobulk Aggregation (Most Important!)#
Always use aggregate="participant_visit" (the default):
# ✓ FAST: Aggregates to ~100-1000 participant-visit combinations
res = st.did_table(
    adata,
    features=features,
    design=design,
    visits=("V1", "V2"),
    aggregate="participant_visit"  # Default
)
Avoid cell-level analysis for large datasets:
# ✗ SLOW: Fits regression on millions of cells
res = st.did_table(
    adata,
    features=features,
    design=design,
    visits=("V1", "V2"),
    aggregate="cell"  # Not recommended
)
Performance impact: 10-100x faster with pseudobulk
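The speedup comes from collapsing the data before fitting: the regression sees one row per participant-visit combination instead of one row per cell. A toy sketch of that aggregation step (column names `participant` and `visit` are illustrative, not the package's internals):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy cell-level data: 10,000 "cells" from 20 participants x 2 visits
cells = pd.DataFrame({
    "participant": rng.integers(0, 20, size=10_000),
    "visit": rng.choice(["V1", "V2"], size=10_000),
    "score": rng.normal(size=10_000),
})

# Pseudobulk: one mean per participant-visit combination
pseudobulk = (
    cells.groupby(["participant", "visit"], as_index=False)["score"].mean()
)

print(len(cells), "cells ->", len(pseudobulk), "pseudobulk rows")
```

Here 10,000 rows collapse to at most 40, so the model fit is trivially cheap regardless of cell count.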
2. Analyze Gene Sets, Not All Genes#
# ✓ FAST: Test 10-50 gene sets
gene_sets = {
    "OXPHOS": ["COX7A1", "ATP5F1A", ...],
    "Glycolysis": ["PKM", "LDHA", ...],
    # ... 50 total sets
}
adata = st.score_gene_sets(adata, gene_sets, prefix="ms_")
features = [f"ms_{k}" for k in gene_sets.keys()]
# Fast analysis
res = st.did_table(adata, features=features, design=design, visits=("V1", "V2"))
vs.
# ✗ SLOWER: Test all ~20,000 genes
features = adata.var_names.tolist()
res = st.did_table(adata, features=features, design=design, visits=("V1", "V2"))
Performance impact: 100-400x faster with gene sets
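The impact is roughly linear because one model is fit per feature. A rough scaling estimate, using plain NumPy least squares as a stand-in for the package's per-feature regression (pseudobulk row count and design width are illustrative assumptions):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n = 200  # pseudobulk rows (participant-visit level)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + 3 covariates

start = time.perf_counter()
fits = 1_000
for _ in range(fits):
    y = rng.normal(size=n)
    np.linalg.lstsq(X, y, rcond=None)  # one fit per feature
per_fit = (time.perf_counter() - start) / fits

# Total runtime scales with the number of features tested
print(f"50 gene sets: ~{per_fit * 50 * 1e3:.1f} ms")
print(f"20,000 genes: ~{per_fit * 20_000:.1f} s")
```

The real per-fit cost is higher (weights, robust errors, bookkeeping), but the linear scaling in feature count is the point: 50 features versus 20,000 is where the 100-400x comes from.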
3. Use Sparse Matrices#
Ensure counts are stored as sparse matrices:
import scipy.sparse as sp
# Check if sparse
print(f"Is sparse: {sp.issparse(adata.X)}")
# Convert to sparse if needed
if not sp.issparse(adata.layers["counts"]):
    adata.layers["counts"] = sp.csr_matrix(adata.layers["counts"])
Memory savings: 10-50x reduction for typical scRNA-seq data
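You can measure the savings directly. A small demonstration with a synthetic matrix at ~5% density (real scRNA-seq matrices are often even sparser, which is where the 50x end of the range comes from):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Synthetic matrix: 1,000 cells x 2,000 genes, ~5% non-zero
dense = np.zeros((1_000, 2_000), dtype=np.float32)
mask = rng.random(dense.shape) < 0.05
dense[mask] = rng.random(mask.sum()).astype(np.float32)

sparse = sp.csr_matrix(dense)

dense_mb = dense.nbytes / 1e6
# CSR stores three arrays: values, column indices, and row pointers
sparse_mb = (sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes) / 1e6
print(f"Dense: {dense_mb:.1f} MB, sparse: {sparse_mb:.2f} MB")
```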
4. Use Built-in Parallelization#
For many features, use did_table_parallel() to distribute work across CPU cores:
res = st.did_table_parallel(
    adata,
    features=features,
    design=design,
    visits=("V1", "V2"),
    n_jobs=4  # Number of CPU cores
)
Performance impact: Near-linear speedup with core count
Note
Do not subsample or downsample cells to work around memory limits. Subsampling distorts pseudobulk means, alters cell-type composition within participant groups, and changes WLS weights — especially for rare cell types and low-abundance signals. Instead, use sparse storage (strategy 3), feature batching (see below), per-cell-type decomposition, or run on hardware with sufficient memory.
For Large-Scale Analysis#
Workflow for 1M+ Cells#
import sctrial as st
import scanpy as sc
import scipy.sparse as sp
# Step 1: Load data (backed mode to avoid loading X into memory)
adata = sc.read_h5ad("large_dataset.h5ad", backed='r')
print(f"Starting with {adata.n_obs:,} cells")
# Step 2: Bring into memory with sparse storage
adata = adata.to_memory()
if not sp.issparse(adata.X):
    adata.X = sp.csr_matrix(adata.X)
# Step 3: Filter low-expression genes to reduce feature space
sc.pp.filter_genes(adata, min_counts=10)
print(f"After gene filtering: {adata.n_vars:,} genes")
# Step 4: Normalize
adata = st.add_log1p_cpm_layer(adata, counts_layer="counts")
# Step 5: Score gene sets (reduces thousands of genes → tens of features)
gene_sets = {...} # Your gene sets
adata = st.score_gene_sets(adata, gene_sets, layer="log1p_cpm", prefix="ms_")
# Step 6: DiD analysis with parallelization
features = [f"ms_{k}" for k in gene_sets.keys()]
res = st.did_table_parallel(
    adata,
    features=features,
    design=design,
    visits=("baseline", "followup"),
    n_jobs=4
)
print(f"Analysis complete! Tested {len(features)} features.")
Batch Processing#
For genome-wide analysis, process genes in batches:
import gc
import numpy as np
import pandas as pd
from statsmodels.stats.multitest import multipletests

def process_in_batches(adata, all_features, design, visits, batch_size=1000):
    """Process features in batches to limit memory usage."""
    results = []
    n_batches = int(np.ceil(len(all_features) / batch_size))
    print(f"Processing {len(all_features)} features in {n_batches} batches...")
    for i in range(0, len(all_features), batch_size):
        batch_features = all_features[i:i + batch_size]
        print(f"Batch {i // batch_size + 1}/{n_batches}: {len(batch_features)} features")
        res_batch = st.did_table(
            adata,
            features=batch_features,
            design=design,
            visits=visits
        )
        results.append(res_batch)
        # Optional: Free memory between batches
        gc.collect()
    # Combine results
    full_res = pd.concat(results, ignore_index=True)
    # Recalculate FDR across all features (per-batch FDR would be too lenient)
    mask = full_res["p_DiD"].notna()
    full_res["FDR_DiD"] = np.nan
    if mask.sum() > 0:
        full_res.loc[mask, "FDR_DiD"] = multipletests(
            full_res.loc[mask, "p_DiD"],
            method="fdr_bh"
        )[1]
    return full_res
# Use it
all_genes = adata.var_names.tolist()
res = process_in_batches(
    adata,
    all_genes,
    design,
    visits=("V1", "V2"),
    batch_size=1000
)
Memory Management#
Monitor Memory Usage#
import psutil
import os
def print_memory_usage():
    """Print current memory usage."""
    process = psutil.Process(os.getpid())
    mem = process.memory_info().rss / 1024 ** 3  # GB
    print(f"Memory usage: {mem:.2f} GB")
print_memory_usage() # Before analysis
res = st.did_table(...)
print_memory_usage() # After analysis
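If psutil is not installed, the standard library's tracemalloc is an alternative. It reports allocations made through Python's allocator, so it can undercount memory held by C extensions, but it needs no extra dependency:

```python
import tracemalloc

tracemalloc.start()

# ... run the analysis here; a list allocation stands in for the workload
workload = [list(range(1_000)) for _ in range(1_000)]

current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
tracemalloc.stop()
```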
Reduce Memory Footprint#
# 1. Use float32 instead of float64
import numpy as np
adata.X = adata.X.astype(np.float32)
# 2. Remove unnecessary layers
del adata.layers["unnecessary_layer"]
# 3. Subset to genes of interest before analysis
genes_of_interest = [...] # Your gene list
adata = adata[:, genes_of_interest].copy()
# 4. Free memory explicitly
import gc
del large_object
gc.collect()
Out-of-Core Processing#
For datasets too large to fit in memory, use backed mode:
import scanpy as sc
# Read in backed mode (doesn't load X into memory)
adata = sc.read_h5ad("huge_dataset.h5ad", backed='r')
# Extract only what you need
genes_to_analyze = ["Gene1", "Gene2", ...]
adata_subset = adata[:, genes_to_analyze].to_memory()
# Now analyze the subset
res = st.did_table(adata_subset, features=genes_to_analyze, design=design, visits=("V1", "V2"))
GSEA Performance#
Optimize GSEA Analysis#
# 1. Pre-filter low-expression genes
import scanpy as sc
sc.pp.filter_genes(adata, min_counts=50)
# 2. Reduce permutations for speed (at cost of precision)
res = st.run_gsea_did(
    adata,
    gene_sets="KEGG_2021_Human",
    design=design,
    visits=("V1", "V2"),
    permutation_num=1000,  # Default is 1000; use 100 for a quick test
    min_size=15,   # Skip very small gene sets
    max_size=500   # Skip very large gene sets
)
# 3. Use specific gene set library (faster than "all")
# Instead of all MSigDB, choose specific collection
res = st.run_gsea_did(
    adata,
    gene_sets="KEGG_2021_Human",  # Specific library
    design=design,
    visits=("V1", "V2")
)
Parallel Processing (Advanced)#
In addition to did_table_parallel() (see strategy 4 above), you can manually
parallelize across cell types:
Parallelize Across Cell Types#
import pandas as pd
from joblib import Parallel, delayed

def run_did_for_celltype(adata, celltype, features, design, visits):
    """Run DiD for a single cell type."""
    try:
        res = st.did_table(
            adata,
            features=features,
            design=design,
            celltype=celltype,
            visits=visits
        )
        res["celltype"] = celltype
        return res
    except Exception as e:
        print(f"Failed for {celltype}: {e}")
        return None
# Parallel execution
celltypes = ["CD4_T", "CD8_T", "Monocytes", "B_cells", "NK"]
results = Parallel(n_jobs=4)(
    delayed(run_did_for_celltype)(adata, ct, features, design, ("V1", "V2"))
    for ct in celltypes
)
# Combine results
results = [r for r in results if r is not None]
full_res = pd.concat(results, ignore_index=True)
Warning: Ensure your system has enough memory for parallel processes — each worker may hold its own copy of the data, so peak usage can approach n_jobs times the single-process footprint.
Benchmarking#
Benchmark Your Analysis#
import time
# Time your analysis
start = time.time()
res = st.did_table(
    adata,
    features=features,
    design=design,
    visits=("V1", "V2")
)
elapsed = time.time() - start
print(f"Analysis completed in {elapsed:.2f} seconds")
print(f"Time per feature: {elapsed/len(features):.3f} seconds")
Profiling#
For detailed performance analysis:
import cProfile
import pstats
# Profile the analysis
profiler = cProfile.Profile()
profiler.enable()
res = st.did_table(adata, features=features, design=design, visits=("V1", "V2"))
profiler.disable()
# Print stats
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 slowest functions
Hardware Recommendations#
For Different Dataset Sizes#
Small datasets (< 100K cells):
CPU: Any modern laptop
RAM: 8 GB
Storage: Local SSD
Medium datasets (100K - 500K cells):
CPU: Desktop with 4+ cores
RAM: 16-32 GB
Storage: SSD recommended
Large datasets (500K - 2M cells):
CPU: Workstation with 8+ cores
RAM: 64-128 GB
Storage: Fast NVMe SSD
Very large datasets (> 2M cells):
Use cluster/cloud computing with sufficient RAM
Process features in batches
Use backed mode + sparse storage
Quick Performance Checklist#
Before running large-scale analysis:
# ✓ 1. Check data format
import scipy.sparse as sp
assert sp.issparse(adata.X), "Convert to sparse!"
# ✓ 2. Use pseudobulk
aggregate = "participant_visit" # Not "cell"!
# ✓ 3. Analyze gene sets, not all genes
n_features = len(features)
assert n_features < 1000, f"Too many features ({n_features}), use gene sets!"
# ✓ 4. Use parallelization for many features
# res = st.did_table_parallel(adata, features, design, visits, n_jobs=4)
# ✓ 5. Monitor memory
print_memory_usage()
# NOW run analysis
res = st.did_table(
    adata,
    features=features,
    design=design,
    visits=("V1", "V2"),
    aggregate="participant_visit"
)