Frequently Asked Questions#

General Questions#

What is sctrial?#

sctrial is a Python package for trial-aware statistical inference on single-cell RNA-seq data from clinical trials. It implements Difference-in-Differences (DiD) analysis with participant fixed effects, designed specifically for longitudinal randomized trials at single-cell resolution.

What makes sctrial different from standard scRNA-seq tools?#

Traditional tools (like Scanpy’s rank_genes_groups or Seurat’s FindMarkers) don’t account for:

  • Longitudinal paired data structure

  • Treatment-by-time interactions (the core of DiD)

  • Participant-level clustering for valid inference

  • Clinical trial design elements (crossovers, arms, visits)

sctrial is purpose-built for these experimental designs.

Can I use sctrial with Seurat objects?#

Yes! Convert Seurat to AnnData:

# In R
library(Seurat)
library(SeuratDisk)

# Save as h5Seurat
SaveH5Seurat(seurat_obj, filename = "data.h5Seurat")
Convert("data.h5Seurat", dest = "h5ad")

# In Python
import scanpy as sc
adata = sc.read_h5ad("data.h5ad")

# Continue with sctrial
import sctrial as st
design = st.auto_detect_design(adata)

Can I automatically download example datasets?#

Yes! The built-in data loaders can download directly from GEO or EBI:

import sctrial as st

# Sade-Feldman melanoma immunotherapy (GSE120575)
# Requires scanpy for cell-type annotation: pip install sctrial[plots]
adata = st.load_sade_feldman(allow_download=True)

# Stephenson COVID-19 (E-MTAB-10026)
adata = st.load_stephenson_data(allow_download=True)

# Vaccine PBMC time course (GSE171964)
adata = st.load_vaccine_gse171964(allow_download=True)

# AML clinical trial (GSE116256)
# Requires scanpy: pip install sctrial[plots]
adata = st.load_aml(allow_download=True)

# CAR-T clinical trial (GSE290722)
# Requires scanpy: pip install sctrial[plots]
adata = st.load_cart(allow_download=True)

Downloads are disabled by default (allow_download=False) so nothing is fetched without explicit consent. Only missing files are downloaded; files already on disk are reused.

Note

The Sade-Feldman, AML, and CAR-T loaders perform marker-based cell-type annotation using scanpy (Leiden clustering + Wilcoxon marker scoring). Install the plots extra to enable this: pip install sctrial[plots].

Data Requirements#

What data format does sctrial require?#

An AnnData object with:

  • Raw counts in adata.layers["counts"] (recommended)

  • Trial metadata in adata.obs:

    • Participant identifiers (e.g., “participant_id”)

    • Visit labels (e.g., “visit”)

    • Treatment arm (e.g., “arm”) — set arm_col=None for single-arm studies

    • Cell-type annotations (optional, e.g., “celltype”)

How many participants do I need?#

As a rough guide:

  • ~4 paired participants per arm is a practical lower bound for estimation, though inference will be imprecise.

  • ~10–15 paired participants per arm generally provides more reliable inference.

  • ~20–30 paired participants per arm is typically needed for adequate power to detect moderate effects (d ≈ 0.5).

For smaller samples, bootstrap inference (use_bootstrap=True) is recommended as a safeguard against anti-conservative cluster-robust standard errors.

Do I need the same number of cells per participant?#

No! DiD aggregates to the participant level, so unequal cell counts are fine. The method uses OLS with cluster-robust standard errors to account for within-participant correlation.
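The aggregation step can be sketched with pandas; the column names below are illustrative, not sctrial's internal API:

```python
import pandas as pd

# Hypothetical cell-level table: one row per cell
cells = pd.DataFrame({
    "participant_id": ["P1", "P1", "P1", "P2", "P2"],
    "visit": ["V1", "V1", "V2", "V1", "V2"],
    "score": [0.1, 0.3, 0.5, 0.2, 0.6],
})

# Pseudobulk: mean per participant-visit, keeping cell counts for WLS weighting
pb = (
    cells.groupby(["participant_id", "visit"], as_index=False)
    .agg(score=("score", "mean"), n_cells=("score", "size"))
)
print(pb)
```

Unequal cell counts simply show up as different `n_cells` values; each participant-visit still contributes one row to the regression.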

Can I use sctrial with cross-sectional data (no longitudinal measurements)?#

No, DiD requires longitudinal (paired) data. For cross-sectional comparisons, use:

res = st.between_arm_comparison(
    adata,
    visit="endpoint",
    features=features,
    design=design
)

Statistical Questions#

When should I use bootstrap vs. standard errors?#

Cluster-robust standard errors (default) are often adequate when the number of participants per arm is reasonably large (e.g. ~15 or more) and sample sizes are roughly balanced.

Wild Cluster Bootstrap (use_bootstrap=True) is recommended as a safeguard when:

  • The number of participants per arm is modest (e.g. fewer than ~15)

  • Sample sizes are unbalanced across arms

  • More accurate p-values and confidence intervals are needed for publication

# Default: Cluster-robust SEs (fast, adequate for larger samples)
res = st.did_table(adata, features=features, design=design, visits=visits)

# Bootstrap: More accurate for small samples (slower)
res = st.did_table(
    adata,
    features=features,
    design=design,
    visits=visits,
    use_bootstrap=True,
    n_boot=999,  # Use 999 or 1999 for publication
    seed=42      # For reproducibility
)

Mathematical note: Bootstrap resamples at the participant level:

\[\hat{t}_b^* = \frac{\hat{\beta}^* - \hat{\beta}}{\text{SE}(\hat{\beta}^*)}\]

where \(\hat{\beta}^*\) is the bootstrap estimate.
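A minimal sketch of the wild bootstrap idea at the participant level, using a one-sample test on hypothetical change scores with Rademacher weights. This illustrates the resampling scheme only, not sctrial's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical participant-level change scores (one value per participant/cluster)
d = np.array([0.8, 0.3, 1.1, -0.2, 0.6, 0.9])
n = d.size
beta = d.mean()
se = d.std(ddof=1) / np.sqrt(n)
t_obs = beta / se  # test H0: beta = 0

# Wild bootstrap: flip cluster-level residuals with Rademacher weights
t_star = []
for _ in range(999):
    w = rng.choice([-1.0, 1.0], size=n)
    d_star = beta + w * (d - beta)  # resampled outcomes, centred on the estimate
    b_star = d_star.mean()
    se_star = d_star.std(ddof=1) / np.sqrt(n)
    t_star.append((b_star - beta) / se_star)

p_boot = (np.abs(t_star) >= abs(t_obs)).mean()
print(f"t_obs = {t_obs:.2f}, bootstrap p = {p_boot:.3f}")
```

Because the weights are drawn per participant, not per cell, the within-participant correlation structure is preserved in every resample.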

What are the minimum sample size recommendations?#

Approximate guidance (two-arm DiD):

| Scenario | Per Arm | Total Participants |
|---|---|---|
| Practical lower bound | ~4 paired | ~8 |
| Adequate for bootstrap | ~6–8 paired | ~12–16 |
| Generally reliable | ~10–15 paired | ~20–30 |
| Well-powered (d ≈ 0.5) | ~20+ paired | ~40+ |

Single-arm paired designs (e.g. vaccination, single-agent therapy):

| Scenario | Paired Participants |
|---|---|
| Practical lower bound | ~4 |
| Adequate for bootstrap | ~6–8 |
| Generally reliable | ~10–15 |
| Well-powered (d ≈ 0.5) | ~14+ |

These are rules of thumb. Actual requirements depend on effect size, variance structure, and the degree of within-participant correlation.

Power calculation example:

import sctrial as st

# How many participants for 80% power to detect d=0.5?
n_needed = st.sample_size_did(effect_size=0.5, power=0.80)
print(f"Need {n_needed} per arm ({2*n_needed} total)")

# What power do I have with 15 per arm?
power = st.power_did(n_per_group=15, effect_size=0.5)
print(f"Power with 15/arm: {power:.1%}")

# Generate power curve
ns, powers = st.power_curve(n_range=(5, 50), effect_size=0.5)

# For single-arm paired designs (e.g. vaccination studies):
n_paired = st.sample_size_paired(effect_size=0.5, power=0.80)
print(f"Need {n_paired} paired participants")

power = st.power_paired(n_participants=6, effect_size=0.8)
print(f"Power with 6 participants: {power:.1%}")

mde = st.sensitivity_paired(n_participants=6, power=0.80)
print(f"Min detectable effect: {mde:.2f} SD")

Practical considerations:

  • Single-cell data doesn’t increase statistical power (participants do)

  • Cell counts help reduce noise, but n_participants drives inference

  • Small samples → use bootstrap and report effect sizes with CIs

  • Consider mixed effects models for partial pooling with small n

How should I handle missing data?#

Types of missingness in trials:

  1. Missing visits (participant dropout)

  2. Missing cells (technical failure)

  3. Missing annotations (cell type not assigned)

For missing visits (most critical):

DiD requires data at BOTH visits. Participants with missing visits are dropped:

# Check pairing before analysis
n_paired = st.count_paired(adata.obs, "visit", ["V1", "V2"])
print(f"Paired participants: {n_paired}")

# Diagnose issues
diag = st.diagnose_trial_data(adata, design, verbose=True)

Options for missing data:

  1. Complete case analysis (default, recommended for trials)

    • Uses only participants with data at both visits

    • Valid if missingness is independent of treatment effect (MCAR/MAR)

  2. Multiple imputation (advanced)

    • Not built into sctrial; use external packages

    • Only if missingness mechanism is well-understood

  3. Last observation carried forward (NOT recommended)

    • Biases results; avoid in clinical trials

For missing cell-type annotations:

# Option 1: Exclude unassigned cells
adata = adata[adata.obs["celltype"] != "Unknown"].copy()

# Option 2: Include as separate category
# (if "Unknown" cells are biologically meaningful)

Diagnosing missingness patterns:

# Check for systematic missingness
import pandas as pd

missing = (
    adata.obs
    .groupby(["arm", "visit"])["participant_id"]
    .nunique()
)
print("Participants per arm × visit:")
print(missing.unstack())

# Check if dropout differs by arm (potential bias!)
dropout_by_arm = adata.obs.groupby("arm")["participant_id"].nunique()

What is Difference-in-Differences (DiD)?#

DiD tests whether the change from baseline to follow-up differs between treatment arms. This accounts for:

  • Baseline differences between arms

  • Secular trends affecting both arms equally

Formula: (Treated_Post - Treated_Pre) - (Control_Post - Control_Pre)

The DiD coefficient (beta_DiD) estimates the treatment effect.
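With toy numbers (participant-level means per arm and visit, invented for illustration):

```python
# Hypothetical arm-by-visit means
treated_pre, treated_post = 1.0, 1.8
control_pre, control_post = 1.1, 1.3

# DiD: treated change minus control change
beta_did = (treated_post - treated_pre) - (control_post - control_pre)
print(round(beta_did, 6))  # 0.6
```

Here the treated arm changed by 0.8 and the control arm by 0.2, so the estimated treatment effect is 0.6 — the baseline difference (1.0 vs 1.1) and the shared secular trend are both differenced away.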

Should I use cell-level or participant-level aggregation?#

Always use participant-level (aggregate="participant_visit", which is the default) for valid statistical inference.

Why? Cells from the same participant are not independent, violating regression assumptions. Participant-level aggregation (pseudobulk) respects the experimental design.

When aggregating at the participant-visit level, if an n_cells column is present, Weighted Least Squares (WLS) is used with inverse-variance weights proportional to the number of cells per group. This gives more weight to participant-visit groups with more cells, whose means are estimated with lower variance.
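A minimal numpy sketch of the WLS step with n_cells weights (illustrative only; sctrial's internal implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical participant-visit pseudobulk rows
n_cells = np.array([50.0, 200.0, 80.0, 150.0, 40.0, 120.0])
treated = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
# A mean over n cells has variance ~ sigma^2 / n, so noise shrinks with n_cells
y = 0.5 * treated + rng.normal(scale=1.0 / np.sqrt(n_cells))

# WLS closed form: beta = (X'WX)^{-1} X'Wy with W = diag(n_cells)
X = np.column_stack([np.ones_like(treated), treated])
W = np.diag(n_cells)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(f"intercept = {beta[0]:.3f}, treatment effect = {beta[1]:.3f}")
```

Groups with few cells (noisy means) are down-weighted; groups with many cells pull the fit harder.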

What are Wild Cluster Bootstrap p-values?#

A resampling method that provides more accurate p-values and confidence intervals when you have few participants (< 15). Enable with use_bootstrap=True.

It resamples at the participant level (not cell level) to respect clustering. When enabled, the output DataFrame includes additional columns:

  • p_DiD_boot: bootstrap p-value

  • se_DiD_boot: bootstrap standard error

  • ci_lo_boot / ci_hi_boot: bootstrap-t confidence interval

The CI uses the bootstrap-t method (Hall, 1992), which applies quantiles of the bootstrap t-distribution to the observed point estimate and standard error.
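Given a set of bootstrap t-draws, the interval is formed by inverting their quantiles around the observed estimate. In this sketch the t-draws are simulated stand-ins rather than real bootstrap output:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observed estimate, its SE, and bootstrap t-statistics
beta_hat, se_hat = 0.45, 0.12
t_star = rng.standard_t(df=8, size=1999)  # stand-in for bootstrap t-draws

# Bootstrap-t 95% CI: apply quantiles of t* to the observed estimate and SE
q_lo, q_hi = np.quantile(t_star, [0.025, 0.975])
ci = (beta_hat - q_hi * se_hat, beta_hat - q_lo * se_hat)
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Note the quantiles are swapped (the upper quantile sets the lower bound), which is what lets the interval capture skewness in the sampling distribution.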

How do I interpret beta_DiD?#

If standardize=True (default), beta_DiD is in standard deviation units (like Cohen’s d).

  • |beta_DiD| > 0.2: Small effect

  • |beta_DiD| > 0.5: Medium effect

  • |beta_DiD| > 0.8: Large effect

Positive beta_DiD = feature increased more (or decreased less) in treated vs. control.

For standardized effect sizes (Cohen’s d or Hedges’ g), use add_effect_sizes_to_did():

res = st.did_table(adata, features=features, design=design, visits=("V1", "V2"))
res = st.add_effect_sizes_to_did(res)

# Columns added: effect_size, effect_size_lower, effect_size_upper,
#                effect_size_interpretation
print(res[["feature", "beta_DiD", "effect_size", "effect_size_interpretation"]])

What FDR threshold should I use?#

  • Exploratory: FDR < 0.10 or 0.25

  • Standard: FDR < 0.05

  • Stringent: FDR < 0.01

Always report FDR-corrected p-values, not raw p-values, when testing multiple features.
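The FDR correction here is Benjamini–Hochberg; a self-contained sketch of the adjustment (statsmodels' multipletests with method="fdr_bh" produces the same values):

```python
import numpy as np

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
print(bh_fdr(pvals))
```

At FDR < 0.05 the first two features pass, whereas a raw-p threshold of 0.05 would also (incorrectly, in multiplicity terms) admit the next two.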

How do I use Bayesian DiD?#

Bayesian DiD provides posterior distributions and credible intervals. Requires PyMC (pip install sctrial[bayes]):

res = st.did_table_bayes(
    adata, features=features, design=design,
    visits=("V1", "V2"),
    prior_scale=1.0,    # Scale for coefficient priors
    sigma_scale=1.0,    # Scale for residual SD prior
    draws=2000,
    chains=4,
)

Run a prior predictive check first to ensure priors are reasonable:

check = st.prior_predictive_check(
    adata, features=features, design=design,
    visits=("V1", "V2"),
)
print(f"Priors cover data: {check['prior_covers_data']}")

What is the difference between FDR_DiD and FDR_DiD_global?#

In module_score_did_by_pool(), when fdr_within is set:

  • FDR_DiD: Corrected within each group (e.g., per module). Useful for exploratory analysis but does not control the overall false discovery rate.

  • FDR_DiD_global: Corrected across all tests globally. Use this for properly controlled inference.

res = st.module_score_did_by_pool(
    pb, design=design, visits=("V1", "V2"),
    fdr_within="module",
    fdr_global=True,  # default
)

# For publication: use global FDR
sig = res[res["FDR_DiD_global"] < 0.05]

Statistical Assumptions & Limitations#

What assumptions does DiD require?#

DiD relies on several key assumptions for causal interpretation:

  1. Parallel trends: In the absence of treatment, both arms would have followed the same trajectory. This cannot be tested directly with two time points, but can be assessed with pre-treatment data using test_parallel_trends() in multi-timepoint designs.

  2. No anticipation: Participants do not change behaviour before treatment begins. Violated if, e.g., treated participants alter expression profiles at the baseline visit in anticipation of therapy.

  3. SUTVA (Stable Unit Treatment Value Assumption): No interference or spillover between participants. Each participant’s outcome depends only on their own treatment assignment.

  4. No post-treatment confounding: Covariates included in the model should be pre-treatment (baseline) variables. Conditioning on post-treatment variables can introduce bias (see Angrist & Pischke, 2009).

These assumptions are inherent to the DiD framework and cannot be verified by the software alone. Users should assess their plausibility for each study design. Sensitivity analysis (e_value_rr()) can quantify how strong an unmeasured confounder would need to be to explain away the observed effect.
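The E-value has a closed form (VanderWeele & Ding, 2017). The sketch below mirrors that standard formula and is not necessarily the exact behaviour of e_value_rr():

```python
import math

def e_value(rr):
    """E-value for a risk ratio: the minimum strength of association an
    unmeasured confounder would need with both treatment and outcome to
    fully explain away the observed effect."""
    rr = 1.0 / rr if rr < 1.0 else rr  # protective effects: invert first
    return rr + math.sqrt(rr * (rr - 1.0))

print(f"E-value for RR=2.0: {e_value(2.0):.2f}")  # 3.41
```

An E-value of 3.41 means a confounder would need risk ratios of at least 3.41 with both treatment and outcome to reduce the observed RR of 2.0 to the null.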

Are there limitations to abundance analysis?#

abundance_did() uses an arcsin-square-root transform on cell-type proportions, which is a standard variance-stabilising transformation for compositional data (Shao, 1999). This approach:

  • Is simple, interpretable, and well-suited to the DiD framework

  • Handles zero proportions gracefully (arcsin(0) = 0)

  • Has been used in single-cell clinical trial analyses (Squair et al., 2021)

However, proportions are inherently compositional: an increase in one cell type forces decreases in others. Fully compositional approaches (e.g., Bayesian multinomial-logit models, scCODA; Büttner et al., 2021) explicitly model this constraint. Consider these alternatives when compositional effects are a primary concern and sample sizes are sufficient.
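The transform itself is one line of numpy (proportions here are invented for illustration):

```python
import numpy as np

# Hypothetical cell-type proportions per participant-visit
props = np.array([0.00, 0.05, 0.20, 0.50, 1.00])

# Variance-stabilising transform for proportions; arcsin(sqrt(0)) = 0
transformed = np.arcsin(np.sqrt(props))
print(np.round(transformed, 3))
```

The transform maps [0, 1] onto [0, π/2] and stretches the extremes, where the variance of a raw proportion is smallest.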

What are the limitations of survival analysis?#

hazard_regression_with_features() fits per-feature Cox proportional hazards models. Key caveats:

  • Proportional hazards assumption: The hazard ratio for each feature is assumed constant over time. Violated when treatment effects emerge or wane. Assess with Schoenfeld residuals (external to sctrial).

  • Separation: When a feature perfectly predicts survival, the Cox model can fail to converge or produce infinite hazard ratios. sctrial detects these cases and returns NaN rather than misleading estimates.

  • Multiple testing: Testing many features inflates false positives. Always use the FDR-corrected column (FDR) rather than raw p-values.

  • Not a causal model: Unlike DiD, Cox regression on observational features estimates associations, not causal effects, unless combined with an appropriate causal framework.

Can I use categorical covariates in Bayesian DiD?#

Yes. did_table_bayes() automatically dummy-encodes categorical columns (e.g., sex, batch, site) using one-hot encoding with the first category dropped to avoid multicollinearity with the intercept. You can pass categorical columns directly in the covariates parameter:

res = st.did_table_bayes(
    adata, features, design, visits=("V1", "V2"),
    covariates=["sex", "site"],  # categorical columns are auto-encoded
)

This follows standard Bayesian practice for including indicator variables in linear models (Gelman & Hill, 2006).
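The encoding is equivalent to pandas get_dummies with drop_first=True (covariate values below are invented):

```python
import pandas as pd

# Hypothetical participant covariates with categorical columns
obs = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "site": ["A", "B", "C", "A"],
})

# One-hot encode, dropping the first level of each column to avoid
# collinearity with the model intercept
X = pd.get_dummies(obs, columns=["sex", "site"], drop_first=True)
print(list(X.columns))  # ['sex_M', 'site_B', 'site_C']
```

The dropped level ("F", site "A") becomes the reference category absorbed into the intercept.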

Analysis Questions#

Can I test for batch effects?#

Include batch as a covariate:

res = st.did_table(
    adata,
    features=features,
    design=design,
    visits=("V1", "V2"),
    covariates=["batch"]
)

Or use ComBat for batch correction before analysis:

import scanpy as sc
sc.pp.combat(adata, key='batch')

How do I handle multiple time points?#

DiD is designed for 2 time points. For more:

  1. Primary analysis: Compare baseline to primary endpoint

res = st.did_table(adata, features=features, design=design, visits=("Day_0", "Week_12"))

  2. Secondary analyses: Test other visit pairs

# Early response
res_early = st.did_table(adata, features=features, design=design, visits=("Day_0", "Week_2"))

# Late response
res_late = st.did_table(adata, features=features, design=design, visits=("Week_2", "Week_12"))

Can I do dose-response analysis?#

Not directly with DiD. Options:

  1. Pairwise comparisons: Compare each dose to control separately

  2. Treat dose as numeric covariate: Include in covariates parameter

  3. Post-hoc trend test: Use external tools on DiD estimates
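For option 3, a post-hoc trend test can be as simple as regressing the per-dose DiD estimates on dose. The numbers below are toy values, not real estimates:

```python
from scipy.stats import linregress

# Hypothetical DiD estimates at increasing doses (each vs. a shared control)
dose = [0, 10, 30, 100]
beta_did = [0.00, 0.12, 0.31, 0.55]

# Linear trend in effect size across dose levels
trend = linregress(dose, beta_did)
print(f"slope = {trend.slope:.4f}, p = {trend.pvalue:.3f}")
```

A significant positive slope is consistent with a monotone dose-response; for a more rigorous analysis, weight each estimate by its precision or fit the dose term inside the DiD model itself.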

What if I have crossover design?#

Specify the crossover column in TrialDesign:

design = st.TrialDesign(
    participant_col="participant_id",
    visit_col="visit",
    arm_col="arm",
    arm_treated="Treatment",
    arm_control="Control",
    crossover_col="crossed_over"  # Boolean column
)

# Exclude crossovers from primary analysis
res = st.did_table(
    adata,
    features=features,
    design=design,
    visits=("V1", "V2"),
    exclude_crossovers=True  # Recommended
)

How do I test for cell-type-specific effects?#

res = st.did_table_by_celltype(
    adata,
    features=features,
    design=design,
    visits=("V1", "V2")
)

# Results stratified by cell type
print(res.groupby("celltype")["FDR_DiD"].min())

How do I harmonize response labels across datasets?#

Use harmonize_response() to standardize different response label conventions:

import sctrial as st

adata = st.harmonize_response(adata)
# Maps R/NR, Responder/Non-responder, etc. → "Responder"/"Non-responder"
print(adata.obs["response_harmonized"].value_counts())

Feature and Gene Set Questions#

Can I test individual genes or do I need gene sets?#

Both work:

Gene sets (recommended for robustness):

gene_sets = {"OXPHOS": ["COX7A1", "ATP5F1A", ...]}
adata = st.score_gene_sets(adata, gene_sets, prefix="ms_")
features = ["ms_OXPHOS"]

Individual genes:

features = ["CD3D", "CD8A", "IL2RA"]

For genome-wide tests, use stringent FDR thresholds (e.g., 0.01).

What if some genes in my gene set are missing?#

score_gene_sets() automatically handles this:

  • Uses only genes present in adata.var_names; missing genes are dropped from the set.

  • A message is logged at INFO level when any genes are missing, and at WARNING level when the overlap falls below min_genes.

  • By default, min_genes=3 — at least 3 genes must overlap to compute a score. If fewer are found, the score is set to NaN.

  • Duplicate gene names within a set are automatically removed.

gene_sets = {"MySet": ["GENE1", "GENE2_MISSING", "GENE3"]}
# Will score using GENE1 and GENE3, skip GENE2_MISSING
# (provided min_genes <= 2; default min_genes=3 would yield NaN here)

Should I use mean or zmean for gene set scoring?#

zmean (recommended):

  • Z-normalizes each gene before averaging

  • Robust to genes with different expression levels

  • Better for cross-dataset comparisons

mean:

  • Simple average

  • Use when all genes have similar dynamic range

# Recommended
adata = st.score_gene_sets(adata, gene_sets, method="zmean")
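The difference between the two methods shows up clearly on synthetic data where one gene has a much larger dynamic range (illustrative sketch, not sctrial's implementation):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical cells x genes matrix for a 3-gene set with very different scales
expr = rng.normal(loc=[1.0, 10.0, 100.0], scale=[0.5, 2.0, 20.0], size=(200, 3))

# zmean: z-score each gene across cells first, so no single gene dominates
z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
score_zmean = z.mean(axis=1)

# plain mean: the high-expression gene drives almost all of the variation
score_mean = expr.mean(axis=1)

print(score_zmean[:3].round(3), score_mean[:3].round(1))
```

Under zmean each gene contributes on a comparable scale; under mean, the third gene (expression ~100 ± 20) swamps the other two.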

How do I incorporate pathway databases?#

Use with GSEA integration:

# Run DiD-aware GSEA
res = st.run_gsea_did(
    adata,
    gene_sets="KEGG_2021_Human",  # Or MSigDB, GO, etc.
    design=design,
    visits=("V1", "V2"),
    rank_by="signed_confidence"
)

# Or custom gene sets
from gseapy import get_library_name
print(get_library_name())  # See available databases

Technical Questions#

Why am I getting NaN results?#

Common causes:

  1. Too few participants (< 4 paired)

  2. Zero variance in feature

  3. Missing data in required columns

  4. Wrong visit labels

Run diagnostics:

diag = st.diagnose_trial_data(adata, design, verbose=True)

Can I run sctrial in parallel?#

Yes! Use did_table_parallel for parallel DiD across many features:

res = st.did_table_parallel(
    adata,
    features=list(adata.var_names),
    design=design,
    visits=("V1", "V2"),
    n_jobs=4,  # Number of parallel workers
)

For smaller gene sets, the standard did_table is often fast enough. You can also process in batches manually:

import pandas as pd

features = list(adata.var_names)
batch_size = 1000

results = []
for i in range(0, len(features), batch_size):
    batch = features[i:i+batch_size]
    res = st.did_table(adata, features=batch, design=design, visits=("V1", "V2"))
    results.append(res)

full_res = pd.concat(results)

How do I save and reload results?#

# Save results
res.to_csv("did_results.csv", index=False)
adata.write_h5ad("processed_adata.h5ad")

# Reload
import pandas as pd
import scanpy as sc

res = pd.read_csv("did_results.csv")
adata = sc.read_h5ad("processed_adata.h5ad")

Can I use sctrial with other omics (ATAC-seq, proteomics)?#

Yes! As long as your data is in AnnData format with appropriate trial metadata.

For ATAC-seq: use peak accessibility as features.

For proteomics: use protein abundance as features.

The statistical framework is the same.

Integration Questions#

Can I combine sctrial with Scanpy?#

Yes! Use Scanpy for QC, normalization, and clustering:

import scanpy as sc
import sctrial as st

# Scanpy workflow
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.leiden(adata)

# sctrial workflow
design = st.TrialDesign(...)
res = st.did_table(adata, features=..., design=design, visits=("V1", "V2"))

How do I visualize results in UMAP space?#

import scanpy as sc

# Compute UMAP (if not already done)
sc.tl.umap(adata)

# Visualize by trial design
st.plot_trial_umap_panel(
    adata,
    feature="ms_OXPHOS",
    design=design,
    visits=("V1", "V2")
)

Still have questions?#