Utilities and Validation#
Utilities#
Shared internal utilities: logging, formatting, helpers.
- sctrial.utils.ensure_unique_index(df: DataFrame, *, agg: str = 'mean') → DataFrame[source]#
If df.index has duplicates, aggregate them and return a new DataFrame.
- Parameters:
df – DataFrame to ensure a unique index for.
agg – Aggregation method: "mean" or "sum" (extend later if needed).
- Returns:
A DataFrame with unique index.
- Return type:
pd.DataFrame
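The deduplication logic can be sketched with plain pandas. This is a minimal illustration of the documented behavior, not the package's actual implementation, and `ensure_unique_index_sketch` is a hypothetical name:

```python
import pandas as pd

def ensure_unique_index_sketch(df: pd.DataFrame, agg: str = "mean") -> pd.DataFrame:
    # Nothing to do when the index is already unique.
    if df.index.is_unique:
        return df
    if agg not in ("mean", "sum"):
        raise ValueError(f"Unsupported agg: {agg!r}")
    # Collapse rows sharing an index label with the chosen aggregation.
    return df.groupby(level=0).agg(agg)

df = pd.DataFrame({"x": [1.0, 3.0, 5.0]}, index=["a", "a", "b"])
out = ensure_unique_index_sketch(df, agg="mean")
# "a" rows (1.0, 3.0) collapse to 2.0; "b" stays 5.0
```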
- sctrial.utils.get_counts_matrix(adata: AnnData) → tuple[np.ndarray | None, str | None][source]#
Return a raw-counts matrix and its source label, if available.
- Parameters:
adata – AnnData object.
- Returns:
A tuple containing the counts matrix and its source label.
- The counts matrix is the raw counts matrix.
- The source label is the layer name where the counts matrix is stored.
- If no counts matrix is found, returns (None, None).
- Return type:
tuple[np.ndarray | None, str | None]
- sctrial.utils.intersect_preserve_order(items: Sequence[str], universe: Iterable[str]) → list[str][source]#
Return items that appear in universe, preserving original order.
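The semantics fit in a few lines (an illustrative sketch under a hypothetical name, not the package's code):

```python
from collections.abc import Iterable, Sequence

def intersect_preserve_order_sketch(items: Sequence[str], universe: Iterable[str]) -> list[str]:
    # Materialize universe into a set for O(1) membership tests,
    # then keep matching items in their original order.
    allowed = set(universe)
    return [x for x in items if x in allowed]

result = intersect_preserve_order_sketch(["b", "a", "z"], {"a", "b", "c"})
# result == ["b", "a"]  (order of items, not of universe)
```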
- sctrial.utils.looks_like_counts(X, sample: int = 10000, seed: int = 0) → bool[source]#
Check if matrix appears to be raw counts.
- Parameters:
X – Matrix to check.
sample – Number of samples to check.
seed – Random seed.
- Returns:
True if matrix appears to be raw counts, False otherwise.
- Return type:
bool
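A common heuristic for this check is "non-negative and integer-valued on a random sample of entries". The sketch below illustrates that idea on dense arrays only; the package's implementation may differ (e.g. sparse-matrix handling, additional heuristics):

```python
import numpy as np

def looks_like_counts_sketch(X, sample: int = 10000, seed: int = 0) -> bool:
    rng = np.random.default_rng(seed)
    flat = np.asarray(X, dtype=float).ravel()
    # Subsample large matrices so the check stays cheap.
    if flat.size > sample:
        flat = rng.choice(flat, size=sample, replace=False)
    # Raw counts should be non-negative and integer-valued.
    return bool(np.all(flat >= 0) and np.allclose(flat, np.round(flat)))

a = looks_like_counts_sketch(np.array([[0, 2, 5], [1, 0, 3]]))   # True
b = looks_like_counts_sketch(np.array([[0.3, 1.7], [2.1, 0.0]]))  # False
```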
- sctrial.utils.permutation_pvalue(group1: ndarray, group2: ndarray, n_perm: int = 10000, seed: int = 42) → float[source]#
Two-sample permutation test for a difference in means.
- Parameters:
group1 – First group of values.
group2 – Second group of values.
n_perm – Number of permutations.
seed – Random seed.
- Returns:
Two-sided permutation p-value in [0, 1].
- Return type:
float
Notes
H0: mean(group1) = mean(group2)
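The test described above can be sketched as a generic two-sample permutation test (with an add-one correction to keep the p-value positive); the package's implementation may differ in details:

```python
import numpy as np

def permutation_pvalue_sketch(group1, group2, n_perm: int = 10000, seed: int = 42) -> float:
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([np.asarray(group1, float), np.asarray(group2, float)])
    n1 = len(group1)
    obs = abs(np.mean(group1) - np.mean(group2))
    count = 0
    for _ in range(n_perm):
        # Under H0 the group labels are exchangeable: shuffle and re-split.
        rng.shuffle(pooled)
        stat = abs(pooled[:n1].mean() - pooled[n1:].mean())
        if stat >= obs:
            count += 1
    # Add-one correction keeps the p-value in (0, 1].
    return (count + 1) / (n_perm + 1)

# Clearly separated groups should give a small p-value.
p = permutation_pvalue_sketch(np.zeros(10), np.full(10, 10.0), n_perm=999, seed=1)
```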
- sctrial.utils.permutation_pvalue_paired(x: ndarray, y: ndarray, n_perm: int = 10000, seed: int = 42) → float[source]#
Paired permutation test (sign-flip test) for difference in means.
- Parameters:
x – First group of values.
y – Second group of values.
n_perm – Number of permutations.
seed – Random seed.
- Returns:
Two-sided permutation p-value in [0, 1].
- Return type:
float
Notes
H0: mean(y - x) = 0
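The sign-flip idea can be sketched as follows: under H0 each paired difference is symmetric about zero, so randomly flipping its sign yields the null distribution. Again an illustration, not the package's code:

```python
import numpy as np

def permutation_pvalue_paired_sketch(x, y, n_perm: int = 10000, seed: int = 42) -> float:
    rng = np.random.default_rng(seed)
    d = np.asarray(y, float) - np.asarray(x, float)
    obs = abs(d.mean())
    # Draw a ±1 sign for every paired difference in every permutation.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    stats = np.abs((signs * d).mean(axis=1))
    # Add-one correction keeps the p-value in (0, 1].
    return float((np.sum(stats >= obs) + 1) / (n_perm + 1))

# A consistent within-pair shift should give a small p-value.
p = permutation_pvalue_paired_sketch(np.zeros(10), np.full(10, 5.0), n_perm=999, seed=1)
```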
- sctrial.utils.resolve_feature(adata: AnnData, query: str) → str[source]#
Resolve a feature name in adata.var_names or adata.obs.columns (case-insensitive).
- Parameters:
adata – AnnData object.
query – Feature name to resolve.
- Returns:
The exact name string to use.
- Return type:
str
- Raises:
KeyError – If the feature is not found in adata.var_names or adata.obs.columns.
ValueError – If the feature query is an empty string.
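The case-insensitive resolution plus the two error cases can be sketched without AnnData by passing the two name collections directly (hypothetical helper, not the package's signature):

```python
def resolve_feature_sketch(var_names, obs_columns, query: str) -> str:
    # Empty queries are rejected up front.
    if not query:
        raise ValueError("Feature query must be a non-empty string")
    # Case-insensitive lookup across gene names and obs columns,
    # returning the exact stored spelling.
    lookup = {name.lower(): name for name in [*var_names, *obs_columns]}
    if query.lower() not in lookup:
        raise KeyError(f"Feature not found: {query!r}")
    return lookup[query.lower()]

resolve_feature_sketch(["CD3E", "CD8A"], ["age", "bmi"], "cd3e")
# -> "CD3E"
```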
- sctrial.utils.safe_filename(s: str, maxlen: int = 180) → str[source]#
Return a filesystem-safe filename slug.
- Parameters:
s – Input string to sanitize.
maxlen – Maximum length of the output string.
- Returns:
A filesystem-safe filename.
- Return type:
str
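One common slug recipe is shown below; the exact character set kept by safe_filename may differ, so treat this as a sketch of the idea:

```python
import re

def safe_filename_sketch(s: str, maxlen: int = 180) -> str:
    # Replace runs of characters outside [A-Za-z0-9._-] with a single "_",
    # trim leading/trailing underscores, and cap the length.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", s).strip("_")
    return slug[:maxlen]

safe_filename_sketch("CD4+ T cells / visit 2")
# -> "CD4_T_cells_visit_2"
```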
- sctrial.utils.wild_cluster_bootstrap_t(fit: RegressionResultsWrapper, X: ndarray, clusters: ndarray, term_name: str, B: int = 999, seed: int = 42, cov_type: str = 'cluster', ci_level: float = 0.95) → BootstrapResult[source]#
Wild cluster bootstrap (Rademacher) for one coefficient.
Notes
Implements a wild cluster bootstrap-t using Rademacher weights at the cluster level. This is recommended when the number of clusters is small and standard cluster-robust inference may be unreliable.
Each bootstrap draw perturbs the restricted residuals (imposing H0: beta_j = 0) with cluster-level Rademacher weights (±1 with equal probability), re-fits the full model via OLS (or WLS when the original fit used weights) with per-iteration cluster-robust SE, and forms a bootstrap t-statistic. The two-sided p-value is the fraction of bootstrap |t*| values that exceed the observed |t|.
Bootstrap confidence intervals use the bootstrap-t (studentized) method: quantiles of the bootstrap t-distribution are applied to the observed point estimate and SE, yielding asymmetry-respecting CIs that are approximately consistent with the bootstrap p-value (Hall, 1992). Note: this is not exact test-inversion; a full inversion CI would require re-running the bootstrap at every candidate null, which is computationally prohibitive. The bootstrap-t is the standard practical approach recommended by Cameron et al. (2008).
Reference: Cameron, A.C., Gelbach, J.B., & Miller, D.L. (2008). Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics, 90(3), 414–427.
- Parameters:
fit – Statsmodels regression results (with cluster-robust SE).
X – Design matrix (fit.model.exog).
clusters – Array of cluster IDs.
term_name – Name of the coefficient to test.
B – Number of bootstrap draws.
seed – Random seed.
cov_type – Covariance type for bootstrap refits. Default "cluster" uses cluster-robust SE (Cameron et al. 2008). Use "nonrobust" when participant fixed effects already absorb within-cluster correlation (e.g. participant_visit aggregation with 2 obs per cluster).
ci_level – Confidence level for the bootstrap-t CI (default 0.95 → 95% CI).
- Returns:
Named tuple with p_boot, se_boot, ci_lo, ci_hi, and boot_distribution.
- Return type:
BootstrapResult
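The core of the procedure described in the Notes can be sketched in plain NumPy. This is a deliberately simplified illustration (plain OLS, CR0 cluster-robust variance, no WLS weights, and no CI construction); the package's version operates on a statsmodels fit and also returns bootstrap-t intervals:

```python
import numpy as np

def wild_cluster_bootstrap_t_sketch(y, X, clusters, j, B=999, seed=42):
    """Bootstrap-t p-value for H0: beta_j = 0 (plain OLS, CR0 variance)."""
    rng = np.random.default_rng(seed)
    ids = np.unique(clusters)
    k = X.shape[1]

    def ols(yv):
        beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
        resid = yv - X @ beta
        bread = np.linalg.inv(X.T @ X)
        # Cluster-robust "meat": sum of per-cluster score outer products.
        meat = np.zeros((k, k))
        for g in ids:
            s = X[clusters == g].T @ resid[clusters == g]
            meat += np.outer(s, s)
        V = bread @ meat @ bread
        return beta, resid, np.sqrt(V[j, j])

    beta, _, se = ols(y)
    t_obs = beta[j] / se

    # Restricted fit imposing the null beta_j = 0.
    X0 = np.delete(X, j, axis=1)
    b0, *_ = np.linalg.lstsq(X0, y, rcond=None)
    u0 = y - X0 @ b0
    fitted0 = X0 @ b0

    obs_to_cluster = np.searchsorted(ids, clusters)
    t_star = np.empty(B)
    for b in range(B):
        # Rademacher weights drawn once per cluster, applied to all its obs.
        w = rng.choice([-1.0, 1.0], size=ids.size)
        y_star = fitted0 + u0 * w[obs_to_cluster]
        beta_b, _, se_b = ols(y_star)
        t_star[b] = beta_b[j] / se_b  # centered at 0: the null was imposed
    return float(np.mean(np.abs(t_star) >= abs(t_obs)))

# Synthetic example: 6 small clusters, strong true effect on x.
rng = np.random.default_rng(0)
clusters = np.repeat(np.arange(6), 5)
x = rng.normal(size=30)
X = np.column_stack([np.ones(30), x])
y = 1.0 + 5.0 * x + rng.normal(scale=0.1, size=30)
p = wild_cluster_bootstrap_t_sketch(y, X, clusters, j=1, B=199, seed=42)
```

Note that with few clusters the bootstrap t-distribution has only 2^G distinct weight patterns, which is exactly the small-G regime where this method is recommended over standard cluster-robust inference.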
- class sctrial.utils.BootstrapResult(p_boot: float, se_boot: float, ci_lo: float, ci_hi: float, boot_distribution: np.ndarray)[source]#
Result of a wild cluster bootstrap procedure.
- p_boot
Two-sided bootstrap p-value.
- Type:
float
- se_boot
Bootstrap standard error (SD of bootstrap coefficient distribution).
- Type:
float
- ci_lo
Lower bound of the bootstrap-t confidence interval.
- Type:
float
- ci_hi
Upper bound of the bootstrap-t confidence interval.
- Type:
float
- boot_distribution
Array of bootstrap coefficient estimates (valid draws only).
- Type:
np.ndarray
- boot_distribution: ndarray
Alias for field number 4
- ci_hi: float
Alias for field number 3
- ci_lo: float
Alias for field number 2
- p_boot: float
Alias for field number 0
- se_boot: float
Alias for field number 1
Validation#
Data validation utilities for trial analysis.
- class sctrial.validation.TrialDataValidator[source]#
Bases: object
Comprehensive validation for trial analysis data.
- static validate_adata(adata: AnnData, design: TrialDesign, strict: bool = False) → list[str][source]#
Validate AnnData object for trial analysis.
- Parameters:
adata – AnnData object to validate.
design – TrialDesign specifying column names.
strict – If True, raises exceptions. If False, returns warnings.
- Returns:
List of warning/error messages.
- Return type:
list[str]
Examples
>>> validator = TrialDataValidator()
>>> issues = validator.validate_adata(adata, design, strict=False)
>>> if issues:
...     print(f"Found {len(issues)} issues:")
...     for issue in issues:
...         print(f"  - {issue}")
- static validate_features(adata: AnnData, features: Sequence[str], allow_missing: bool = False) → tuple[list[str], list[str]][source]#
Validate feature names.
- Parameters:
adata – AnnData object.
features – List of feature names to validate.
allow_missing – If False, raises error for missing features.
- Returns:
Tuple of (valid_features, missing_features).
- Return type:
tuple[list[str], list[str]]
Examples
>>> valid, missing = TrialDataValidator.validate_features(
...     adata, ["Gene1", "Gene2", "NonExistent"]
... )
>>> print(f"Valid: {valid}, Missing: {missing}")
- sctrial.validation.check_covariate_balance(adata: AnnData, design: TrialDesign, covariates: Sequence[str], *, visit: str | None = None, dropna: bool = True, smd_threshold: float = 0.1) → DataFrame[source]#
Compute standardized mean differences (SMD) for baseline covariates.
This compares treated vs control arms at a single visit (usually baseline) and reports SMD values that quantify imbalance.
- Parameters:
adata – AnnData object with trial metadata in adata.obs.
design – TrialDesign describing participant, visit, and arm columns.
covariates – List of covariate column names in adata.obs.
visit – Visit label to use for balance checks. If None, uses design.baseline_visit; raises if that is not available.
dropna – If True, drop rows with missing covariate values for each covariate.
smd_threshold – Absolute SMD threshold for the “balanced” flag (default 0.1).
- Returns:
Table with SMD values. Numeric covariates produce one row per covariate. Categorical covariates produce one row per level with proportions.
- Return type:
pd.DataFrame
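For a numeric covariate the SMD is the arm-mean difference divided by a pooled standard deviation, and a covariate is flagged balanced when |SMD| falls below smd_threshold. A minimal sketch (Cohen's-d-style pooling of the two arm variances; the package's exact pooling may differ):

```python
import numpy as np

def smd_numeric_sketch(treated, control) -> float:
    # Standardized mean difference with a pooled standard deviation.
    m1, m0 = np.mean(treated), np.mean(control)
    v1, v0 = np.var(treated, ddof=1), np.var(control, ddof=1)
    return float((m1 - m0) / np.sqrt((v1 + v0) / 2.0))

smd = smd_numeric_sketch([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
# Means differ by 1 and the pooled SD is 1, so smd == 1.0
# (far above the default 0.1 threshold: flagged as imbalanced).
```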
- sctrial.validation.diagnose_trial_data(adata: AnnData, design: TrialDesign, verbose: bool = True) → dict[str, Any][source]#
Comprehensive diagnostic report for trial data.
- Parameters:
adata – AnnData object to diagnose.
design – TrialDesign object.
verbose – If True, prints diagnostic report.
- Returns:
Diagnostic summary with keys including:
- n_cells, n_genes, n_participants, n_visits, n_arms
- paired_participants (dict of visit pairs -> counts)
- cells_per_participant (pd.Series)
- warnings (list of strings)
- recommendations (list of strings)
- Return type:
dict[str, Any]
Examples
>>> diagnostics = diagnose_trial_data(adata, design, verbose=True)
>>> if diagnostics['warnings']:
...     print("Warnings found:")
...     for w in diagnostics['warnings']:
...         print(f"  - {w}")
- sctrial.validation.validate_adata(adata: AnnData, design: TrialDesign, strict: bool = False) → list[str][source]#
Validate AnnData object for trial analysis.
Convenience wrapper around TrialDataValidator.validate_adata().
- Parameters:
adata – AnnData object to validate.
design – TrialDesign specifying column names.
strict – If True, raises exceptions. If False, returns warnings.
- Returns:
List of warning/error messages.
- Return type:
list[str]
- sctrial.validation.validate_features(adata: AnnData, features: Sequence[str], allow_missing: bool = False) → tuple[list[str], list[str]][source]#
Validate feature names.
Convenience wrapper around TrialDataValidator.validate_features().
- Parameters:
adata – AnnData object.
features – List of feature names to validate.
allow_missing – If False, raises error for missing features.
- Returns:
Tuple of (valid_features, missing_features).
- Return type:
tuple[list[str], list[str]]