Utilities and Validation#
Utilities#
Shared internal utilities: logging, formatting, helpers.
- sctrial.utils.ensure_unique_index(df: DataFrame, *, agg: str = 'mean') → DataFrame[source]#
If df.index has duplicates, aggregate them and return a new DataFrame.
- Parameters:
df – DataFrame to ensure a unique index for.
agg – Aggregation method: "mean" or "sum" (extend later if needed).
- Returns:
A DataFrame with unique index.
- Return type:
pd.DataFrame
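The deduplication logic can be sketched with plain pandas. This is a minimal illustration of the documented behavior, not the package's actual implementation, and `ensure_unique_index_sketch` is a hypothetical name:

```python
import pandas as pd

def ensure_unique_index_sketch(df: pd.DataFrame, agg: str = "mean") -> pd.DataFrame:
    # Nothing to do when the index is already unique.
    if df.index.is_unique:
        return df
    if agg not in ("mean", "sum"):
        raise ValueError(f"Unsupported agg: {agg!r}")
    # Collapse rows sharing an index label with the chosen aggregation.
    return df.groupby(level=0).agg(agg)

df = pd.DataFrame({"x": [1.0, 3.0, 5.0]}, index=["a", "a", "b"])
out = ensure_unique_index_sketch(df, agg="mean")
# "a" rows (1.0, 3.0) collapse to 2.0; "b" stays 5.0
```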
- sctrial.utils.get_counts_matrix(adata: AnnData) → tuple[np.ndarray | None, str | None][source]#
Return a raw-counts matrix and its source label, if available.
- Parameters:
adata – AnnData object.
- Returns:
A tuple containing the counts matrix and its source label.
- The counts matrix is the raw counts matrix.
- The source label is the layer name where the counts matrix is stored.
- If no counts matrix is found, returns (None, None).
- Return type:
tuple[np.ndarray | None, str | None]
- sctrial.utils.intersect_preserve_order(items: Sequence[str], universe: Iterable[str]) → list[str][source]#
Return items that appear in universe, preserving original order.
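The semantics fit in a few lines (an illustrative sketch under a hypothetical name, not the package's code):

```python
from collections.abc import Iterable, Sequence

def intersect_preserve_order_sketch(items: Sequence[str], universe: Iterable[str]) -> list[str]:
    # Materialize universe into a set for O(1) membership tests,
    # then keep matching items in their original order.
    allowed = set(universe)
    return [x for x in items if x in allowed]

result = intersect_preserve_order_sketch(["b", "a", "z"], {"a", "b", "c"})
# result == ["b", "a"]  (order of items, not of universe)
```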
- sctrial.utils.looks_like_counts(X, sample: int = 10000, seed: int = 0) → bool[source]#
Check if matrix appears to be raw counts.
- Parameters:
X – Matrix to check.
sample – Number of samples to check.
seed – Random seed.
- Returns:
True if matrix appears to be raw counts, False otherwise.
- Return type:
bool
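A common heuristic for this check is "non-negative and integer-valued on a random sample of entries". The sketch below illustrates that idea on dense arrays only; the package's implementation may differ (e.g. sparse-matrix handling, additional heuristics):

```python
import numpy as np

def looks_like_counts_sketch(X, sample: int = 10000, seed: int = 0) -> bool:
    rng = np.random.default_rng(seed)
    flat = np.asarray(X, dtype=float).ravel()
    # Subsample large matrices so the check stays cheap.
    if flat.size > sample:
        flat = rng.choice(flat, size=sample, replace=False)
    # Raw counts should be non-negative and integer-valued.
    return bool(np.all(flat >= 0) and np.allclose(flat, np.round(flat)))

a = looks_like_counts_sketch(np.array([[0, 2, 5], [1, 0, 3]]))   # True
b = looks_like_counts_sketch(np.array([[0.3, 1.7], [2.1, 0.0]]))  # False
```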
- sctrial.utils.permutation_pvalue(group1: ndarray, group2: ndarray, n_perm: int = 10000, seed: int = 42) → float[source]#
Two-sample permutation test for a difference in means.
- Parameters:
group1 – First group of values.
group2 – Second group of values.
n_perm – Number of permutations.
seed – Random seed.
- Returns:
Two-sided permutation p-value in [0, 1].
- Return type:
float
Notes
H0: mean(group1) = mean(group2)
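The test described above can be sketched as a generic two-sample permutation test (with an add-one correction to keep the p-value positive); the package's implementation may differ in details:

```python
import numpy as np

def permutation_pvalue_sketch(group1, group2, n_perm: int = 10000, seed: int = 42) -> float:
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([np.asarray(group1, float), np.asarray(group2, float)])
    n1 = len(group1)
    obs = abs(np.mean(group1) - np.mean(group2))
    count = 0
    for _ in range(n_perm):
        # Under H0 the group labels are exchangeable: shuffle and re-split.
        rng.shuffle(pooled)
        stat = abs(pooled[:n1].mean() - pooled[n1:].mean())
        if stat >= obs:
            count += 1
    # Add-one correction keeps the p-value in (0, 1].
    return (count + 1) / (n_perm + 1)

# Clearly separated groups should give a small p-value.
p = permutation_pvalue_sketch(np.zeros(10), np.full(10, 10.0), n_perm=999, seed=1)
```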
- sctrial.utils.permutation_pvalue_paired(x: ndarray, y: ndarray, n_perm: int = 10000, seed: int = 42) → float[source]#
Paired permutation test (sign-flip test) for difference in means.
- Parameters:
x – First group of values.
y – Second group of values.
n_perm – Number of permutations.
seed – Random seed.
- Returns:
Two-sided permutation p-value in [0, 1].
- Return type:
float
Notes
H0: mean(y - x) = 0
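The sign-flip idea can be sketched as follows: under H0 each paired difference is symmetric about zero, so randomly flipping its sign yields the null distribution. Again an illustration, not the package's code:

```python
import numpy as np

def permutation_pvalue_paired_sketch(x, y, n_perm: int = 10000, seed: int = 42) -> float:
    rng = np.random.default_rng(seed)
    d = np.asarray(y, float) - np.asarray(x, float)
    obs = abs(d.mean())
    # Draw a ±1 sign for every paired difference in every permutation.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    stats = np.abs((signs * d).mean(axis=1))
    # Add-one correction keeps the p-value in (0, 1].
    return float((np.sum(stats >= obs) + 1) / (n_perm + 1))

# A consistent within-pair shift should give a small p-value.
p = permutation_pvalue_paired_sketch(np.zeros(10), np.full(10, 5.0), n_perm=999, seed=1)
```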
- sctrial.utils.resolve_feature(adata: AnnData, query: str) → str[source]#
Resolve a feature name in adata.var_names or adata.obs.columns (case-insensitive).
- Parameters:
adata – AnnData object.
query – Feature name to resolve.
- Returns:
The exact name string to use.
- Return type:
str
- Raises:
KeyError – If the feature is not found in adata.var_names or adata.obs.columns.
ValueError – If the feature query is an empty string.
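The case-insensitive resolution plus the two error cases can be sketched without AnnData by passing the two name collections directly (hypothetical helper, not the package's signature):

```python
def resolve_feature_sketch(var_names, obs_columns, query: str) -> str:
    # Empty queries are rejected up front.
    if not query:
        raise ValueError("Feature query must be a non-empty string")
    # Case-insensitive lookup across gene names and obs columns,
    # returning the exact stored spelling.
    lookup = {name.lower(): name for name in [*var_names, *obs_columns]}
    if query.lower() not in lookup:
        raise KeyError(f"Feature not found: {query!r}")
    return lookup[query.lower()]

resolve_feature_sketch(["CD3E", "CD8A"], ["age", "bmi"], "cd3e")
# -> "CD3E"
```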
- sctrial.utils.safe_filename(s: str, maxlen: int = 180) → str[source]#
Return a filesystem-safe filename slug.
- Parameters:
s – Input string to sanitize.
maxlen – Maximum length of the output string.
- Returns:
A filesystem-safe filename.
- Return type:
str
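One common slug recipe is shown below; the exact character set kept by safe_filename may differ, so treat this as a sketch of the idea:

```python
import re

def safe_filename_sketch(s: str, maxlen: int = 180) -> str:
    # Replace runs of characters outside [A-Za-z0-9._-] with a single "_",
    # trim leading/trailing underscores, and cap the length.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", s).strip("_")
    return slug[:maxlen]

safe_filename_sketch("CD4+ T cells / visit 2")
# -> "CD4_T_cells_visit_2"
```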
- sctrial.utils.wild_cluster_bootstrap_t(fit: RegressionResultsWrapper, X: ndarray, clusters: ndarray, term_name: str, B: int = 999, seed: int = 42, cov_type: str = 'cluster', ci_level: float = 0.95) → BootstrapResult[source]#
Wild cluster bootstrap (Rademacher) for one coefficient.
Notes
Implements a wild cluster bootstrap-t using Rademacher weights at the cluster level. This is recommended when the number of clusters is small and standard cluster-robust inference may be unreliable.
Each bootstrap draw perturbs the restricted residuals (imposing H0: beta_j = 0) with cluster-level Rademacher weights (±1 with equal probability), re-fits the full model via OLS (or WLS when the original fit used weights) with per-iteration cluster-robust SE, and forms a bootstrap t-statistic. The two-sided p-value is the fraction of bootstrap |t*| values that exceed the observed |t|.
Bootstrap confidence intervals use the bootstrap-t (studentized) method: quantiles of the bootstrap t-distribution are applied to the observed point estimate and SE, yielding asymmetry-respecting CIs that are approximately consistent with the bootstrap p-value (Hall, 1992). Note: this is not exact test-inversion; a full inversion CI would require re-running the bootstrap at every candidate null, which is computationally prohibitive. The bootstrap-t is the standard practical approach recommended by Cameron et al. (2008).
Reference: Cameron, A.C., Gelbach, J.B., & Miller, D.L. (2008). Bootstrap-based improvements for inference with clustered errors. The Review of Economics and Statistics, 90(3), 414–427.
- Parameters:
fit – Statsmodels regression results (with cluster-robust SE).
X – Design matrix (fit.model.exog).
clusters – Array of cluster IDs.
term_name – Name of the coefficient to test.
B – Number of bootstrap draws.
seed – Random seed.
cov_type – Covariance type for bootstrap refits. Default "cluster" uses cluster-robust SE (Cameron et al. 2008). Use "nonrobust" when participant fixed effects already absorb within-cluster correlation (e.g. participant_visit aggregation with 2 obs per cluster).
ci_level – Confidence level for the bootstrap-t CI (default 0.95 → 95% CI).
- Returns:
Named tuple with p_boot, se_boot, ci_lo, ci_hi, and boot_distribution.
- Return type:
BootstrapResult
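The core of the procedure described in the Notes can be sketched in plain NumPy. This is a deliberately simplified illustration (plain OLS, CR0 cluster-robust variance, no WLS weights, and no CI construction); the package's version operates on a statsmodels fit and also returns bootstrap-t intervals:

```python
import numpy as np

def wild_cluster_bootstrap_t_sketch(y, X, clusters, j, B=999, seed=42):
    """Bootstrap-t p-value for H0: beta_j = 0 (plain OLS, CR0 variance)."""
    rng = np.random.default_rng(seed)
    ids = np.unique(clusters)
    k = X.shape[1]

    def ols(yv):
        beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
        resid = yv - X @ beta
        bread = np.linalg.inv(X.T @ X)
        # Cluster-robust "meat": sum of per-cluster score outer products.
        meat = np.zeros((k, k))
        for g in ids:
            s = X[clusters == g].T @ resid[clusters == g]
            meat += np.outer(s, s)
        V = bread @ meat @ bread
        return beta, resid, np.sqrt(V[j, j])

    beta, _, se = ols(y)
    t_obs = beta[j] / se

    # Restricted fit imposing the null beta_j = 0.
    X0 = np.delete(X, j, axis=1)
    b0, *_ = np.linalg.lstsq(X0, y, rcond=None)
    u0 = y - X0 @ b0
    fitted0 = X0 @ b0

    obs_to_cluster = np.searchsorted(ids, clusters)
    t_star = np.empty(B)
    for b in range(B):
        # Rademacher weights drawn once per cluster, applied to all its obs.
        w = rng.choice([-1.0, 1.0], size=ids.size)
        y_star = fitted0 + u0 * w[obs_to_cluster]
        beta_b, _, se_b = ols(y_star)
        t_star[b] = beta_b[j] / se_b  # centered at 0: the null was imposed
    return float(np.mean(np.abs(t_star) >= abs(t_obs)))

# Synthetic example: 6 small clusters, strong true effect on x.
rng = np.random.default_rng(0)
clusters = np.repeat(np.arange(6), 5)
x = rng.normal(size=30)
X = np.column_stack([np.ones(30), x])
y = 1.0 + 5.0 * x + rng.normal(scale=0.1, size=30)
p = wild_cluster_bootstrap_t_sketch(y, X, clusters, j=1, B=199, seed=42)
```

Note that with few clusters the bootstrap t-distribution has only 2^G distinct weight patterns, which is exactly the small-G regime where this method is recommended over standard cluster-robust inference.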
- class sctrial.utils.BootstrapResult(p_boot: float, se_boot: float, ci_lo: float, ci_hi: float, boot_distribution: np.ndarray)[source]#
Result of a wild cluster bootstrap procedure.
- p_boot
Two-sided bootstrap p-value.
- Type:
float
- se_boot
Bootstrap standard error (SD of bootstrap coefficient distribution).
- Type:
float
- ci_lo
Lower bound of the bootstrap-t confidence interval.
- Type:
float
- ci_hi
Upper bound of the bootstrap-t confidence interval.
- Type:
float
- boot_distribution
Array of bootstrap coefficient estimates (valid draws only).
- Type:
np.ndarray
- boot_distribution: ndarray
Alias for field number 4
- ci_hi: float
Alias for field number 3
- ci_lo: float
Alias for field number 2
- p_boot: float
Alias for field number 0
- se_boot: float
Alias for field number 1
Validation#
Data validation utilities for trial analysis.
- class sctrial.validation.TrialDataValidator[source]#
Bases: object
Comprehensive validation for trial analysis data.
- static validate_adata(adata: AnnData, design: TrialDesign, strict: bool = False) → list[str][source]#
Validate AnnData object for trial analysis.
- Parameters:
adata – AnnData object to validate.
design – TrialDesign specifying column names.
strict – If True, raises exceptions. If False, returns warnings.
- Returns:
List of warning/error messages.
- Return type:
list[str]
Examples
>>> validator = TrialDataValidator()
>>> issues = validator.validate_adata(adata, design, strict=False)
>>> if issues:
...     print(f"Found {len(issues)} issues:")
...     for issue in issues:
...         print(f"  - {issue}")
- static validate_features(adata: AnnData, features: Sequence[str], allow_missing: bool = False) → tuple[list[str], list[str]][source]#
Validate feature names.
- Parameters:
adata – AnnData object.
features – List of feature names to validate.
allow_missing – If False, raises error for missing features.
- Returns:
Tuple of (valid_features, missing_features).
- Return type:
tuple[list[str], list[str]]
Examples
>>> valid, missing = TrialDataValidator.validate_features(
...     adata, ["Gene1", "Gene2", "NonExistent"]
... )
>>> print(f"Valid: {valid}, Missing: {missing}")
- sctrial.validation.check_covariate_balance(adata: AnnData, design: TrialDesign, covariates: Sequence[str], *, visit: str | None = None, dropna: bool = True, smd_threshold: float = 0.1) → DataFrame[source]#
Compute standardized mean differences (SMD) for baseline covariates.
This compares treated vs control arms at a single visit (usually baseline) and reports SMD values that quantify imbalance.
- Parameters:
adata – AnnData object with trial metadata in adata.obs.
design – TrialDesign describing participant, visit, and arm columns.
covariates – List of covariate column names in adata.obs.
visit – Visit label to use for balance checks. If None, uses design.baseline_visit; raises if that is not available.
dropna – If True, drop rows with missing covariate values for each covariate.
smd_threshold – Absolute SMD threshold for the “balanced” flag (default 0.1).
- Returns:
Table with SMD values. Numeric covariates produce one row per covariate. Categorical covariates produce one row per level with proportions.
- Return type:
pd.DataFrame
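For a numeric covariate the SMD is the arm-mean difference divided by a pooled standard deviation, and a covariate is flagged balanced when |SMD| falls below smd_threshold. A minimal sketch (Cohen's-d-style pooling of the two arm variances; the package's exact pooling may differ):

```python
import numpy as np

def smd_numeric_sketch(treated, control) -> float:
    # Standardized mean difference with a pooled standard deviation.
    m1, m0 = np.mean(treated), np.mean(control)
    v1, v0 = np.var(treated, ddof=1), np.var(control, ddof=1)
    return float((m1 - m0) / np.sqrt((v1 + v0) / 2.0))

smd = smd_numeric_sketch([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
# Means differ by 1 and the pooled SD is 1, so smd == 1.0
# (far above the default 0.1 threshold: flagged as imbalanced).
```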
- sctrial.validation.diagnose_trial_data(adata: AnnData, design: TrialDesign, verbose: bool = True) → dict[str, Any][source]#
Comprehensive diagnostic report for trial data.
- Parameters:
adata – AnnData object to diagnose.
design – TrialDesign object.
verbose – If True, prints diagnostic report.
- Returns:
Diagnostic summary with keys including:
- n_cells, n_genes, n_participants, n_visits, n_arms
- paired_participants (dict of visit pairs -> counts)
- cells_per_participant (pd.Series)
- warnings (list of strings)
- recommendations (list of strings)
- Return type:
dict[str, Any]
Examples
>>> diagnostics = diagnose_trial_data(adata, design, verbose=True)
>>> if diagnostics['warnings']:
...     print("Warnings found:")
...     for w in diagnostics['warnings']:
...         print(f"  - {w}")
- sctrial.validation.validate_adata(adata: AnnData, design: TrialDesign, strict: bool = False) → list[str][source]#
Validate AnnData object for trial analysis.
Convenience wrapper around TrialDataValidator.validate_adata().
- Parameters:
adata – AnnData object to validate.
design – TrialDesign specifying column names.
strict – If True, raises exceptions. If False, returns warnings.
- Returns:
List of warning/error messages.
- Return type:
list[str]
- sctrial.validation.validate_features(adata: AnnData, features: Sequence[str], allow_missing: bool = False) → tuple[list[str], list[str]][source]#
Validate feature names.
Convenience wrapper around TrialDataValidator.validate_features().
- Parameters:
adata – AnnData object.
features – List of feature names to validate.
allow_missing – If False, raises error for missing features.
- Returns:
Tuple of (valid_features, missing_features).
- Return type:
tuple[list[str], list[str]]