Scoring and Preprocessing
Gene-Set Scoring
Gene-set scoring: z-mean, Seurat-style, and AUCell methods.
- sctrial.scoring.score_gene_sets(adata: AnnData, gene_sets: dict[str, list[str]], *, layer: str | None = None, method: Literal['zmean', 'mean'] = 'zmean', prefix: str = '', min_genes: int = 3, overwrite: bool = True) -> AnnData
Score gene sets and store results in adata.obs.
- Parameters:
adata – AnnData object containing expression data.
gene_sets – Dictionary mapping set names to lists of gene names. Each value must be a list (not a bare string). Duplicate gene names within a set are automatically removed.
layer – Expression matrix source. If None, uses adata.X. For log1p-CPM workflows, use layer="log1p_cpm".
method – Scoring method:
"mean": mean expression across genes.
"zmean": z-score each gene across cells (within the current AnnData), then average z-scores across genes. This is the recommended method because it accounts for different expression scales across genes.
prefix – Prefix to add to column names (e.g., ms_ for module scores).
min_genes – Minimum number of genes from the set that must be present in the data. If fewer genes overlap, the score is set to NaN and a warning is logged. Default is 3.
overwrite – If False, skip gene sets that already have a column in adata.obs.
- Returns:
The input AnnData with new columns added to obs.
- Return type:
AnnData
Notes
Zero-variance gene handling (zmean method): Genes with zero or near-zero variance (std < 1e-12) are excluded from the z-mean calculation. If ALL genes in a set have zero variance, the score is NaN. This prevents division by zero and ensures meaningful scores.
The zmean method computes: mean(z_i) where z_i = (x_i - mean(x_i)) / std(x_i) for each gene i across all cells.
Non-finite expression values: NaN and inf values in the expression matrix are excluded from score computation (treated as missing). A warning is logged when non-finite values are detected. If all values for a cell are non-finite, the resulting score is NaN.
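The z-mean rule described in the notes can be illustrated with a minimal plain-Python sketch. This is not the library's implementation: the function name `zmean_scores`, the dict-of-lists input, and the use of the population standard deviation are assumptions made for illustration, and non-finite values are not handled here.

```python
import math
from statistics import fmean, pstdev


def zmean_scores(expr, min_genes=3, eps=1e-12):
    """Illustrative z-mean score (hypothetical helper, not part of sctrial).

    expr maps gene name -> list of per-cell values, already restricted to
    the genes of one set that are present in the data.
    """
    n_cells = len(next(iter(expr.values()))) if expr else 0
    # Too few overlapping genes: the score is NaN for every cell.
    if len(expr) < min_genes:
        return [math.nan] * n_cells
    z_rows = []
    for values in expr.values():
        sd = pstdev(values)  # assumption: population std (ddof=0)
        if sd < eps:
            continue  # zero-variance gene is excluded from the average
        mu = fmean(values)
        z_rows.append([(v - mu) / sd for v in values])
    # All genes had zero variance: the score is NaN.
    if not z_rows:
        return [math.nan] * n_cells
    # Average the z-scores across genes, one value per cell.
    return [fmean(col) for col in zip(*z_rows)]


# Gene "C" is constant, so only "A" and "B" contribute to the score.
scores = zmean_scores({"A": [0, 1, 2], "B": [10, 20, 30], "C": [5, 5, 5]})
```

Because "A" and "B" differ only in scale, they produce identical z-scores, which is the property that makes z-mean insensitive to per-gene expression scale.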
- sctrial.scoring.score_gene_sets_aucell(adata: AnnData, gene_sets: dict[str, list[str]] | dict[str, GeneSignature], *, layer: str | None = None, prefix: str = 'aucell_', overwrite: bool = False) -> AnnData
Score gene sets using AUCell (pySCENIC).
Requires pyscenic to be installed.
- Parameters:
adata – AnnData object containing expression data.
gene_sets – Dictionary mapping set names to lists of genes (or GeneSignature objects).
layer – Expression layer to use. If None, uses adata.X.
prefix – Prefix to add to output columns (default: aucell_).
overwrite – If False, skip sets that already exist in adata.obs.
- Returns:
The input AnnData with AUCell scores added to adata.obs.
- Return type:
AnnData
Preprocessing
- sctrial.preprocessing.add_log1p_cpm_layer(adata: AnnData, *, counts_layer: str | None = 'counts', out_layer: str = 'log1p_cpm', layer_out: str | None = None, scale: float = 1000000.0, overwrite: bool = False, inplace: bool = True) -> AnnData
Add log1p(CPM) normalization as a layer.
- Parameters:
adata – AnnData object with raw counts.
counts_layer – Layer name containing raw counts. If None, uses adata.X.
out_layer – Output layer name.
layer_out – Backwards-compatible alias for out_layer.
scale – CPM scale factor (default 1e6). Must be finite and positive.
overwrite – Overwrite if out_layer already exists.
inplace – Modify the input AnnData if True, else return a copy.
- Returns:
The AnnData object with the new log1p(CPM) layer added.
- Return type:
AnnData
- Raises:
ValueError – If scale is not a finite positive number, or if the counts matrix contains negative values.
KeyError – If counts_layer is not found in adata.layers.
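As a rough illustration of the transformation, the sketch below computes log1p(CPM) on a small list-of-lists count matrix (one row per cell) and mirrors the two documented ValueError conditions. The function name `log1p_cpm` and the choice to return zeros for cells with zero total counts are assumptions for illustration, not documented behavior of add_log1p_cpm_layer.

```python
import math


def log1p_cpm(counts, scale=1e6):
    """Illustrative log1p(CPM) on rows of raw counts (hypothetical helper)."""
    if not (math.isfinite(scale) and scale > 0):
        raise ValueError("scale must be a finite positive number")
    out = []
    for row in counts:
        if any(c < 0 for c in row):
            raise ValueError("counts matrix contains negative values")
        total = sum(row)
        if total > 0:
            # Normalize each count by the cell's total, rescale, then log1p.
            out.append([math.log1p(c / total * scale) for c in row])
        else:
            # Assumption: an empty cell maps to all zeros.
            out.append([0.0 for _ in row])
    return out


normalized = log1p_cpm([[1, 3], [0, 0]])
```

Each cell is normalized by its own total, so two cells with the same relative composition receive the same values regardless of sequencing depth.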
AnnData Tools
AnnData manipulation utilities: subsetting, merging, pseudobulk helpers.
- sctrial.adata_tools.profile_features(adata: AnnData, features: Sequence[str], groupby: str, layer: str | None = None, agg: str = 'mean') -> DataFrame
Calculate aggregate expression of features across groups.
Useful for profiling marker sets across clusters or trial arms.
- Parameters:
adata – AnnData object.
features – Genes or obs columns to aggregate.
groupby – Column in adata.obs to group by.
layer – Expression layer to use for genes.
agg – Aggregation function name supported by pandas (e.g., 'mean', 'median').
- Returns:
Table with index groupby and columns features, containing aggregated feature values.
- Return type:
pd.DataFrame
Notes
Aggregation is performed at the cell level. When grouping by treatment arm or participant, groups with more cells will dominate the aggregate. For participant-level or balanced comparisons, use pseudobulk_expression() or pre-aggregate to pseudobulk before calling this function.
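The effect described in this note can be seen in a small plain-Python sketch. The toy data and variable names are hypothetical, and the code does not call sctrial. It only contrasts a cell-level mean with a mean of per-participant means.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical toy data: one (participant, value) pair per cell, all in one arm.
# P1 contributes 8 cells and P2 only 2.
cells = [("P1", 1.0)] * 8 + [("P2", 5.0)] * 2

# Cell-level aggregation: every cell counts equally, so P1 dominates.
cell_level_mean = fmean(v for _, v in cells)

# Participant-level aggregation: average within each participant first,
# then average those per-participant means.
per_participant = defaultdict(list)
for pid, value in cells:
    per_participant[pid].append(value)
participant_means = {pid: fmean(vs) for pid, vs in per_participant.items()}
participant_level_mean = fmean(participant_means.values())
```

Here the cell-level mean is pulled toward the participant with more cells, while the participant-level mean weights both participants equally.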
- sctrial.adata_tools.subset_cells(adata: AnnData, design: TrialDesign, arm: str | None = None, visit: str | None = None, celltype: str | None = None, exclude_crossovers: bool = False) -> AnnData
General-purpose helper for subsetting by arm, visit, and cell type, with optional crossover exclusion.
- Parameters:
adata – AnnData object.
design – TrialDesign object.
arm – Arm to subset by.
visit – Visit to subset by.
celltype – Celltype to subset by.
exclude_crossovers – If True, exclude crossovers.
- Returns:
Subsetted AnnData object.
- Return type:
AnnData
- sctrial.adata_tools.subset_primary(adata: AnnData, design: TrialDesign, visits: tuple[str, str], exclude_crossovers: bool = True) -> AnnData
Subset AnnData to the primary (baseline, followup) visits.
- Parameters:
visits – Tuple of (baseline_visit, followup_visit), e.g. (“3/T0”, “6/T12w”).
exclude_crossovers – If True and design.crossover_col is provided, drop rows where crossover_col is truthy.
- Returns:
Subsetted AnnData object.
- Return type:
AnnData
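The selection rule for subset_primary can be sketched as a boolean mask over per-cell metadata rows. This is an illustration only: the helper `primary_mask`, the column names `visit` and `crossover`, and the extra visit label "9/T24w" are hypothetical, and the real function operates on an AnnData object and a TrialDesign.

```python
def primary_mask(rows, visits, visit_col="visit", crossover_col=None,
                 exclude_crossovers=True):
    """Return one boolean per row: keep cells from the two primary visits,
    optionally dropping rows whose crossover flag is truthy."""
    baseline, followup = visits
    mask = []
    for row in rows:
        keep = row[visit_col] in (baseline, followup)
        # Crossover exclusion applies only when a crossover column is given.
        if keep and exclude_crossovers and crossover_col is not None:
            keep = not bool(row.get(crossover_col))
        mask.append(keep)
    return mask


rows = [
    {"visit": "3/T0", "crossover": False},
    {"visit": "6/T12w", "crossover": True},
    {"visit": "9/T24w", "crossover": False},
    {"visit": "6/T12w", "crossover": False},
]
mask = primary_mask(rows, ("3/T0", "6/T12w"), crossover_col="crossover")
kept = [row for row, keep in zip(rows, mask) if keep]
```

With an AnnData object, a mask like this would typically be applied along the cell axis to obtain the subset.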