Scoring and Preprocessing

Gene-Set Scoring

Gene-set scoring: z-mean, Seurat-style, and AUCell methods.

sctrial.scoring.score_gene_sets(adata: AnnData, gene_sets: dict[str, list[str]], *, layer: str | None = None, method: Literal['zmean', 'mean'] = 'zmean', prefix: str = '', min_genes: int = 3, overwrite: bool = True) -> AnnData

Score gene sets and store results in adata.obs.

Parameters:
  • adata – AnnData object containing expression data.

  • gene_sets – Dictionary mapping set names to lists of gene names. Each value must be a list (not a bare string). Duplicate gene names within a set are automatically removed.

  • layer – Expression matrix source. If None, uses adata.X. For log1p-CPM workflows, use layer="log1p_cpm".

  • method

    Scoring method:

    • "mean": mean expression across genes.

    • "zmean": z-score each gene across cells (within the current AnnData), then average the z-scores across genes. This is the recommended method because it accounts for different expression scales across genes.

  • prefix – Prefix to add to column names (e.g., ms_ for module scores).

  • min_genes – Minimum number of genes from the set that must be present in the data. If fewer genes overlap, the score is set to NaN and a warning is logged. Default is 3.

  • overwrite – If False, skip gene sets that already have a column in adata.obs.

Returns:

The input AnnData with new columns added to obs.

Return type:

AnnData

Notes

Zero-variance gene handling (zmean method): Genes with zero or near-zero variance (std < 1e-12) are excluded from the z-mean calculation. If ALL genes in a set have zero variance, the score is NaN. This prevents division by zero and ensures meaningful scores.

The zmean method computes: mean(z_i) where z_i = (x_i - mean(x_i)) / std(x_i) for each gene i across all cells.

Non-finite expression values: NaN and inf values in the expression matrix are excluded from score computation (treated as missing). A warning is logged when non-finite values are detected. If all values for a cell are non-finite the resulting score will be NaN.
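
The notes above can be sketched with plain NumPy. This is a minimal illustration of the described z-mean behaviour, not the sctrial implementation: the helper name zmean_scores, the eps argument, and the use of the population standard deviation are assumptions made for the example.

```python
import numpy as np


def zmean_scores(x, eps=1e-12):
    """Average per-gene z-scores for one gene set (hypothetical helper).

    x: dense array of shape (n_cells, n_genes), restricted to the set's genes.
    Returns an array of shape (n_cells,).
    """
    x = np.asarray(x, dtype=float)
    # Treat NaN/inf as missing so they do not contaminate means or stds.
    x = np.where(np.isfinite(x), x, np.nan)

    mu = np.nanmean(x, axis=0)  # per-gene mean across cells
    sd = np.nanstd(x, axis=0)   # per-gene std across cells (population std, an assumption)

    keep = sd >= eps            # exclude zero / near-zero variance genes
    if not keep.any():
        # Every gene in the set is constant: the score is undefined.
        return np.full(x.shape[0], np.nan)

    z = (x[:, keep] - mu[keep]) / sd[keep]
    n_valid = np.isfinite(z).sum(axis=1)  # finite z-scores available per cell
    total = np.nansum(z, axis=1)
    # Cells with no finite value among the kept genes get NaN.
    return np.where(n_valid > 0, total / np.maximum(n_valid, 1), np.nan)


# Two cells, three genes; the middle gene is constant and is dropped.
demo = np.array([[1.0, 5.0, 2.0],
                 [3.0, 5.0, 4.0]])
print(zmean_scores(demo))  # scores are -1.0 and 1.0
```

Because each gene is standardized before averaging, a highly expressed gene cannot dominate the score the way it can with method="mean".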

sctrial.scoring.score_gene_sets_aucell(adata: AnnData, gene_sets: dict[str, list[str]] | dict[str, GeneSignature], *, layer: str | None = None, prefix: str = 'aucell_', overwrite: bool = False) -> AnnData

Score gene sets using AUCell (pySCENIC).

Requires pyscenic to be installed.

Parameters:
  • adata – AnnData object containing expression data.

  • gene_sets – Dictionary mapping set names to lists of genes (or GeneSignature objects).

  • layer – Expression layer to use. If None, uses adata.X.

  • prefix – Prefix to add to output columns (default: aucell_).

  • overwrite – If False, skip sets that already exist in adata.obs.

Returns:

The input AnnData with AUCell scores added to adata.obs.

Return type:

AnnData

Preprocessing

sctrial.preprocessing.add_log1p_cpm_layer(adata: AnnData, *, counts_layer: str | None = 'counts', out_layer: str = 'log1p_cpm', layer_out: str | None = None, scale: float = 1000000.0, overwrite: bool = False, inplace: bool = True) -> AnnData

Add log1p(CPM) normalization as a layer.

Parameters:
  • adata – AnnData object with raw counts.

  • counts_layer – Layer name containing raw counts. If None, uses adata.X.

  • out_layer – Output layer name.

  • layer_out – Backwards-compatible alias for out_layer.

  • scale – CPM scale factor (default 1e6). Must be finite and positive.

  • overwrite – If True, overwrite out_layer when it already exists.

  • inplace – Modify the input AnnData if True, else return a copy.

Returns:

The AnnData object with the new log1p(CPM) layer added.

Return type:

AnnData

Raises:
  • ValueError – If scale is not a finite positive number, or if the counts matrix contains negative values.

  • KeyError – If counts_layer is not found in adata.layers.
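
As a rough illustration of the transform and the validation described above, the following NumPy sketch computes log1p(CPM) on a dense cells-by-genes matrix. The function name log1p_cpm and the handling of cells with zero total counts are assumptions for the example; the library's actual behaviour in that edge case is not documented here.

```python
import math

import numpy as np


def log1p_cpm(counts, scale=1e6):
    """log1p of counts-per-`scale` for a dense (n_cells, n_genes) matrix (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    if not (math.isfinite(scale) and scale > 0):
        raise ValueError("scale must be a finite positive number")
    if (counts < 0).any():
        raise ValueError("counts matrix contains negative values")

    totals = counts.sum(axis=1, keepdims=True)  # library size per cell
    # Assumption: cells with zero total counts stay at zero rather than dividing by zero.
    safe_totals = np.where(totals > 0, totals, 1.0)
    return np.log1p(counts / safe_totals * scale)


raw = np.array([[1, 3],
                [0, 0]])
print(log1p_cpm(raw))
```

Each row is divided by its own total before scaling, so cells with different sequencing depths become comparable.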

AnnData Tools

AnnData manipulation utilities: subsetting, merging, pseudobulk helpers.

sctrial.adata_tools.profile_features(adata: AnnData, features: Sequence[str], groupby: str, layer: str | None = None, agg: str = 'mean') -> DataFrame

Calculate aggregate expression of features across groups.

Useful for profiling marker sets across clusters or trial arms.

Parameters:
  • adata – AnnData object.

  • features – Genes or obs columns to aggregate.

  • groupby – Column in adata.obs to group by.

  • layer – Expression layer to use for genes.

  • agg – Name of an aggregation function supported by pandas (e.g., 'mean', 'median').

Returns:

Table indexed by the groupby categories, with one column per feature, containing the aggregated feature values.

Return type:

pd.DataFrame

Notes

Aggregation is performed at the cell level. When grouping by treatment arm or participant, groups with more cells will dominate the aggregate. For participant-level or balanced comparisons, use pseudobulk_expression() or pre-aggregate to pseudobulk before calling this function.
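
The effect described in this note can be seen on a toy table. The sketch below uses pandas only; the column names arm, participant, and GENE_A are hypothetical, and the two-step averaging is a simple stand-in for participant-level aggregation, not the implementation of pseudobulk_expression().

```python
import pandas as pd

# Toy per-cell table: participant P1 contributes four cells, P2 only one.
cells = pd.DataFrame({
    "arm": ["treated"] * 5,
    "participant": ["P1", "P1", "P1", "P1", "P2"],
    "GENE_A": [1.0, 1.0, 1.0, 1.0, 6.0],
})

# Cell-level aggregation: every cell has equal weight, so P1 dominates.
cell_level = cells.groupby("arm")["GENE_A"].mean()

# Participant-level aggregation: average within each participant first,
# then average the participant means within each arm.
per_participant = cells.groupby(["arm", "participant"])["GENE_A"].mean()
balanced = per_participant.groupby(level="arm").mean()

print(cell_level["treated"])  # 2.0
print(balanced["treated"])    # 3.5
```

The two numbers differ because the cell-level mean weights participants by how many cells they contributed.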

sctrial.adata_tools.subset_cells(adata: AnnData, design: TrialDesign, arm: str | None = None, visit: str | None = None, celltype: str | None = None, exclude_crossovers: bool = False) -> AnnData

General-purpose helper for subsetting cells by arm, visit, and/or cell type, with optional crossover exclusion.

Parameters:
  • adata – AnnData object.

  • design – TrialDesign object.

  • arm – Arm to subset by.

  • visit – Visit to subset by.

  • celltype – Celltype to subset by.

  • exclude_crossovers – If True, exclude crossovers.

Returns:

Subsetted AnnData object.

Return type:

AnnData

sctrial.adata_tools.subset_primary(adata: AnnData, design: TrialDesign, visits: tuple[str, str], exclude_crossovers: bool = True) -> AnnData

Subset AnnData to the primary (baseline, followup) visits.

Parameters:
  • adata – AnnData object.

  • design – TrialDesign object.

  • visits – Tuple of (baseline_visit, followup_visit), e.g. ("3/T0", "6/T12w").

  • exclude_crossovers – If True and design.crossover_col is provided, drop rows where crossover_col is truthy.

Returns:

Subsetted AnnData object.

Return type:

AnnData
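
The filtering logic described for subset_primary can be approximated with a boolean mask over an obs-like table. This is a sketch under stated assumptions: the column names visit and crossover are hypothetical stand-ins for whatever columns the TrialDesign points to, and the code does not call sctrial.

```python
import pandas as pd

# Toy stand-in for adata.obs; column names are hypothetical.
obs = pd.DataFrame(
    {
        "visit": ["3/T0", "6/T12w", "9/T24w", "3/T0", "6/T12w"],
        "crossover": [False, False, False, True, True],
    },
    index=["c1", "c2", "c3", "c4", "c5"],
)

visits = ("3/T0", "6/T12w")  # (baseline_visit, followup_visit)

keep = obs["visit"].isin(visits)         # keep only baseline and follow-up cells
keep &= ~obs["crossover"].astype(bool)   # drop cells whose crossover flag is truthy

primary = obs[keep]
print(list(primary.index))  # ['c1', 'c2']
```

With a real AnnData object, the same mask could be applied as adata[keep.to_numpy()], since AnnData supports boolean indexing along the observation axis.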