Datasets

Built-in dataset loaders for clinical trial scRNA-seq cohorts.

sctrial.datasets.categorize_celltype(ct: str) → str

Map fine-grained cell types to coarse lineages (COVID-19 example).

Parameters: ct – Cell type string.
Returns: Coarse lineage string.
Return type: str

sctrial.datasets.count_paired(obs: DataFrame, visit_col: str, visits: Sequence[str], participant_col: str = 'participant_id') → int

Count participants with data at both visits.

Parameters:

obs – DataFrame containing the participant-visit data.
visit_col – Column name in obs to use for visit labels.
visits – Sequence of visit labels to check (e.g. ["baseline", "followup"]).
participant_col – Column name in obs to use for participant IDs.

Returns:

Number of participants with data at both visits.

Return type:

int

Raises:

ValueError – If visits does not contain at least 2 labels (baseline and followup).
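The pairing check can be illustrated with plain pandas. The sketch below is not the library's source: `count_paired_sketch` is a hypothetical name, and it assumes that the first two entries of visits are the baseline and follow-up labels.

```python
from typing import Sequence

import pandas as pd


def count_paired_sketch(
    obs: pd.DataFrame,
    visit_col: str,
    visits: Sequence[str],
    participant_col: str = "participant_id",
) -> int:
    """Illustrative re-implementation; not taken from sctrial."""
    if len(visits) < 2:
        raise ValueError("visits must contain at least 2 labels")
    # Assumption: the first two labels are baseline and follow-up.
    baseline, followup = visits[0], visits[1]
    # Participants that have at least one row at each visit.
    at_baseline = set(obs.loc[obs[visit_col] == baseline, participant_col])
    at_followup = set(obs.loc[obs[visit_col] == followup, participant_col])
    return len(at_baseline & at_followup)


# Toy data: P1 and P3 have both visits, P2 has only a baseline row.
obs = pd.DataFrame(
    {
        "participant_id": ["P1", "P1", "P2", "P3", "P3", "P3"],
        "visit": ["baseline", "followup", "baseline",
                  "baseline", "followup", "followup"],
    }
)
n_paired = count_paired_sketch(obs, "visit", ["baseline", "followup"])
print(n_paired)  # 2
```

Several cells from the same participant and visit count once, because the comparison is made on sets of participant IDs.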

sctrial.datasets.ensure_fdr(df: DataFrame, p_col: str = 'p_time', fdr_col: str = 'FDR_time') → DataFrame

Add Benjamini-Hochberg FDR column for a p-value column.

Parameters:

df – DataFrame containing the p-value column.
p_col – Name of the column in df that holds the p-values.
fdr_col – Name of the column in which to store the FDR-corrected p-values.

Returns:

A copy of the DataFrame with the FDR-corrected p-value column added.

Return type:

pd.DataFrame
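For readers unfamiliar with the Benjamini-Hochberg adjustment, here is a minimal dependency-free sketch of the procedure. `bh_adjust` is a hypothetical helper, not part of sctrial, and it assumes the input contains no missing values. How ensure_fdr itself treats NaN p-values is not documented here.

```python
from typing import List, Sequence


def bh_adjust(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # taking a running minimum so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_time = [0.01, 0.04, 0.03, 0.20]
fdr_time = bh_adjust(p_time)
print([round(q, 4) for q in fdr_time])  # [0.04, 0.0533, 0.0533, 0.2]
```

In this toy input the second and third values share the same adjusted value because the running minimum enforces monotonicity across ranks.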

sctrial.datasets.harmonize_response(adata: AnnData, *, force: bool = False) → AnnData

Create a response_harmonized column with consistent labels.

Maps various responder/non-responder column names and label formats (e.g. "R"/"NR", "Responder"/"Non-responder") to a standard vocabulary: "Responder" and "Non-responder".

Parameters:

adata (AnnData) – Must contain one of: response, Response, or clinical_response in .obs.
force (bool) – If True, recompute even when the column already exists.

Returns:

The input AnnData with response_harmonized added to .obs.

Return type:

AnnData
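The label mapping described above can be sketched as a small lookup table. This is an illustration only: `harmonize_label` is a hypothetical helper, and both the case-insensitive matching and the `None` result for unrecognized labels are assumptions, not documented behavior of harmonize_response.

```python
from typing import Optional

# Lower-cased source labels mapped to the standard vocabulary.
_RESPONSE_MAP = {
    "r": "Responder",
    "responder": "Responder",
    "nr": "Non-responder",
    "non-responder": "Non-responder",
}


def harmonize_label(label: object) -> Optional[str]:
    """Return the standard label, or None if the label is not recognized."""
    # Assumption: matching ignores case and surrounding whitespace.
    return _RESPONSE_MAP.get(str(label).strip().lower())


raw_labels = ["R", "NR", "Responder", "Non-responder", "unknown"]
harmonized = [harmonize_label(x) for x in raw_labels]
print(harmonized)
# ['Responder', 'Non-responder', 'Responder', 'Non-responder', None]
```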

sctrial.datasets.load_aml(data_dir: str | None = None, processed_name: str = 'gse116256_aml_processed.h5ad', max_cells_per_sample: int | None = None, seed: int = 42, allow_download: bool = False, force_reprocess: bool = False) → AnnData

Load the van Galen AML chemotherapy dataset (GSE116256).

This dataset contains pre/post-chemotherapy bone marrow samples from AML patients with cell-type annotations and treatment-response metadata.

Parameters:

data_dir (str or None) – Directory containing (or to store) the data files. Raw files go in <data_dir>/raw/ and the processed cache in <data_dir>/processed/.
processed_name (str) – Filename for the cached processed h5ad file.
max_cells_per_sample (int or None) – Maximum number of cells to keep per sample; larger samples are subsampled.
seed (int) – Random seed for reproducibility.
allow_download (bool) – If True, download raw data from GEO when not found locally.
force_reprocess (bool) – If True, reprocess even when a cached file exists.

Returns:

The processed AnnData object.

Return type:

AnnData

Notes

The raw data is automatically downloaded from GEO when allow_download=True: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116256

Reference: van Galen et al., Cell 2019.

Examples

>>> import sctrial
>>> adata = sctrial.load_aml(allow_download=True)
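The interaction between the directory layout, allow_download, and force_reprocess can be sketched as a small decision function. This is a guess at the control flow implied by the parameter descriptions, not the library's code: `plan_load` and its return strings are hypothetical, and raising `FileNotFoundError` when nothing is available locally is an assumption.

```python
import tempfile
from pathlib import Path


def plan_load(data_dir, processed_name, allow_download=False, force_reprocess=False):
    """Decide what a loader would need to do (hypothetical helper)."""
    root = Path(data_dir)
    raw_dir = root / "raw"                      # raw GEO files
    cache = root / "processed" / processed_name  # processed h5ad cache
    if cache.exists() and not force_reprocess:
        return "load_cache"
    if raw_dir.is_dir() and any(raw_dir.iterdir()):
        return "process_raw"
    if allow_download:
        return "download_then_process"
    # Assumption: without local data or permission to download, fail loudly.
    raise FileNotFoundError(f"No data under {root}; pass allow_download=True")


name = "gse116256_aml_processed.h5ad"
with tempfile.TemporaryDirectory() as tmp:
    first = plan_load(tmp, name, allow_download=True)
    # Simulate a previous run having written the processed cache.
    cache_file = Path(tmp) / "processed" / name
    cache_file.parent.mkdir(parents=True)
    cache_file.touch()
    second = plan_load(tmp, name)
    third = plan_load(tmp, name, allow_download=True, force_reprocess=True)

print(first, second, third)
# download_then_process load_cache download_then_process
```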

sctrial.datasets.load_cart(data_dir: str | None = None, processed_name: str = 'gse290722_cart_processed.h5ad', max_cells_per_sample: int | None = None, seed: int = 42, allow_download: bool = False, force_reprocess: bool = False) → AnnData

Load the CAR-T cell therapy dataset (GSE290722).

This dataset contains pre/post-CAR-T infusion samples with cell-type annotations and treatment-response metadata from the ZUMA-1 trial.

Parameters:

data_dir (str or None) – Directory containing (or to store) the data files. Raw files go in <data_dir>/raw/ and the processed cache in <data_dir>/processed/.
processed_name (str) – Filename for the cached processed h5ad file.
max_cells_per_sample (int or None) – Maximum number of cells to keep per sample; larger samples are subsampled.
seed (int) – Random seed for reproducibility.
allow_download (bool) – If True, download raw data from GEO when not found locally.
force_reprocess (bool) – If True, reprocess even when a cached file exists.

Returns:

The processed AnnData object.

Return type:

AnnData

Notes

The raw data is automatically downloaded from GEO when allow_download=True: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE290722

Reference: GSE290722 CAR-T therapy dataset (ZUMA-1 trial).

Examples

>>> import sctrial
>>> adata = sctrial.load_cart(allow_download=True)

sctrial.datasets.load_sade_feldman(data_dir: str | None = None, processed_name: str = 'sade_feldman_processed_v6.h5ad', max_cells_per_participant_visit: int | None = None, seed: int = 42, allow_download: bool = False, force_reprocess: bool = False) → AnnData

Load and preprocess the Sade-Feldman melanoma immunotherapy dataset (GSE120575).

Parameters:

data_dir (str or None) – Directory containing (or to store) the raw data files.
processed_name (str) – Filename for the cached processed h5ad file.
max_cells_per_participant_visit (int or None) – Maximum number of cells to retain per participant-visit pair.
seed (int) – Random seed for reproducibility.
allow_download (bool) – If True, download missing files from GEO automatically.
force_reprocess (bool) – If True, reprocess even when a cached file exists.

Returns:

The processed AnnData object.

Return type:

AnnData
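A seeded per-group cap such as max_cells_per_participant_visit can be sketched with pandas as shown below. This is one possible approach, not necessarily what the loader does: `cap_cells_per_group` is a hypothetical helper, the column names `participant_id` and `visit` are assumed, and the cell index is assumed to be unique.

```python
import pandas as pd


def cap_cells_per_group(obs, group_cols, max_cells, seed=42):
    """Keep at most max_cells randomly chosen rows per group."""
    if max_cells is None:
        return obs
    # Shuffle reproducibly, then take the first max_cells rows per group.
    shuffled = obs.sample(frac=1.0, random_state=seed)
    kept = shuffled.groupby(list(group_cols), sort=False).head(max_cells)
    # Select from the original frame so rows keep their original order.
    return obs.loc[obs.index.isin(kept.index)]


# Toy per-cell metadata: group sizes are 5, 2, and 3 cells.
obs = pd.DataFrame(
    {
        "participant_id": ["P1"] * 7 + ["P2"] * 3,
        "visit": ["baseline"] * 5 + ["followup"] * 2 + ["baseline"] * 3,
    },
    index=[f"cell{i}" for i in range(10)],
)
capped = cap_cells_per_group(obs, ["participant_id", "visit"], max_cells=3)
print(capped.groupby(["participant_id", "visit"]).size())
```

Groups that are already at or below the cap are kept whole, and repeating the call with the same seed selects the same cells.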

sctrial.datasets.load_stephenson_data(data_dir: str | None = None, processed_name: str = 'stephenson_covid19_v3.h5ad', seed: int = 42, allow_download: bool = False, force_reprocess: bool = False, *, data_path: str | None = None) → AnnData

Load and preprocess the Stephenson COVID-19 dataset (E-MTAB-10026).

Parameters:

data_dir (str or None) – Directory containing (or to store) the raw data files.
processed_name (str) – Filename for the cached processed h5ad file.
seed (int) – Random seed for reproducibility.
allow_download (bool) – If True, download the data file automatically when missing.
force_reprocess (bool) – If True, reprocess even when a cached file exists.
data_path (str or None) –

Deprecated since version 0.2.2: Use data_dir instead. When supplied, data_dir is ignored and the parent directory of data_path is used.

Returns:

The processed AnnData object.

Return type:

AnnData

sctrial.datasets.load_vaccine_gse171964(data_dir: str | None = None, processed_name: str = 'vaccine_gse171964.h5ad', max_participants: int | None = None, max_cells_per_group: int | None = None, seed: int = 42, allow_download: bool = False, force_reprocess: bool = False) → AnnData

Load and preprocess the GSE171964 PBMC vaccine time-course data (Day 0 vs Day 7).

Parameters:

data_dir (str or None) – Directory containing (or to store) the raw data files.
processed_name (str) – Filename for the cached processed h5ad file.
max_participants (int or None) – Maximum number of participants to retain.
max_cells_per_group (int or None) – Maximum number of cells per participant-day-celltype group.
seed (int) – Random seed for reproducibility.
allow_download (bool) – If True, download missing files from GEO automatically.
force_reprocess (bool) – If True, reprocess even when a cached file exists.

Returns:

The processed AnnData object.

Return type:

AnnData

sctrial.datasets.verify_paired_participants(obs: DataFrame, visit_col: str, visits: Sequence[str], features: Sequence[str] | None = None, participant_col: str = 'participant_id') → dict

Validate paired participants by visit presence and optional feature completeness.

Parameters:

obs – DataFrame containing the participant-visit data.
visit_col – Column name in obs to use for visit labels.
visits – Sequence of visit labels to check (e.g. ["baseline", "followup"]).
features – Optional sequence of feature names to check for missing (NaN) values.
participant_col – Column name in obs to use for participant IDs.

Returns:

A dictionary with the following keys:

- paired_ids: set of participant IDs with both visits (and non-NaN features, if provided)
- dropped_ids: list of participant IDs dropped by validation
- n_paired: count of paired_ids
- n_total: total number of unique participants

Return type:

dict
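The shape of the returned dictionary can be illustrated with a small pandas sketch. It is not the library's implementation: `verify_paired_sketch` is a hypothetical name, the features are assumed to be columns of obs, the first two entries of visits are assumed to be the pair of interest, and a participant is assumed to need at least one row with complete features at each visit.

```python
from typing import Optional, Sequence

import pandas as pd


def verify_paired_sketch(
    obs: pd.DataFrame,
    visit_col: str,
    visits: Sequence[str],
    features: Optional[Sequence[str]] = None,
    participant_col: str = "participant_id",
) -> dict:
    """Illustrative re-implementation; not taken from sctrial."""
    wanted = list(visits)[:2]
    all_ids = set(obs[participant_col].dropna())
    # Keep only rows from the two visits of interest.
    usable = obs[obs[visit_col].isin(wanted)]
    if features:
        # Assumption: rows with a NaN in any listed feature do not count.
        usable = usable.dropna(subset=list(features))
    n_visits = usable.groupby(participant_col)[visit_col].nunique()
    paired_ids = set(n_visits[n_visits == len(wanted)].index)
    return {
        "paired_ids": paired_ids,
        "dropped_ids": sorted(all_ids - paired_ids),
        "n_paired": len(paired_ids),
        "n_total": len(all_ids),
    }


obs = pd.DataFrame(
    {
        "participant_id": ["P1", "P1", "P2", "P2", "P3"],
        "visit": ["baseline", "followup", "baseline", "followup", "baseline"],
        "score": [1.0, 2.0, 1.5, float("nan"), 0.5],
    }
)
by_visit = verify_paired_sketch(obs, "visit", ["baseline", "followup"])
by_feature = verify_paired_sketch(
    obs, "visit", ["baseline", "followup"], features=["score"]
)
print(by_visit["dropped_ids"], by_feature["dropped_ids"])  # ['P3'] ['P2', 'P3']
```

P2 is paired by visit presence alone but is dropped once the `score` feature is required, because its follow-up value is missing.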