Datasets
Built-in dataset loaders for clinical trial scRNA-seq cohorts.
- sctrial.datasets.categorize_celltype(ct: str) → str
Map fine-grained cell types to coarse lineages (COVID-19 example).
- Parameters:
ct – Cell type string.
- Returns:
Coarse lineage string.
- Return type:
str
- sctrial.datasets.count_paired(obs: DataFrame, visit_col: str, visits: Sequence[str], participant_col: str = 'participant_id') → int
Count participants with data at both visits.
- Parameters:
obs – DataFrame containing the participant-visit data.
visit_col – Name of the column in obs that holds the visit labels.
visits – Sequence of visit labels to check (e.g. ["baseline", "followup"]).
participant_col – Name of the column in obs that holds the participant IDs.
- Returns:
Number of participants with data at both visits.
- Return type:
int
- Raises:
ValueError – If visits contains fewer than two labels (baseline and follow-up).
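The counting rule described above can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: count_paired_sketch is a hypothetical name, and it assumes that only the first two labels in visits are compared.

```python
import pandas as pd


def count_paired_sketch(obs, visit_col, visits, participant_col="participant_id"):
    """Count participants observed at both of the first two visit labels."""
    if len(visits) < 2:
        raise ValueError("visits must contain at least 2 labels")
    baseline, followup = visits[0], visits[1]
    # Unique participant IDs seen at each of the two visits.
    at_baseline = set(obs.loc[obs[visit_col] == baseline, participant_col])
    at_followup = set(obs.loc[obs[visit_col] == followup, participant_col])
    # A participant is paired when it appears in both sets.
    return len(at_baseline & at_followup)


# Toy table: P1 and P3 have both visits, P2 has baseline only.
obs = pd.DataFrame(
    {
        "participant_id": ["P1", "P1", "P2", "P3", "P3", "P3"],
        "visit": ["baseline", "followup", "baseline", "baseline", "followup", "followup"],
    }
)
n_paired = count_paired_sketch(obs, "visit", ["baseline", "followup"])
```

Working on sets of IDs means that several cells (rows) from the same participant and visit are counted once.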
- sctrial.datasets.ensure_fdr(df: DataFrame, p_col: str = 'p_time', fdr_col: str = 'FDR_time') → DataFrame
Add Benjamini-Hochberg FDR column for a p-value column.
- Parameters:
df – DataFrame containing the p-value column.
p_col – Name of the column in df that holds the p-values.
fdr_col – Name of the column in which to store the FDR-corrected p-values.
- Returns:
A copy of the DataFrame with the FDR-corrected p-value column added.
- Return type:
pd.DataFrame
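For readers unfamiliar with the Benjamini-Hochberg adjustment, the following sketch shows the standard step-up procedure applied to a p-value column. The names bh_fdr and ensure_fdr_sketch are hypothetical, the code assumes the p-value column contains no NaN values, and the library's own handling of edge cases may differ.

```python
import numpy as np
import pandas as pd


def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (assumes no NaN entries)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


def ensure_fdr_sketch(df, p_col="p_time", fdr_col="FDR_time"):
    """Return a copy of df with a BH-adjusted column added."""
    out = df.copy()
    out[fdr_col] = bh_fdr(out[p_col].to_numpy())
    return out


df = pd.DataFrame({"gene": ["A", "B", "C", "D"], "p_time": [0.01, 0.04, 0.03, 0.20]})
result = ensure_fdr_sketch(df)
```

Because the function works on a copy, the input df is left without the new column, which matches the documented return value.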
- sctrial.datasets.harmonize_response(adata: AnnData, *, force: bool = False) → AnnData
Create a response_harmonized column with consistent labels.
Maps various responder/non-responder column names and label formats (e.g. "R"/"NR", "Responder"/"Non-responder") to a standard vocabulary: "Responder" and "Non-responder".
- Parameters:
adata (AnnData) – Must contain one of response, Response, or clinical_response in .obs.
force (bool) – If True, recompute even when the column already exists.
- Returns:
The input AnnData with response_harmonized added to .obs.
- Return type:
AnnData
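The label mapping can be pictured on a plain DataFrame standing in for adata.obs. This is a sketch under stated assumptions: harmonize_response_sketch and LABEL_MAP are hypothetical, only the spellings named above are covered, and the error raised for a missing column and the in-place update are guesses rather than documented behavior.

```python
import pandas as pd

# Hypothetical lookup table; only the "R"/"NR" and "Responder"/"Non-responder"
# spellings are named in the documentation.
LABEL_MAP = {
    "r": "Responder",
    "responder": "Responder",
    "nr": "Non-responder",
    "non-responder": "Non-responder",
}
CANDIDATE_COLS = ("response", "Response", "clinical_response")


def harmonize_response_sketch(obs, force=False):
    """Add a response_harmonized column to obs (a stand-in for adata.obs)."""
    if "response_harmonized" in obs.columns and not force:
        return obs  # already harmonized, nothing to do
    source = next((c for c in CANDIDATE_COLS if c in obs.columns), None)
    if source is None:
        # Assumption: the real function may signal this differently.
        raise KeyError(f"expected one of {CANDIDATE_COLS} in obs")
    normalized = obs[source].astype(str).str.strip().str.lower()
    # Labels missing from LABEL_MAP become NaN in this sketch.
    obs["response_harmonized"] = normalized.map(LABEL_MAP)
    return obs


obs = pd.DataFrame({"Response": ["R", "NR", "Responder", "Non-responder"]})
out = harmonize_response_sketch(obs)
```

Normalizing case and whitespace before the lookup keeps the table small.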
- sctrial.datasets.load_aml(data_dir: str | None = None, processed_name: str = 'gse116256_aml_processed.h5ad', max_cells_per_sample: int | None = None, seed: int = 42, allow_download: bool = False, force_reprocess: bool = False) → AnnData
Load the van Galen AML chemotherapy dataset (GSE116256).
This dataset contains pre/post-chemotherapy bone marrow samples from AML patients with cell-type annotations and treatment-response metadata.
- Parameters:
data_dir (str or None) – Directory containing (or to store) the data files. Raw files go in <data_dir>/raw/ and the processed cache in <data_dir>/processed/.
processed_name (str) – Filename for the cached processed h5ad file.
max_cells_per_sample (int or None) – Maximum cells to keep per sample after subsampling.
seed (int) – Random seed for reproducibility.
allow_download (bool) – If True, download raw data from GEO when not found locally.
force_reprocess (bool) – If True, reprocess even when a cached file exists.
- Returns:
The processed AnnData object.
- Return type:
AnnData
Notes
The raw data is automatically downloaded from GEO when allow_download=True: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116256
Reference: van Galen et al., Cell 2019.
Examples
>>> import sctrial
>>> adata = sctrial.load_aml(allow_download=True)
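The interaction between max_cells_per_sample and seed can be sketched with pandas and NumPy. This is an illustration of seeded per-sample subsampling in general, not the loader's actual code: subsample_per_sample and the "sample" column name are hypothetical.

```python
import numpy as np
import pandas as pd


def subsample_per_sample(obs, sample_col, max_cells, seed=42):
    """Keep at most max_cells rows per sample, chosen reproducibly."""
    if max_cells is None:
        return obs  # no cap requested
    rng = np.random.default_rng(seed)
    keep = []
    # .indices maps each group label to the integer positions of its rows.
    for _, positions in obs.groupby(sample_col).indices.items():
        if len(positions) > max_cells:
            positions = rng.choice(positions, size=max_cells, replace=False)
        keep.extend(positions)
    # Sort positions so the original row order is preserved.
    return obs.iloc[np.sort(np.asarray(keep, dtype=int))]


obs = pd.DataFrame({"sample": ["S1"] * 5 + ["S2"] * 2, "cell": range(7)})
sub = subsample_per_sample(obs, "sample", max_cells=3)
```

Reusing the same seed yields the same subset, which is the usual reason for exposing a seed parameter alongside a cell cap.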
- sctrial.datasets.load_cart(data_dir: str | None = None, processed_name: str = 'gse290722_cart_processed.h5ad', max_cells_per_sample: int | None = None, seed: int = 42, allow_download: bool = False, force_reprocess: bool = False) → AnnData
Load the CAR-T cell therapy dataset (GSE290722).
This dataset contains pre/post-CAR-T infusion samples with cell-type annotations and treatment-response metadata from the ZUMA-1 trial.
- Parameters:
data_dir (str or None) – Directory containing (or to store) the data files. Raw files go in <data_dir>/raw/ and the processed cache in <data_dir>/processed/.
processed_name (str) – Filename for the cached processed h5ad file.
max_cells_per_sample (int or None) – Maximum cells to keep per sample after subsampling.
seed (int) – Random seed for reproducibility.
allow_download (bool) – If True, download raw data from GEO when not found locally.
force_reprocess (bool) – If True, reprocess even when a cached file exists.
- Returns:
The processed AnnData object.
- Return type:
AnnData
Notes
The raw data is automatically downloaded from GEO when allow_download=True: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE290722
Reference: GSE290722 CAR-T therapy dataset (ZUMA-1 trial).
Examples
>>> import sctrial
>>> adata = sctrial.load_cart(allow_download=True)
- sctrial.datasets.load_sade_feldman(data_dir: str | None = None, processed_name: str = 'sade_feldman_processed_v6.h5ad', max_cells_per_participant_visit: int | None = None, seed: int = 42, allow_download: bool = False, force_reprocess: bool = False) → AnnData
Load and preprocess the Sade-Feldman melanoma immunotherapy dataset (GSE120575).
- Parameters:
data_dir (str or None) – Directory containing (or to store) the raw data files.
processed_name (str) – Filename for the cached processed h5ad file.
max_cells_per_participant_visit (int or None) – Maximum number of cells to retain per participant-visit pair.
seed (int) – Random seed for reproducibility.
allow_download (bool) – If True, download missing files from GEO automatically.
force_reprocess (bool) – If True, reprocess even when a cached file exists.
- Returns:
The processed AnnData object.
- Return type:
AnnData
- sctrial.datasets.load_stephenson_data(data_dir: str | None = None, processed_name: str = 'stephenson_covid19_v3.h5ad', seed: int = 42, allow_download: bool = False, force_reprocess: bool = False, *, data_path: str | None = None) → AnnData
Load and preprocess the Stephenson COVID-19 dataset (E-MTAB-10026).
- Parameters:
data_dir (str or None) – Directory containing (or to store) the raw data files.
processed_name (str) – Filename for the cached processed h5ad file.
seed (int) – Random seed for reproducibility.
allow_download (bool) – If True, download the data file automatically when missing.
force_reprocess (bool) – If True, reprocess even when a cached file exists.
data_path (str or None) –
Deprecated since version 0.2.2: Use data_dir instead. When supplied, data_dir is ignored and the parent directory of data_path is used.
- Returns:
The processed AnnData object.
- Return type:
AnnData
- sctrial.datasets.load_vaccine_gse171964(data_dir: str | None = None, processed_name: str = 'vaccine_gse171964.h5ad', max_participants: int | None = None, max_cells_per_group: int | None = None, seed: int = 42, allow_download: bool = False, force_reprocess: bool = False) → AnnData
Load and preprocess the GSE171964 PBMC vaccine time-course data (Day 0 vs Day 7).
- Parameters:
data_dir (str or None) – Directory containing (or to store) the raw data files.
processed_name (str) – Filename for the cached processed h5ad file.
max_participants (int or None) – Maximum number of participants to retain.
max_cells_per_group (int or None) – Maximum number of cells per participant-day-celltype group.
seed (int) – Random seed for reproducibility.
allow_download (bool) – If True, download missing files from GEO automatically.
force_reprocess (bool) – If True, reprocess even when a cached file exists.
- Returns:
The processed AnnData object.
- Return type:
AnnData
- sctrial.datasets.verify_paired_participants(obs: DataFrame, visit_col: str, visits: Sequence[str], features: Sequence[str] | None = None, participant_col: str = 'participant_id') → dict
Validate paired participants by visit presence and optional feature completeness.
- Parameters:
obs – DataFrame containing the participant-visit data.
visit_col – Name of the column in obs that holds the visit labels.
visits – Sequence of visit labels to check (e.g. ["baseline", "followup"]).
features – Optional sequence of feature names to check for non-NaN values.
participant_col – Name of the column in obs that holds the participant IDs.
- Returns:
A dictionary containing the following keys:
  - paired_ids: set of participant IDs with both visits (and non-NaN features if provided)
  - dropped_ids: list of participant IDs dropped by validation
  - n_paired: count of paired_ids
  - n_total: total unique participants
- Return type:
dict
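The shape of the returned dictionary can be illustrated with a small pandas sketch. It is not the library's implementation: verify_paired_sketch is a hypothetical name, features are assumed to be columns of obs, and a participant is assumed to count as paired when it has at least one row with complete features at every listed visit.

```python
import pandas as pd


def verify_paired_sketch(obs, visit_col, visits, features=None,
                         participant_col="participant_id"):
    """Return paired/dropped participant IDs in the documented dict layout."""
    rows = obs[obs[visit_col].isin(list(visits))]
    if features:
        # Assumption: rows with any NaN feature do not count toward pairing.
        rows = rows.dropna(subset=list(features))
    # Number of distinct visits at which each participant still has data.
    visits_seen = rows.groupby(participant_col)[visit_col].nunique()
    paired_ids = set(visits_seen[visits_seen == len(set(visits))].index)
    all_ids = set(obs[participant_col].dropna().unique())
    return {
        "paired_ids": paired_ids,
        "dropped_ids": sorted(all_ids - paired_ids),
        "n_paired": len(paired_ids),
        "n_total": len(all_ids),
    }


obs = pd.DataFrame(
    {
        "participant_id": ["P1", "P1", "P2", "P2", "P3"],
        "visit": ["baseline", "followup", "baseline", "followup", "baseline"],
        "score": [1.0, 2.0, 1.5, float("nan"), 0.5],
    }
)
by_visit = verify_paired_sketch(obs, "visit", ["baseline", "followup"])
by_feature = verify_paired_sketch(obs, "visit", ["baseline", "followup"], features=["score"])
```

In this toy table P3 is dropped for lacking a follow-up visit, and P2 is additionally dropped once the score feature is required, because its follow-up value is NaN.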