API

Preprocessing

Preprocessing functions are used both to prepare the data for integration and to postprocess the integration output.

The most relevant preprocessing steps are:

  • Normalization

  • Scaling, batch-aware

  • Highly variable gene selection, batch-aware

  • Cell cycle scoring

  • Principal component analysis (PCA)

  • k-nearest neighbor graph (kNN graph)

  • UMAP

  • Clustering

Note that some preprocessing steps depend on each other. Please refer to the Single Cell Best Practices Book for more details.

normalize(adata[, min_mean, log, ...])

Normalise counts using the scran normalisation method

scale_batch(adata, batch)

Batch-aware scaling of count matrix

hvg_intersect(adata, batch[, target_genes, ...])

Highly variable gene selection

hvg_batch(adata[, batch_key, target_genes, ...])

Batch-aware highly variable gene selection

score_cell_cycle(adata[, organism])

Compute cell cycle scores for a given organism

reduce_data(adata[, batch_key, flavor, ...])

Apply feature selection and dimensionality reduction steps.
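
To illustrate what "batch-aware" means for scaling, the sketch below standardises each gene to zero mean and unit variance within each batch separately, rather than across the whole matrix. It is a plain-Python illustration of the idea only: scale_per_batch is a hypothetical helper, not scib's scale_batch(), which operates on an AnnData object and may differ in details such as the variance estimator.

```python
from collections import defaultdict
from statistics import mean, pstdev


def scale_per_batch(matrix, batches):
    """Standardise every column (gene) separately within each batch.

    matrix  -- list of rows (cells), each a list of floats (genes)
    batches -- one batch label per row
    """
    # Collect the row indices that belong to each batch.
    rows_by_batch = defaultdict(list)
    for i, b in enumerate(batches):
        rows_by_batch[b].append(i)

    n_genes = len(matrix[0])
    scaled = [[0.0] * n_genes for _ in matrix]
    for rows in rows_by_batch.values():
        for g in range(n_genes):
            values = [matrix[i][g] for i in rows]
            mu, sd = mean(values), pstdev(values)
            for i in rows:
                # Genes that are constant within a batch are set to 0.
                scaled[i][g] = (matrix[i][g] - mu) / sd if sd > 0 else 0.0
    return scaled


# Toy data: gene 0 differs strongly in magnitude between batches.
counts = [[1.0, 10.0], [3.0, 10.0], [100.0, 5.0], [300.0, 7.0]]
batch = ["a", "a", "b", "b"]
scaled = scale_per_batch(counts, batch)
```

After scaling, gene 0 takes comparable values in both batches even though its raw magnitude differs by two orders, which is the effect that scaling across the whole matrix would not achieve.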

Integration

Integration method functions require the preprocessed anndata object (here adata) and the name of the batch column in adata.obs (here 'batch'). The methods can be called using the following, where integration_method is the name of the integration method.

scib.ig.integration_method(adata, batch="batch")

For example, to run Scanorama on a dataset, call:

scib.ig.scanorama(adata, batch="batch")

Warning

The following notation is deprecated.

scib.integration.runIntegrationMethod(adata, batch="batch")

Please use the snake_case naming without the run prefix.

Some integration methods (e.g. scgen(), scanvi()) also use cell type labels as input. For these, you need to additionally provide the corresponding label column of adata.obs (here cell_type).

scib.ig.scgen(adata, batch="batch", cell_type="cell_type")
scib.ig.scanvi(adata, batch="batch", labels="cell_type")
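
Because the integration functions share the (adata, batch=...) calling convention, a method can be selected by name at run time. The sketch below shows that pattern with getattr. Note the assumptions: run_integration is a hypothetical helper, and fake_ig is a stand-in namespace with dummy functions so that the example runs without scib installed. With scib available, scib.ig would be passed in its place.

```python
from types import SimpleNamespace


def run_integration(namespace, method, adata, batch, **kwargs):
    """Look up an integration function by name and call it."""
    func = getattr(namespace, method, None)
    if not callable(func):
        raise ValueError(f"unknown integration method: {method!r}")
    return func(adata, batch=batch, **kwargs)


# Hypothetical stand-ins that only echo their arguments.
def _fake_scanorama(adata, batch):
    return {"method": "scanorama", "batch": batch}


def _fake_scgen(adata, batch, cell_type):
    return {"method": "scgen", "batch": batch, "cell_type": cell_type}


fake_ig = SimpleNamespace(scanorama=_fake_scanorama, scgen=_fake_scgen)

# Unsupervised method: only the batch column is needed.
unsupervised = run_integration(fake_ig, "scanorama", adata=None, batch="batch")

# Label-aware method: the label column is passed as an extra keyword.
supervised = run_integration(
    fake_ig, "scgen", adata=None, batch="batch", cell_type="cell_type"
)
```

Extra keyword arguments are forwarded unchanged, so the differing label parameter names (cell_type for scgen(), labels for scanvi()) remain the caller's responsibility.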

Functions

bbknn(adata, batch[, hvg])

BBKNN wrapper function

combat(adata, batch)

ComBat wrapper function (scanpy implementation)

desc(adata, batch[, res, ncores, tmp_dir, ...])

DESC wrapper function

harmony(adata, batch[, hvg])

Harmony wrapper function

mnn(adata, batch[, hvg])

MNN wrapper function (mnnpy implementation)

saucie(adata, batch)

SAUCIE wrapper function

scanorama(adata, batch[, hvg])

Scanorama wrapper function

scanvi(adata, batch, labels[, hvg, max_epochs])

scANVI wrapper function

scgen(adata, batch, cell_type[, epochs, hvg])

scGen wrapper function

scvi(adata, batch[, hvg, return_model, ...])

scVI wrapper function

trvae(adata, batch[, hvg])

trVAE wrapper function

trvaep(adata, batch[, hvg])

trVAE wrapper function (PyTorch implementation)

Clustering

After integration, one of the first ways to determine the quality of the integration is to cluster the integrated data and compare the clusters to the original annotations. This is exactly what some of the metrics do.

cluster_optimal_resolution(adata, label_key, ...)

Optimised clustering

get_resolutions([n, min, max])

Get equally spaced resolutions for optimised clustering

opt_louvain(adata, label_key, cluster_key[, ...])

Optimised Louvain clustering
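
The idea behind optimised clustering can be sketched as a simple search: cluster at several resolutions, score each result against the original annotations, and keep the best one. The code below is a minimal illustration under stated assumptions. equally_spaced_resolutions, best_resolution, rand_index and the toy binning "clustering" are all hypothetical stand-ins. The scib functions work on AnnData objects with graph-based clustering and may compute the resolution grid and the score differently.

```python
from itertools import combinations


def equally_spaced_resolutions(n=5, lo=0.1, hi=2.0):
    """Return n equally spaced values from lo to hi (inclusive)."""
    if n < 2:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def rand_index(labels, clusters):
    """Fraction of cell pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum(
        (labels[i] == labels[j]) == (clusters[i] == clusters[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def best_resolution(resolutions, cluster_fn, score_fn, labels):
    """Cluster at every resolution and keep the best-scoring result."""
    best = None
    for res in resolutions:
        clusters = cluster_fn(res)
        score = score_fn(labels, clusters)
        # Strict comparison: ties keep the lowest resolution.
        if best is None or score > best[1]:
            best = (res, score, clusters)
    return best


# Toy stand-in for clustering: bin 1-D values, finer bins at higher resolution.
values = [0.0, 0.1, 1.0, 1.1]
labels = ["A", "A", "B", "B"]


def toy_cluster(res):
    return [int(v * res) for v in values]


grid = equally_spaced_resolutions(n=3, lo=0.5, hi=1.5)
res, score, clusters = best_resolution(grid, toy_cluster, rand_index, labels)
```

In this toy case the lowest resolution merges everything into one cluster, while the middle resolution recovers the two annotated groups and is selected.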

Metrics

This package contains all the metrics used for benchmarking scRNA-seq data integration performance. They can be applied on the integrated as well as the unintegrated data and can be classified into biological conservation and batch removal metrics. For a detailed description of the metrics implemented in this package, please see our publication.

Most metrics require specific inputs that need to be preprocessed, which is described in detail under User Guide.

Biological Conservation Metrics

Biological conservation metrics are either cluster-based, evaluating clustering results computed on the integration output, or they quantify the difference between the feature spaces of integrated and unintegrated data. Each metric is scaled to a value ranging from 0 to 1 by default, where larger scores represent better conservation of the biological aspect that the metric addresses.

ari(adata, cluster_key, label_key[, ...])

Adjusted Rand Index

cell_cycle(adata_pre, adata_post, batch_key)

Cell cycle conservation score

clisi_graph(adata, label_key, type_[, ...])

Cell-type LISI (cLISI) score

hvg_overlap(adata_pre, adata_post, batch_key)

Highly variable gene overlap

isolated_labels_asw(adata, label_key, ...[, ...])

Isolated label score ASW

isolated_labels_f1(adata, label_key, ...[, ...])

Isolated label score F1

nmi(adata, cluster_key, label_key[, ...])

Normalized mutual information

silhouette(adata, label_key, embed[, ...])

Average silhouette width (ASW)

trajectory_conservation(adata_pre, ...[, ...])

Trajectory conservation score
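
As a concrete example of a cluster-based conservation metric, the Adjusted Rand Index compares a clustering with the annotated labels by counting cell pairs that are grouped together in both, corrected for chance agreement. The plain-Python sketch below follows the standard ARI formula. adjusted_rand_index is a hypothetical helper written for illustration, not the scib ari() function, which takes an AnnData object and column keys.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels, clusters):
    """Standard ARI between two partitions of the same cells."""
    total_pairs = comb(len(labels), 2)
    if total_pairs == 0:
        # Fewer than two cells: the partitions trivially agree.
        return 1.0

    # Pairs of cells that share both label and cluster.
    joint = Counter(zip(labels, clusters))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    # Pairs that share a label, and pairs that share a cluster.
    sum_labels = sum(comb(c, 2) for c in Counter(labels).values())
    sum_clusters = sum(comb(c, 2) for c in Counter(clusters).values())

    expected = sum_labels * sum_clusters / total_pairs
    maximum = (sum_labels + sum_clusters) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put all cells together.
        return 1.0
    return (sum_joint - expected) / (maximum - expected)


cell_types = ["T", "T", "B", "B"]
perfect = adjusted_rand_index(cell_types, [1, 1, 0, 0])  # same grouping, renamed
mixed = adjusted_rand_index(cell_types, [0, 1, 0, 1])    # clusters split every type
```

Cluster names do not matter, only the grouping: the first call scores 1, while the second clustering, which separates every cell type, scores below 0.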

Batch Correction Metrics

Batch correction metric values are scaled between 0 and 1 by default, with larger scores representing better batch removal.

graph_connectivity(adata, label_key)

Graph Connectivity

ilisi_graph(adata, batch_key, type_[, ...])

Integration LISI (iLISI) score

kBET(adata, batch_key, label_key, type_[, ...])

kBET score

pcr_comparison(adata_pre, adata_post, covariate)

Principal component regression score

silhouette_batch(adata, batch_key, ...[, ...])

Batch ASW
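
To make one of these scores concrete, graph connectivity can be sketched as follows: for each cell label, restrict the kNN graph to the cells with that label, take the fraction of those cells that fall in the largest connected component, and average the fractions over labels. The code below is a plain-Python sketch of that definition on an adjacency dictionary. The helper names are hypothetical, and the scib function itself works on the neighbor graph stored in an AnnData object.

```python
from collections import deque


def largest_component_size(nodes, neighbors):
    """Size of the largest connected component of the subgraph on `nodes`."""
    allowed = set(nodes)
    seen = set()
    largest = 0
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search that never leaves the allowed node set.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in neighbors.get(node, ()):
                if nb in allowed and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        largest = max(largest, size)
    return largest


def graph_connectivity_score(neighbors, labels):
    """Mean over labels of (largest component size / number of cells)."""
    cells_by_label = {}
    for cell, label in labels.items():
        cells_by_label.setdefault(label, []).append(cell)
    fractions = [
        largest_component_size(cells, neighbors) / len(cells)
        for cells in cells_by_label.values()
    ]
    return sum(fractions) / len(fractions)


# Undirected toy graph: cell 5 is a "B" cell whose only neighbor is a "T" cell.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
labels = {0: "T", 1: "T", 2: "T", 3: "B", 4: "B", 5: "B"}
score = graph_connectivity_score(neighbors, labels)
```

Here all "T" cells are connected (fraction 1), while only two of the three "B" cells are connected within their own label (fraction 2/3), so the score is the mean of the two.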

Metrics Wrapper Functions

For convenience, scib provides wrapper functions that, given integrated and unintegrated adata objects, apply multiple metrics and return all the results in a pandas.DataFrame. The main function is metrics(), which provides all the parameters for the different metrics.

scib.metrics.metrics(adata, adata_int, ari=True, nmi=True)

The remaining functions wrap metrics() as convenience functions with preconfigured subsets of metrics, chosen by expected computation time:

metrics(adata, adata_int, batch_key, label_key)

Master metrics function

metrics_fast(adata, adata_int, batch_key, ...)

Only metrics with minimal preprocessing and runtime

metrics_slim(adata, adata_int, batch_key, ...)

All metrics apart from kBET and LISI scores

metrics_all(adata, adata_int, batch_key, ...)

All metrics

Auxiliary Functions

Some parts of the metrics can be used individually. These are listed below.

cluster_optimal_resolution(adata, label_key, ...)

Optimised clustering

get_resolutions([n, min, max])

Get equally spaced resolutions for optimised clustering

lisi_graph(adata, batch_key, label_key, **kwargs)

cLISI and iLISI scores

pcr(adata, covariate[, embed, n_comps, ...])

Principal component regression for anndata object

pc_regression(data, covariate[, pca_var, ...])

Principal component regression
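
Principal component regression can be sketched as follows: regress each principal component on the covariate, record the R² of each fit, and combine the values weighted by the variance each component explains. The example below assumes a categorical covariate such as batch, for which R² equals the between-group sum of squares divided by the total sum of squares. All function names are hypothetical, and the variance weighting and the floor at zero in the before/after comparison are assumptions about how such a score is commonly defined. The scib source is the reference for the exact behaviour.

```python
from statistics import mean


def r_squared_categorical(scores, groups):
    """R^2 of regressing one PC on a categorical covariate.

    With a one-hot encoded covariate the least-squares fit predicts the
    group mean, so R^2 is between-group SS divided by total SS.
    """
    grand = mean(scores)
    total_ss = sum((s - grand) ** 2 for s in scores)
    if total_ss == 0:
        return 0.0
    by_group = {}
    for s, g in zip(scores, groups):
        by_group.setdefault(g, []).append(s)
    between_ss = sum(len(v) * (mean(v) - grand) ** 2 for v in by_group.values())
    return between_ss / total_ss


def pc_regression_score(pc_scores, pc_variances, groups):
    """Variance-weighted mean of the per-component R^2 values."""
    r2 = [r_squared_categorical(pc, groups) for pc in pc_scores]
    return sum(r * v for r, v in zip(r2, pc_variances)) / sum(pc_variances)


def pcr_comparison_score(pcr_before, pcr_after):
    """Relative drop in covariate-explained variance, floored at 0."""
    if pcr_before == 0:
        return 0.0
    return max(0.0, (pcr_before - pcr_after) / pcr_before)


batch = ["a", "a", "b", "b"]
variances = [3.0, 1.0]  # variance explained by PC1 and PC2

# Before integration the dominant PC separates the batches.
before = pc_regression_score(
    [[-1.0, -1.0, 1.0, 1.0], [-1.0, 1.0, -1.0, 1.0]], variances, batch
)
# After integration only the minor PC still tracks batch.
after = pc_regression_score(
    [[-1.0, 1.0, -1.0, 1.0], [-1.0, -1.0, 1.0, 1.0]], variances, batch
)
score = pcr_comparison_score(before, after)
```

In this toy example batch explains three quarters of the variance before integration and one quarter afterwards, giving a comparison score of two thirds.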