API

Preprocessing

Preprocessing functions are relevant both for preparing the data for integration and for postprocessing the integration output.

The most relevant preprocessing steps are:

Normalization
Scaling, batch-aware
Highly variable gene selection, batch-aware
Cell cycle scoring
Principal component analysis (PCA)
k-nearest neighbor graph (kNN graph)
UMAP
Clustering

Note that some preprocessing steps depend on each other. Please refer to the Single Cell Best Practices Book for more details.
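The dependency between steps can be made explicit with a small ordering sketch. The dependency table below is an assumption for illustration (a typical ordering, not something scib enforces), and `DEPENDS_ON` and `preprocessing_order` are hypothetical names.

```python
from graphlib import TopologicalSorter

# Assumed dependencies: each step maps to the steps that should run before it.
# This reflects a typical workflow and is illustrative only.
DEPENDS_ON = {
    "normalization": set(),
    "hvg_selection": {"normalization"},
    "cell_cycle_scoring": {"normalization"},
    "scaling": {"normalization"},
    "pca": {"hvg_selection", "scaling"},
    "knn_graph": {"pca"},
    "umap": {"knn_graph"},
    "clustering": {"knn_graph"},
}


def preprocessing_order(depends_on=DEPENDS_ON):
    """Return one execution order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


order = preprocessing_order()
print(order)
```

Any order returned this way runs normalization first and builds the kNN graph before UMAP and clustering.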

`normalize`(adata[, min_mean, log, ...])	Normalise counts using the `scran` normalisation method
`scale_batch`(adata, batch)	Batch-aware scaling of count matrix
`hvg_intersect`(adata, batch[, target_genes, ...])	Highly variable gene selection
`hvg_batch`(adata[, batch_key, target_genes, ...])	Batch-aware highly variable gene selection
`score_cell_cycle`(adata[, organism])	Compute cell cycle scores for a given organism
`get_cell_cycle_genes`(organism)	Get cell cycle genes for a given organism
`reduce_data`(adata[, batch_key, flavor, ...])	Apply feature selection and dimensionality reduction steps.

Integration

Integration method functions require the preprocessed anndata object (here adata) and the name of the batch column in adata.obs (here 'batch'). Each method is called as follows, where integration_method is the name of the integration method.

scib.ig.integration_method(adata, batch="batch")

For example, to run Scanorama on a dataset, call:

scib.ig.scanorama(adata, batch="batch")
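Because every method shares the `method(adata, batch=...)` calling convention, a method can also be selected by name at runtime. The following is a minimal sketch; `run_integration` is a hypothetical helper, and a stand-in namespace replaces `scib.ig` so that the example runs without scib installed.

```python
from types import SimpleNamespace


def run_integration(module, method_name, adata, batch="batch"):
    """Look up `method_name` on `module` and call it as method(adata, batch=...)."""
    method = getattr(module, method_name, None)
    if not callable(method):
        raise ValueError(f"unknown integration method: {method_name!r}")
    return method(adata, batch=batch)


# Stand-in for scib.ig (assumption: only used to make the sketch runnable).
fake_ig = SimpleNamespace(
    scanorama=lambda adata, batch: ("scanorama", adata, batch),
)

result = run_integration(fake_ig, "scanorama", adata="<AnnData>", batch="batch")
print(result)
```

With scib installed, the same helper would be called as `run_integration(scib.ig, "scanorama", adata, batch="batch")`.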

Warning

The following notation is deprecated.

scib.integration.runIntegrationMethod(adata, batch="batch")

Please use the snake_case naming without the run prefix.

Some integration methods (e.g. scgen(), scanvi()) also use cell type labels as input. For these, you need to additionally provide the corresponding label column of adata.obs (here cell_type).

scib.ig.scgen(adata, batch="batch", cell_type="cell_type")
scib.ig.scanvi(adata, batch="batch", labels="cell_type")
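Note that the two calls above use different keyword names for the label column (`cell_type` for scgen, `labels` for scanvi). A small lookup, sketched below, keeps calling code uniform; `LABEL_KEYWORD` and `integration_kwargs` are hypothetical names, and the mapping covers only the two methods shown here.

```python
# Keyword used for the label column, taken from the two calls above.
LABEL_KEYWORD = {"scgen": "cell_type", "scanvi": "labels"}


def integration_kwargs(method_name, batch="batch", label_column=None):
    """Build keyword arguments for an integration call.

    Label-based methods get the label column under their own keyword;
    all other methods only receive `batch`.
    """
    kwargs = {"batch": batch}
    keyword = LABEL_KEYWORD.get(method_name)
    if keyword is not None:
        if label_column is None:
            raise ValueError(f"{method_name} needs a label column")
        kwargs[keyword] = label_column
    return kwargs


print(integration_kwargs("scgen", label_column="cell_type"))
print(integration_kwargs("scanvi", label_column="cell_type"))
print(integration_kwargs("scanorama"))
```

The resulting dictionary can be unpacked into the call, for example `scib.ig.scanvi(adata, **integration_kwargs("scanvi", label_column="cell_type"))`.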

Functions

`bbknn`(adata, batch[, hvg])	BBKNN wrapper function
`combat`(adata, batch)	ComBat wrapper function (`scanpy` implementation)
`desc`(adata, batch[, res, ncores, tmp_dir, ...])	DESC wrapper function
`harmony`(adata, batch[, hvg])	Harmony wrapper function
`mnn`(adata, batch[, hvg])	MNN wrapper function (`mnnpy` implementation)
`saucie`(adata, batch)	SAUCIE wrapper function
`scanorama`(adata, batch[, hvg])	Scanorama wrapper function
`scanvi`(adata, batch, labels[, hvg, max_epochs])	scANVI wrapper function
`scgen`(adata, batch, cell_type[, epochs, hvg])	scGen wrapper function
`scvi`(adata, batch[, hvg, return_model, ...])	scVI wrapper function
`trvae`(adata, batch[, hvg])	trVAE wrapper function
`trvaep`(adata, batch[, hvg])	trVAE wrapper function (`pytorch` implementation)

Clustering

After integration, one of the first ways to determine the quality of the integration is to cluster the integrated data and compare the clusters to the original annotations. This is exactly what some of the metrics do.

`cluster_optimal_resolution`(adata, label_key, ...)	Optimised clustering
`get_resolutions`([n, min, max])	Get equally spaced resolutions for optimised clustering
`opt_louvain`(adata, label_key, cluster_key[, ...])	Optimised Louvain clustering
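The idea behind optimised clustering can be sketched as a search over candidate resolutions that keeps the one whose clustering scores best against the annotations. The code below is a conceptual sketch only, not scib's implementation: `equally_spaced` and `best_resolution` are hypothetical helpers, and the clustering and scoring functions are toy stand-ins for, say, Louvain clustering and an agreement score such as NMI.

```python
def equally_spaced(n=10, low=0.1, high=2.0):
    """Return n equally spaced values from low to high (inclusive)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    step = (high - low) / (n - 1)
    return [round(low + i * step, 6) for i in range(n)]


def best_resolution(resolutions, cluster_fn, score_fn):
    """Cluster at each resolution and return (resolution, score) of the best one."""
    best = None
    for res in resolutions:
        score = score_fn(cluster_fn(res))
        if best is None or score > best[1]:
            best = (res, score)
    return best


# Toy stand-ins: the "clustering" is just the resolution itself, and the
# score peaks at resolution 1.0.
candidates = equally_spaced(n=5, low=0.0, high=2.0)
chosen = best_resolution(
    candidates,
    cluster_fn=lambda res: res,
    score_fn=lambda clustering: 1 - (clustering - 1.0) ** 2,
)
print(candidates, chosen)
```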

Metrics

This package contains all the metrics used for benchmarking scRNA-seq data integration performance. They can be applied on the integrated as well as the unintegrated data and can be classified into biological conservation and batch removal metrics. For a detailed description of the metrics implemented in this package, please see our publication.

Most metrics require specific, preprocessed inputs, as described in detail in the User Guide.

Biological Conservation Metrics

Biological conservation metrics quantify either how well clusterings of the integration output agree with the annotated labels, or the difference between the feature spaces of integrated and unintegrated data. Each metric is scaled to a value ranging from 0 to 1 by default, where larger scores represent better conservation of the biological aspect that the metric addresses.

`ari`(adata, cluster_key, label_key[, ...])	Adjusted Rand Index
`cell_cycle`(adata_pre, adata_post, batch_key)	Cell cycle conservation score
`clisi_graph`(adata, label_key, type_[, ...])	Cell-type LISI (cLISI) score
`hvg_overlap`(adata_pre, adata_post, batch_key)	Highly variable gene overlap
`isolated_labels_asw`(adata, label_key, ...[, ...])	Isolated label score ASW
`isolated_labels_f1`(adata, label_key, ...[, ...])	Isolated label score F1
`nmi`(adata, cluster_key, label_key[, ...])	Normalized mutual information
`silhouette`(adata, label_key, embed[, ...])	Average silhouette width (ASW)
`trajectory_conservation`(adata_pre, ...[, ...])	Trajectory conservation score
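As an illustration of a cluster-based conservation metric, the Adjusted Rand Index compares a clustering with the annotated labels by counting pairs of cells that are grouped together in both. The following is a minimal pure-Python sketch of the standard ARI formula, not scib's implementation; `adjusted_rand_index` is a hypothetical name. Note that the raw ARI can drop below 0 when agreement is worse than expected by chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of cells that share a group in both labelings, and in each one.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put all cells in one group.
        return 1.0
    return (sum_both - expected) / (maximum - expected)


same = adjusted_rand_index(["T", "T", "B", "B"], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

Cluster names do not need to match the label names; only the grouping of cells matters.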

Batch Correction Metrics

Batch correction metric values are scaled between 0 and 1 by default, where larger scores represent better batch removal.

`graph_connectivity`(adata, label_key)	Graph Connectivity
`ilisi_graph`(adata, batch_key, type_[, ...])	Integration LISI (iLISI) score
`kBET`(adata, batch_key, label_key, type_[, ...])	kBET score
`pcr_comparison`(adata_pre, adata_post, covariate)	Principal component regression score
`silhouette_batch`(adata, batch_key, ...[, ...])	Batch ASW
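To give a sense of how such a score is built, graph connectivity can be sketched as follows: for each label, restrict the kNN graph to the cells with that label, take the fraction of those cells that lie in the largest connected component, and average the fractions over labels. This is a conceptual pure-Python sketch under the assumption of an undirected adjacency dictionary; it is not scib's implementation, and the function names are hypothetical.

```python
from collections import deque


def largest_component_size(nodes, neighbors):
    """Size of the largest connected component of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen = set()
    best = 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search within the label's cells
            node = queue.popleft()
            size += 1
            for nb in neighbors.get(node, ()):
                if nb in nodes and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best


def graph_connectivity(neighbors, labels):
    """Mean over labels of the fraction of cells in the largest component."""
    by_label = {}
    for cell, label in labels.items():
        by_label.setdefault(label, []).append(cell)
    fractions = [
        largest_component_size(cells, neighbors) / len(cells)
        for cells in by_label.values()
    ]
    return sum(fractions) / len(fractions)


# Toy graph: all T cells are connected; B cell 5 only touches a T cell.
neighbors = {0: [1, 5], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: [0]}
labels = {0: "T", 1: "T", 2: "T", 3: "B", 4: "B", 5: "B"}
score = graph_connectivity(neighbors, labels)
print(score)
```

In the toy graph the T cells contribute 3/3 and the B cells 2/3, so the score is their mean.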

Metrics Wrapper Functions

For convenience, scib provides wrapper functions that, given integrated and unintegrated adata objects, apply multiple metrics and return all the results in a pandas.DataFrame. The main function is metrics(), which provides all the parameters for the different metrics.

scib.metrics.metrics(adata, adata_int, ari=True, nmi=True)

The remaining functions wrap metrics() with preconfigured subsets of metrics, chosen by expected computation time:

metrics_fast() only computes metrics that require little preprocessing
metrics_slim() includes all functions of metrics_fast() and adds clustering-based metrics
metrics_all() includes all metrics

`metrics`(adata, adata_int, batch_key, label_key)	Master metrics function
`metrics_fast`(adata, adata_int, batch_key, ...)	Only metrics with minimal preprocessing and runtime
`metrics_slim`(adata, adata_int, batch_key, ...)	All metrics apart from kBET and LISI scores
`metrics_all`(adata, adata_int, batch_key, ...)	All metrics
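The pattern of a master function with per-metric flags plus preset wrappers can be sketched as below. Everything in this example is hypothetical: the toy metric functions, the `run_metrics` name, and the preset contents are assumptions for illustration and do not reflect which metrics the scib presets actually enable.

```python
from functools import partial

# Toy metric functions; each takes (adata, adata_int) and returns a float.
TOY_METRICS = {
    "ari": lambda adata, adata_int: 0.8,
    "nmi": lambda adata, adata_int: 0.7,
    "kbet": lambda adata, adata_int: 0.5,
}


def run_metrics(adata, adata_int, registry=TOY_METRICS, **flags):
    """Run every metric whose flag is True and collect the results by name."""
    unknown = set(flags) - set(registry)
    if unknown:
        raise TypeError(f"unknown metric flags: {sorted(unknown)}")
    return {
        name: fn(adata, adata_int)
        for name, fn in registry.items()
        if flags.get(name, False)
    }


# Presets fix the flags once so callers do not have to repeat them.
run_metrics_slim = partial(run_metrics, ari=True, nmi=True)
run_metrics_all = partial(run_metrics, ari=True, nmi=True, kbet=True)

print(run_metrics(None, None, ari=True))
print(run_metrics_slim(None, None))
print(run_metrics_all(None, None))
```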

Auxiliary Functions

Some parts of the metrics can be used individually; these are listed below.

`cluster_optimal_resolution`(adata, label_key, ...)	Optimised clustering
`get_resolutions`([n, min, max])	Get equally spaced resolutions for optimised clustering
`lisi_graph`(adata, batch_key, label_key, **kwargs)	cLISI and iLISI scores
`pcr`(adata, covariate[, embed, n_comps, ...])	Principal component regression for anndata object
`pc_regression`(data, covariate[, pca_var, ...])	Principal component regression

PCR Regression Backends

The principal component regression metric can use multiple linear regression backends. These helpers are exposed here for advanced usage and benchmarking.

`linreg_sklearn`(X_pca, covariate[, n_jobs])	Sequential sklearn regression backend for PCR.
`linreg_multiple_sklearn`(X_pca, covariate[, ...])	Multi-output sklearn regression backend for PCR.
`linreg_multiple_np`(X_pca, covariate[, n_jobs])	Compute per-PC \(R^2\) with a dense numpy regression backend.
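Conceptually, principal component regression fits each principal component against the covariate, records the resulting R^2, and combines the per-PC values weighted by the variance each PC explains. The pure-Python sketch below illustrates this for a categorical covariate, where a linear regression on the one-hot encoded covariate predicts the group mean. It is an illustration under these assumptions and does not mirror any of the backends listed above; the function names are hypothetical.

```python
def r_squared_categorical(values, groups):
    """R^2 of regressing `values` on a categorical covariate.

    With a one-hot encoded covariate, the fitted value of each cell is the
    mean of its group, so R^2 = 1 - SS_within / SS_total.
    """
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    means = {g: sums[g] / counts[g] for g in sums}
    ss_within = sum((v - means[g]) ** 2 for v, g in zip(values, groups))
    return 1 - ss_within / ss_total


def pc_regression_score(pc_scores, pc_variance, covariate):
    """Variance-weighted mean of per-PC R^2 values."""
    r2 = [r_squared_categorical(pc, covariate) for pc in pc_scores]
    return sum(r * v for r, v in zip(r2, pc_variance)) / sum(pc_variance)


batch = ["a", "a", "b", "b"]
pc1 = [-1.0, -1.0, 1.0, 1.0]  # fully explained by batch
pc2 = [-1.0, 1.0, -1.0, 1.0]  # unrelated to batch
score = pc_regression_score([pc1, pc2], pc_variance=[3.0, 1.0], covariate=batch)
print(score)
```

Here the first PC carries three quarters of the variance and is fully explained by batch, so the combined score is 0.75.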