User Guide
The metrics can either be used to evaluate a dataset or to evaluate the integration performance on different representations of the data. Data representations can be one of the following:
Feature space: a feature-level expression matrix (usually normalised & log-transformed counts or corrected counts from integration)
Embedding space: PCA of the feature-level expression matrix or an embedding from an integration method
kNN graph space: a kNN graph on a PCA or an embedding
Unintegrated feature matrices can simply be reduced to a PCA embedding or kNN graph using the according scanpy
functions or the wrapper reduce_data()
.
The integrated output can come in one of the three representations, which are denoted as “native representation” in the
figure below.
Each metric assumes that the data representation it requires is available in the anndata
object.
If that is not the case, the data needs to be preprocessed accordingly or, for cases where that is not possible, the
metric cannot be applied to that data representation.
This means that native representations in kNN graph space can only be evaluated by graph based metrics, while embedding space outputs can be evaluated by embedding and graph based metrics, the latter requiring processing for the kNN graph. From a feature space native representation one can compute embedding space (PCA on expression matrix, optionally with highly variable gene selection) and kNN graph (on PCA) and all metrics are applicable. These relationships are visualised in the figure below.
The data representation not only determines which metrics can be evaluated, but also how they are computed.
For instance, the principle component regression pcr_comparison()
requires a PCA and variance
contributions, which would classically be computed a full feature matrix but can also be computed on an integrated
embedding matrix.
kNN graph output
KNN graph outputs can only be evaluated with metrics that use a kNN graph as input.
The graph must be located in adata.obsp['neighbors']
and adata.obsp['connectivities']
.
Some of these methods require cluster assignments to compare against known cell labels, which requires clustering.
We recommend to use resolutions that are optimised to the hierarchy of the cell labels for a more representative
cluster assignment.
scib.me.cluster_optimal_resolution(adata, cluster_key="cluster", label_key="celltype")
Embedding output
The embedding should be stored in adata.obsm
, by default under key 'X_emb'
.
If the metric requires an embedding, no preprocessing is needed.
Some metrics require the following:
kNN graph
Clustering with optimised resolution
sc.pp.neighbors(adata, use_rep="X_emb")
scib.me.cluster_optimal_resolution(adata, cluster_key="cluster", label_key="celltype")
Feature output
Feature output processing contains the same steps that are computed on unintegrated data before integration. Feature output requires processing of the count matrix in the following steps:
Highly variable gene selection (skip, if working on feature space subset)
PCA
kNN graph
Clustering with optimised resolution
Note that, depending on the metric, not all steps are required. If the metrics uses an expression matrix, no preprocessing is needed, while the PCA should be used for all metrics that evaluate an embedding.
# use wrapper function for the first 3 steps
scib.pp.reduce_data(
adata, n_top_genes=2000, batch_key="batch", pca=True, neighbors=False
)
scib.me.cluster_optimal_resolution(adata, cluster_key="cluster", label_key="celltype")