A plethora of SCO tools exist, and yet standards on how to benchmark or evaluate the accuracy of each tool are lacking. Furthermore, most benchmark efforts are focused on certain cell types or tissues.
Researchers in computational biology and other scientific fields often encounter the challenge of selecting the most appropriate computational methods to analyze their data. To determine the strengths of each method or to make recommendations on the best options for an analysis, benchmarking studies are conducted, which rigorously compare the performance of different methods using well-characterized benchmark datasets. However, to ensure that the results are reliable, unbiased, and informative, benchmarking studies must be carefully designed and executed. Weber et al. (2019) provided a set of practical guidelines and recommendations for conducting benchmarking analyses of high quality.
As single-cell technologies continue to advance, an increasing number of analysis tools are becoming available to researchers. Consequently, there is a growing need for datasets and methods that support systematic benchmarking and evaluation of these tools. Validating and benchmarking analysis tools for single-cell measurements is part of the “Eleven grand challenges in single-cell data science” [Lähnemann et al. (2020)].
Benchmarking can be approached at two different levels:
The validation and comparison of tools on simulated data is an intriguing area. Although simulators have several limitations, there has been a significant effort in the past few years to enhance their power, as noted by Cao et al. (2021). scDesign3, a simulator recently published by Song et al. (2023), is, according to its developers, a versatile tool capable of overcoming some of the limitations of other existing simulators.
Over the past few years, there has been a significant increase in the number of published single-cell omics studies, which serve as valuable resources for benchmark experiments. In particular, Svensson et al. (2020) have compiled a comprehensive collection of single-cell omics datasets with manually curated metadata. As part of the implementation study of the single-cell omics community, we are creating a series of datasets specifically designed for benchmarking computational tools that focus on single-cell tumor heterogeneity.
Tumor heterogeneity, where distinct cancer cells exhibit diverse morphological and phenotypic profiles, including gene expression, metabolism, and proliferation, poses challenges for molecular prognostic markers and patient classification for targeted therapies. Various omics technologies, such as bulk [Babu & Snyder (2023)] and single-cell omics [Flynn et al. (2023)] approaches, have enabled the characterization of diverse molecular layers at an unprecedented scale and resolution, offering a comprehensive perspective on the behavior of tumors. The integration of multiple omics datasets enables systematic exploration of diverse molecular information [Yue et al. (2023)] at each biological layer, but also presents challenges in extracting meaningful insights from the exponentially growing volume of multi-omics data. To address this challenge, efficient algorithms are required to dig into the data and reveal the underlying complexities of cancer’s intricate biological processes. The past few years have seen a proliferation of new computational methods for analyzing single-cell omics data, which can make it challenging to select the most appropriate tool for a particular task. As a result, it is crucial to establish benchmarking platforms [Mangul et al. (2019), Decamps et al. (2021), OpenEBench, Omnibenchmark, Knight et al. (2023)] and datasets [Tian et al. (2018), Refine.bio] in order to create a controlled environment for the validation of bioinformatics tools in the field of single-cell omics analysis.
CELLxGENE is a suite of tools that helps scientists find, download, explore, analyze, annotate, and publish single-cell datasets. It offers a wide set of published single-cell experiments for download in h5 or Seurat (v3) format.
Data were generated using 10XGenomics v2 chemistry. The raw count table was provided without the association between cells and cell lines. We assigned a cell line name to each cell based on the similarity between single-cell clusters and bulk cell-line data from the CCLE database. The annotated count table and the full annotation procedure are available in a figshare dataset [doi.org/10.6084/m9.figshare.23274413.v1].
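The idea of labelling clusters by their similarity to bulk profiles can be sketched as follows. This is only an illustration, not the procedure deposited on figshare: it assumes Pearson correlation as the similarity measure, a shared gene order between the two inputs, and toy expression values. The function names `pearson` and `annotate_clusters` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant vector has no defined correlation; treat it as 0 here.
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def annotate_clusters(cluster_profiles, bulk_profiles):
    """Label each cluster with the bulk cell line it correlates with best.

    Both arguments map a name to a list of expression values, and all
    lists are assumed to follow the same gene order.
    """
    labels = {}
    for cluster, profile in cluster_profiles.items():
        scores = {line: pearson(profile, ref) for line, ref in bulk_profiles.items()}
        labels[cluster] = max(scores, key=scores.get)
    return labels


# Toy data (made up for illustration): four genes, two bulk references.
bulk = {"A549": [10.0, 0.0, 5.0, 1.0], "PC9": [0.0, 8.0, 1.0, 6.0]}
clusters = {"c0": [9.0, 1.0, 4.0, 0.5], "c1": [0.2, 7.0, 1.5, 5.0]}
labels = annotate_clusters(clusters, bulk)
```

In practice the cluster profile would be a pseudo-bulk (for example the mean expression of all cells in a cluster), and a rank-based correlation may be more robust to platform differences.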
The NCBI GEO repository hosts single-cell data for the untreated PC9 lung cancer cell line, generated in two different labs using both the 10XGenomics and Drop-seq platforms, from either in vitro cultures or xenograft experiments.
Because of their differences (platform, growth condition, lab), these datasets are an ideal instrument for benchmarking batch-effect removal methods as well as integration methods. The seven sets are available as datasets with the same gene annotation (ensemblID:Symbol). Each set has the cellID with the extension _s(1:5) and set6 has two extensions, _s6PC9 and _s6U937. These datasets and the script used for annotation are available in a figshare repository (10.6084/m9.figshare.23626407).
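Since the set of origin is encoded in the cellID extension, a batch label for each cell can be recovered from the identifier alone. The sketch below assumes that the extension is always the last part of the cellID and has the form `_s<number>` optionally followed by a cell line name, as in `_s6PC9`. The barcodes used are invented, and `parse_cell_id` is a hypothetical helper, not part of the deposited script.

```python
import re
from collections import Counter

# Assumed suffix layout: "_s" + set number + optional cell line name.
SUFFIX = re.compile(r"_s(\d+)([A-Za-z0-9]*)$")


def parse_cell_id(cell_id):
    """Return (set label, cell line or None) parsed from a cellID suffix."""
    match = SUFFIX.search(cell_id)
    if match is None:
        raise ValueError(f"no set extension found in {cell_id!r}")
    set_number, cell_line = match.groups()
    return f"set{set_number}", cell_line or None


# Invented barcodes, used only to show the parsing.
cell_ids = [
    "AAACCTGAGAAACCAT-1_s1",
    "AAACCTGAGAAACCTA-1_s1",
    "TTTGTCAAGCGTCAAG-1_s6PC9",
    "TTTGTCAAGCTGAACG-1_s6U937",
]
parsed = [parse_cell_id(cid) for cid in cell_ids]
cells_per_set = Counter(set_label for set_label, _ in parsed)
```

The resulting set labels can then be passed as the batch covariate to whichever batch-removal or integration method is being benchmarked.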
Gavish and collaborators [Nature 2023] have recently curated, annotated and integrated the data from 77 different single-cell transcriptomics studies, encompassing a total of 1,163 tumour samples covering 24 tumour types and more than 23 million cells. The data are accessible at the 3CA site.
As part of the Single Cell Community implementation study, we are focusing on providing a set of benchmark experiments to address the extraction of biological knowledge from “controlled” cancer heterogeneity.
We have performed a 10XGenomics scRNAseq experiment including the following elements:
The above figure describes the driver genes associated with each cell line. Only a minimal part of the connections is shown, to keep the image readable. The full list of the interactions depicted by IPA is available at figshare [10.6084/m9.figshare.23284748]. All driver genes have been observed in resistance occurring upon treatment of EGFR-mutated lung cancers with Osimertinib [Gomatou et al. (2023)].
The experiment was performed using CellPlex technology from 10XGenomics, which allows samples to be multiplexed into a single channel and therefore removes unwanted batch effects.
The count tables from the entire BE1 experiment are available through an R Shiny app, allowing users to construct datasets encompassing different cell lines at varying ratios. An R package providing the same functionality as the R Shiny app is available at github.com/kendomaniac/BE1.
The above figure shows the sequencing statistics of the 7 cell lines.
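Constructing a dataset with cell lines at chosen ratios amounts to subsampling cells per cell line. The sketch below illustrates that logic only; it does not reproduce the interface of the R Shiny app or of the BE1 R package. The function `sample_by_ratio`, the barcodes and the pool sizes are all hypothetical.

```python
import random


def sample_by_ratio(cells_by_line, ratios, total, seed=0):
    """Draw cells per cell line so that the counts follow the given ratios.

    cells_by_line maps a cell line to its list of cell barcodes, ratios
    maps a cell line to a relative weight, and total is the target number
    of cells. Per-line counts are rounded, so the final size can differ
    slightly from total.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    weight_sum = sum(ratios.values())
    selected = {}
    for line, weight in ratios.items():
        n_cells = round(total * weight / weight_sum)
        pool = cells_by_line[line]
        if n_cells > len(pool):
            raise ValueError(f"{line}: {n_cells} cells requested, {len(pool)} available")
        selected[line] = rng.sample(pool, n_cells)  # sampling without replacement
    return selected


# Invented pools of 100 barcodes for two of the cell lines.
pools = {
    "A549": [f"A549_cell{i}" for i in range(100)],
    "PC9": [f"PC9_cell{i}" for i in range(100)],
}
mix = sample_by_ratio(pools, {"A549": 3, "PC9": 1}, total=40)
```

The selected barcodes would then be used to subset the columns of the count table, giving a mixture whose composition is known in advance.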
Processed data are available at: 10.6084/m9.figshare.23939481
For further information please contact raffaele dot calogero at unito dot it
The cell lines from BE1 will be used to generate surrogate tumor tissues for spatial transcriptomics, by embedding pools of the 7 cell lines at different ratios in Matrigel. For each cell ratio we will generate one slide (2 sections) using Visium for FFPE samples and six slides using the Curio Bioscience spatial platform for OCT fresh-frozen samples. In total we expect to produce three cell-line ratios:
Current state of the project: awaiting results from BE1.
Expected data availability: March 2024
The A549, CCL.185.IG, NCI-H596 (HTB178) and PC9 cell lines will be used to generate a combined scRNAseq and scATACseq experiment using the 10XGenomics multi-omics technology.
Current state of the project: the 10XGenomics multi-omics kit has been ordered.
Expected data availability: January 2024