vignettes/export_a_data_set_in_SCE_format.Rmd
export_a_data_set_in_SCE_format.Rmd
Here, I explain how one can export a data set stored in the SingleCellExperiment
/SCE
format to visualize it in Cerebro.
The workflow used here is loosely put together from the scran
and scater
examples. Do not use this workflow as a reference for how to properly process your data.
Before starting the workflow, we need to install cerebroApp
, as well as the scran
, scater
and SingleR
packages, which are not installed as dependencies of cerebroApp because they are only necessary if you want/need to pre-process your scRNA-seq data. If you just want to launch the Cerebro user interface, e.g. because you already have the pre-processed data, you don’t need those packages.
if ( 'BiocManager' %in% installed.packages() == FALSE ) install.packages('BiocManager')
library(BiocManager)
if ( 'cerebroApp' %in% installed.packages() == FALSE ) install('romanhaa/cerebroApp')
if ( 'scran' %in% installed.packages() == FALSE ) install('scran')
#>
#> The downloaded binary packages are in
#> /var/folders/7z/m3l07m8s6455y1mnykf307_c0000gn/T//RtmpcQU4bH/downloaded_packages
if ( 'scater' %in% installed.packages() == FALSE ) install('scater')
#>
#> The downloaded binary packages are in
#> /var/folders/7z/m3l07m8s6455y1mnykf307_c0000gn/T//RtmpcQU4bH/downloaded_packages
if ( 'SingleR' %in% installed.packages() == FALSE ) install('SingleR')
if ( 'monocle' %in% installed.packages() == FALSE ) install('monocle')
Apart from the packages we just installed, we will also load the dplyr
package and set a seed to make our analysis reproducible.
In this example workflow, we will load a small transcript count table from the Seurat
package containing 80 cells and 230 genes. If you want to try the workflow with a real data set, you can find some transcript count matrices on the Cerebro GitHub repo.
pbmc_counts <- read.table(
file = system.file('extdata', 'pbmc_raw.txt', package = 'Seurat'),
as.is = TRUE
) %>% as.matrix()
scran
and scater
Now, we create a SingleCellExperiment
object and filter out cells with less than 50
transcripts or fewer than 10
expressed genes. Those cut-offs are only reasonable for this example data set and will likely need to be adjusted in a real data set. Then, we…
sce <- SingleCellExperiment::SingleCellExperiment(
assays = list(counts = pbmc_counts)
)
qc_metrics <- perCellQCMetrics(sce)
colData(sce) <- cbind(colData(sce), qc_metrics)
cells_to_keep <- sce$sum >= 50 & sce$detected >= 10
sce <- sce[,cells_to_keep]
genes_to_keep <- nexprs(sce, byrow=TRUE) >= 5
sce <- sce[genes_to_keep,]
sce <- logNormCounts(sce)
variable_genes <- modelGeneVar(sce)
highly_variable_genes <- getTopHVGs(variable_genes, n = 50)
sce <- runPCA(sce)
Then, we identify clusters using a graph-based approach.
graph <- buildSNNGraph(sce, use.dimred = "PCA")
cluster <- igraph::cluster_walktrap(graph)$membership
sce$cluster <- factor(cluster)
Next, we use the UMAP technique to generate two- and three-dimensional projections.
This example data set consists of a single sample so we just add that name to the meta data. Moreover, in order to be able to understand how we did the analysis later, we add some meta data to the misc
slot of our SCE
object, e.g. an experiment name and the species. The info we store there will later be collected for visualization in Cerebro.
sce$sample <- factor('pbmc', levels = 'pbmc')
sce@metadata$experiment <- list(
experiment_name = 'pbmc',
organism = 'hg',
date_of_analysis = Sys.Date()
)
sce@metadata$parameters <- list(
gene_nomenclature = 'gene_name',
discard_genes_expressed_in_fewer_cells_than = 10,
keep_mitochondrial_genes = TRUE,
variables_to_regress_out = 'nUMI'
)
sce@metadata$parameters$filtering <- list(
UMI_min = 50,
UMI_max = Inf,
genes_min = 10,
genes_max = Inf
)
sce@metadata$technical_info$cerebroApp_version <- utils::packageVersion('cerebroApp')
sce@metadata$technical_info$scran <- utils::packageVersion('scran')
sce@metadata$technical_info$scater <- utils::packageVersion('scater')
sce@metadata$technical_info <- list(
'R' = capture.output(devtools::session_info())
)
Using the SingleR
package, we can get a suggestion of cell type for each cell. Apart from the suggested cell type, we also extract the score that might help us to understand how confident the assignment was.
singler_ref <- BlueprintEncodeData()
#> Warning: 'BlueprintEncodeData' is deprecated.
#> Use 'celldex::BlueprintEncodeData' instead.
#> See help("Deprecated")
#> using temporary cache /var/folders/7z/m3l07m8s6455y1mnykf307_c0000gn/T//RtmpcQU4bH/BiocFileCache
#> snapshotDate(): 2020-10-27
#> see ?celldex and browseVignettes('celldex') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> see ?celldex and browseVignettes('celldex') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
singler_results_blueprintencode_main <- SingleR(
test = logcounts(sce),
ref = singler_ref,
labels = singler_ref@colData@listData$label.main
)
sce$cell_type_singler_blueprintencode_main <- singler_results_blueprintencode_main@listData$labels
singler_scores <- singler_results_blueprintencode_main@listData$scores %>%
as_tibble() %>%
dplyr::mutate(assigned_score = NA)
for ( i in seq_len(nrow(singler_scores)) ) {
singler_scores$assigned_score[i] <- singler_scores[[singler_results_blueprintencode_main@listData$labels[i]]][i]
}
sce$cell_type_singler_blueprintencode_main_score <- singler_scores$assigned_score
##Identify marker genes
Marker genes are genes which are particularly strongly or weakly expressed in a given cell population, e.g. a cluster. This information can help to distinguish the role of different cell groups in the data set. Here, marker genes are identified for clusters and cell types using the findMarkers()
function from the scran
package. The results are separate tables for the respective subgroups, which we merge together in order to make them compatible with Cerebro.
sce@metadata$marker_genes <- list()
sce@metadata$marker_genes$scran <- list()
marker_genes_cluster <- findMarkers(sce, sce$cluster)
for ( i in seq_along(marker_genes_cluster) ) {
marker_genes_cluster[[i]] <- as.data.frame(marker_genes_cluster[[i]]) %>%
mutate(
cluster = names(marker_genes_cluster)[i],
gene = rownames(.)
)
marker_genes_cluster[[i]][[paste0('logFC.', names(marker_genes_cluster)[i])]] <- 0
}
marker_genes_cluster <- do.call(bind_rows, marker_genes_cluster) %>%
select(cluster, gene, Top, p.value, FDR, everything()) %>%
mutate(cluster = factor(cluster, levels = levels(colData(sce)$cluster)))
sce@metadata$marker_genes$scran$cluster <- marker_genes_cluster %>%
filter(!is.na(Top), FDR <= 0.1)
glimpse(marker_genes_cluster)
#> Rows: 876
#> Columns: 10
#> $ cluster <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ gene <chr> "HLA-DRA", "CST3", "GPX1", "TYMP", "HLA-DRB1", "PPBP", "CD79B", "LYZ", "IFI…
#> $ Top <int> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 11…
#> $ p.value <dbl> 7.673681e-23, 3.683285e-30, 2.667130e-17, 1.174240e-17, 2.632735e-11, 1.904…
#> $ FDR <dbl> 8.402681e-21, 8.066393e-28, 1.168203e-15, 7.557534e-16, 7.207111e-10, 5.957…
#> $ summary.logFC <dbl> -4.7467317, -3.6811223, -3.5493442, -1.6773267, -3.6868802, -5.6944230, -2.…
#> $ logFC.2 <dbl> -2.60629181, -3.68112226, -1.05420348, -1.67732667, -1.96732725, -0.0458601…
#> $ logFC.3 <dbl> -4.74673175, -0.12080297, -0.05119315, 0.13734308, -3.68688025, -0.05364607…
#> $ logFC.4 <dbl> -0.09773581, -0.79739926, -3.54934421, 0.13734308, 0.28378634, -5.69442304,…
#> $ logFC.1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
marker_genes_cluster %>% head() %>% knitr::kable()
cluster | gene | Top | p.value | FDR | summary.logFC | logFC.2 | logFC.3 | logFC.4 | logFC.1 |
---|---|---|---|---|---|---|---|---|---|
1 | HLA-DRA | 1 | 0 | 0 | -4.746732 | -2.6062918 | -4.7467317 | -0.0977358 | 0 |
1 | CST3 | 1 | 0 | 0 | -3.681122 | -3.6811223 | -0.1208030 | -0.7973993 | 0 |
1 | GPX1 | 1 | 0 | 0 | -3.549344 | -1.0542035 | -0.0511932 | -3.5493442 | 0 |
1 | TYMP | 2 | 0 | 0 | -1.677327 | -1.6773267 | 0.1373431 | 0.1373431 | 0 |
1 | HLA-DRB1 | 2 | 0 | 0 | -3.686880 | -1.9673272 | -3.6868802 | 0.2837863 | 0 |
1 | PPBP | 2 | 0 | 0 | -5.694423 | -0.0458601 | -0.0536461 | -5.6944230 | 0 |
marker_genes_cell_type <- findMarkers(sce, sce$cell_type_singler_blueprintencode_main)
#> Warning in FUN(...): no within-block comparison between Fibroblasts and B-cells
#> Warning in FUN(...): no within-block comparison between Fibroblasts and CD4+ T-cells
#> Warning in FUN(...): no within-block comparison between Fibroblasts and CD8+ T-cells
#> Warning in FUN(...): no within-block comparison between Fibroblasts and Endothelial cells
#> Warning in FUN(...): no within-block comparison between Fibroblasts and Erythrocytes
#> Warning in FUN(...): no within-block comparison between Monocytes and Fibroblasts
#> Warning in FUN(...): no within-block comparison between NK cells and Fibroblasts
for ( i in seq_along(marker_genes_cell_type) ) {
marker_genes_cell_type[[i]] <- as.data.frame(marker_genes_cell_type[[i]]) %>%
mutate(
cell_type_singler_blueprintencode_main = names(marker_genes_cell_type)[i],
gene = rownames(.)
)
marker_genes_cell_type[[i]][[paste0('logFC.', gsub(names(marker_genes_cell_type)[i], pattern = ' |\\-|\\+', replacement = '.'))]] <- 0
}
marker_genes_cell_type <- do.call(bind_rows, marker_genes_cell_type) %>%
select(cell_type_singler_blueprintencode_main, gene, Top, p.value, FDR, everything()) %>%
mutate(cell_type_singler_blueprintencode_main = factor(
cell_type_singler_blueprintencode_main,
levels = unique(cell_type_singler_blueprintencode_main)
)
)
sce@metadata$marker_genes$scran$cell_type_singler_blueprintencode_main <- marker_genes_cell_type %>%
filter(!is.na(Top), FDR <= 0.1)
glimpse(marker_genes_cell_type)
#> Rows: 1,752
#> Columns: 14
#> $ cell_type_singler_blueprintencode_main <fct> B-cells, B-cells, B-cells, B-cells, B-cells, B-cel…
#> $ gene <chr> "HLA-DRA", "HLA-DQB1", "TYROBP", "HLA-DRB1", "PPBP…
#> $ Top <int> 1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 6,…
#> $ p.value <dbl> 3.945181e-12, 4.288910e-08, 7.668231e-19, 2.381471…
#> $ FDR <dbl> 1.439991e-10, 6.709081e-07, 1.679343e-16, 6.519278…
#> $ summary.logFC <dbl> 4.786161, 1.870992, -2.981126, 3.538704, -5.670599…
#> $ logFC.CD4..T.cells <dbl> 4.53437330, 2.57924662, -0.53466751, 3.97066659, -…
#> $ logFC.CD8..T.cells <dbl> 4.9264915, 2.8249252, -0.5007120, 3.5898855, 0.120…
#> $ logFC.Endothelial.cells <dbl> 4.8046017, 2.8249252, 0.0000000, 3.9706666, -6.073…
#> $ logFC.Erythrocytes <dbl> 4.56757886, 2.82492521, 0.00000000, 3.97066659, -5…
#> $ logFC.Fibroblasts <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ logFC.Monocytes <dbl> 2.140439941, 1.870991975, -2.981126027, 1.71955299…
#> $ logFC.NK.cells <dbl> 4.78616075, 2.82492521, -2.34779876, 3.53870363, 0…
#> $ logFC.B.cells <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
marker_genes_cell_type %>% select(1:6) %>% head() %>% knitr::kable(booktabs = T)
cell_type_singler_blueprintencode_main | gene | Top | p.value | FDR | summary.logFC |
---|---|---|---|---|---|
B-cells | HLA-DRA | 1 | 0 | 0e+00 | 4.786161 |
B-cells | HLA-DQB1 | 1 | 0 | 7e-07 | 1.870992 |
B-cells | TYROBP | 1 | 0 | 0e+00 | -2.981126 |
B-cells | HLA-DRB1 | 1 | 0 | 1e-07 | 3.538704 |
B-cells | PPBP | 1 | 0 | 0e+00 | -5.670599 |
B-cells | CD79B | 2 | 0 | 1e-07 | 2.716659 |
Finally, we use the exportFromSeurat()
function of cerebroApp to export our Seurat object to a .crb
file which can be loaded into Cerebro.
exportFromSCE(
sce,
assay = 'logcounts',
file = paste0('cerebro_pbmc_SCE_', Sys.Date(), '.crb'),
experiment_name = 'pbmc',
organism = 'hg',
groups = c('sample','cluster','cell_type_singler_blueprintencode_main'),
nUMI = 'sum',
nGene = 'detected',
add_all_meta_data = TRUE,
verbose = FALSE
)
#> [17:58:05] Start collecting data...
#> Warning in if (class(expression_data) == "try-error") {: the condition has length > 1 and only the
#> first element will be used
#> [17:58:05] Overview of Cerebro object:
#> class: Cerebro_v1.3
#> cerebroApp version: 1.3.1
#> experiment name: pbmc
#> organism: hg
#> date of analysis: 2021-03-12
#> date of export: 2021-03-12
#> number of cells: 79
#> number of genes: 219
#> grouping variables (3): sample, cluster, cell_type_singler_blueprintencode_main
#> cell cycle variables (0):
#> projections (2): UMAP, UMAP_3D
#> trees (0):
#> most expressed genes:
#> marker genes:
#> - scran (2): cluster, cell_type_singler_blueprintencode_main
#> enriched pathways:
#> trajectories:
#> extra material:
#> [17:58:05] Saving Cerebro object to: cerebro_pbmc_SCE_2021-03-12.crb
#> [17:58:05] Done!
The Cerebro use interface can be launched using the launchCerebro()
function.
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] celldex_1.0.0 cerebroApp_1.3.1 SingleR_1.4.1
#> [4] scater_1.18.6 ggplot2_3.3.2 scran_1.18.5
#> [7] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 Biobase_2.50.0
#> [10] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2 IRanges_2.24.1
#> [13] S4Vectors_0.28.1 BiocGenerics_0.36.0 MatrixGenerics_1.2.0
#> [16] matrixStats_0.57.0 dplyr_1.0.2 BiocManager_1.30.10
#>
#> loaded via a namespace (and not attached):
#> [1] AnnotationHub_2.22.0 BiocFileCache_1.14.0 systemfonts_1.0.1
#> [4] plyr_1.8.6 igraph_1.2.6 lazyeval_0.2.2
#> [7] splines_4.0.3 GSEABase_1.52.1 shinydashboard_0.7.1
#> [10] BiocParallel_1.24.1 listenv_0.8.0 usethis_2.0.0
#> [13] digest_0.6.27 htmltools_0.5.1.1 viridis_0.5.1
#> [16] fansi_0.4.1 magrittr_2.0.1 memoise_1.1.0
#> [19] remotes_2.2.0 limma_3.46.0 shinyFiles_0.9.0
#> [22] readr_1.4.0 globals_0.14.0 annotate_1.68.0
#> [25] askpass_1.1 pkgdown_1.6.1 prettyunits_1.1.1
#> [28] colorspace_2.0-0 blob_1.2.1 rappdirs_0.3.1
#> [31] textshaping_0.3.0 xfun_0.19 callr_3.5.1
#> [34] crayon_1.3.4 RCurl_1.98-1.2 jsonlite_1.7.2
#> [37] graph_1.68.0 ape_5.4-1 glue_1.4.2
#> [40] gtable_0.3.0 zlibbioc_1.36.0 XVector_0.30.0
#> [43] DelayedArray_0.16.1 pkgbuild_1.2.0 BiocSingular_1.6.0
#> [46] future.apply_1.6.0 scales_1.1.1 msigdbr_7.2.1
#> [49] DBI_1.1.0 edgeR_3.32.1 miniUI_0.1.1.1
#> [52] Rcpp_1.0.5 viridisLite_0.3.0 xtable_1.8-4
#> [55] progress_1.2.2 dqrng_0.2.1 bit_4.0.4
#> [58] rsvd_1.0.3 GSVA_1.38.0 DT_0.16
#> [61] htmlwidgets_1.5.3 httr_1.4.2 FNN_1.1.3
#> [64] ellipsis_0.3.1 pkgconfig_2.0.3 XML_3.99-0.5
#> [67] scuttle_1.0.4 uwot_0.1.10 dbplyr_2.0.0
#> [70] utf8_1.1.4 locfit_1.5-9.4 reshape2_1.4.4
#> [73] tidyselect_1.1.0 rlang_0.4.9 later_1.1.0.1
#> [76] AnnotationDbi_1.52.0 BiocVersion_3.12.0 munsell_0.5.0
#> [79] tools_4.0.3 cli_2.2.0 ExperimentHub_1.16.0
#> [82] generics_0.1.0 RSQLite_2.2.1 devtools_2.3.2
#> [85] evaluate_0.14 stringr_1.4.0 fastmap_1.0.1
#> [88] yaml_2.2.1 ragg_1.1.0 processx_3.4.5
#> [91] knitr_1.30 bit64_4.0.5 fs_1.5.0
#> [94] shinycssloaders_1.0.0 purrr_0.3.4 pbapply_1.4-3
#> [97] future_1.21.0 nlme_3.1-151 sparseMatrixStats_1.2.1
#> [100] mime_0.9 xml2_1.3.2 biomaRt_2.46.0
#> [103] compiler_4.0.3 interactiveDisplayBase_1.28.0 beeswarm_0.3.1
#> [106] plotly_4.9.2.2 curl_4.3 testthat_3.0.1
#> [109] tibble_3.0.4 statmod_1.4.35 stringi_1.5.3
#> [112] highr_0.8 ps_1.5.0 RSpectra_0.16-0
#> [115] desc_1.2.0 lattice_0.20-41 bluster_1.0.0
#> [118] Matrix_1.3-0 shinyjs_2.0.0 vctrs_0.3.6
#> [121] pillar_1.4.7 lifecycle_0.2.0 BiocNeighbors_1.8.2
#> [124] data.table_1.13.4 bitops_1.0-6 irlba_2.3.3
#> [127] qvalue_2.22.0 httpuv_1.5.4 R6_2.5.0
#> [130] promises_1.1.1 gridExtra_2.3 vipor_0.4.5
#> [133] parallelly_1.22.0 sessioninfo_1.1.1 codetools_0.2-18
#> [136] pkgload_1.1.0 colourpicker_1.1.0 assertthat_0.2.1
#> [139] openssl_1.4.3 rprojroot_2.0.2 shinyWidgets_0.5.4
#> [142] withr_2.3.0 GenomeInfoDbData_1.2.4 hms_0.5.3
#> [145] grid_4.0.3 beachmat_2.6.4 tidyr_1.1.2
#> [148] rmarkdown_2.6 DelayedMatrixStats_1.12.3 shiny_1.5.0
#> [151] ggbeeswarm_0.6.0