Overview

Here, I explain how one can export a data set stored in the SingleCellExperiment/SCE format to visualize it in Cerebro.

The workflow used here is loosely put together from the scran and scater examples. Do not use this workflow as a reference for how to properly process your data.

Installation

Before starting the workflow, we need to install cerebroApp, as well as the scran, scater and SingleR packages, which are not installed as dependencies of cerebroApp because they are only necessary if you want/need to pre-process your scRNA-seq data. If you just want to launch the Cerebro user interface, e.g. because you already have the pre-processed data, you don’t need those packages.

if ( 'BiocManager' %in% installed.packages() == FALSE ) install.packages('BiocManager')

library(BiocManager)

if ( 'cerebroApp' %in% installed.packages() == FALSE ) install('romanhaa/cerebroApp')
if ( 'scran' %in% installed.packages() == FALSE ) install('scran')
#> 
#> The downloaded binary packages are in
#>  /var/folders/7z/m3l07m8s6455y1mnykf307_c0000gn/T//RtmpcQU4bH/downloaded_packages
if ( 'scater' %in% installed.packages() == FALSE ) install('scater')
#> 
#> The downloaded binary packages are in
#>  /var/folders/7z/m3l07m8s6455y1mnykf307_c0000gn/T//RtmpcQU4bH/downloaded_packages
if ( 'SingleR' %in% installed.packages() == FALSE ) install('SingleR')
if ( 'monocle' %in% installed.packages() == FALSE ) install('monocle')

Load packages

Apart from the packages we just installed, we will also load the dplyr package and set a seed to make our analysis reproducible.

Load count data

In this example workflow, we will load a small transcript count table from the Seurat package containing 80 cells and 230 genes. If you want to try the workflow with a real data set, you can find some transcript count matrices on the Cerebro GitHub repo.

pbmc_counts <- read.table(
    file = system.file('extdata', 'pbmc_raw.txt', package = 'Seurat'),
    as.is = TRUE
  ) %>% as.matrix()

Pre-processing with scran and scater

Now, we create a SingleCellExperiment object and filter out cells with less than 50 transcripts or fewer than 10 expressed genes. Those cut-offs are only reasonable for this example data set and will likely need to be adjusted in a real data set. Then, we…

  • remove genes expressed in less than 5 cells,
  • log-normalize transcript counts,
  • identify highly variably genes,
  • perform principal component analysis (PCA).
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(counts = pbmc_counts)
)

qc_metrics <- perCellQCMetrics(sce)
colData(sce) <- cbind(colData(sce), qc_metrics)

cells_to_keep <- sce$sum >= 50 & sce$detected >= 10
sce <- sce[,cells_to_keep]

genes_to_keep <- nexprs(sce, byrow=TRUE) >= 5
sce <- sce[genes_to_keep,]

sce <- logNormCounts(sce)

variable_genes <- modelGeneVar(sce)
highly_variable_genes <- getTopHVGs(variable_genes, n = 50)

sce <- runPCA(sce)

Clustering

Then, we identify clusters using a graph-based approach.

graph <- buildSNNGraph(sce, use.dimred = "PCA")
cluster <- igraph::cluster_walktrap(graph)$membership
sce$cluster <- factor(cluster)

Dimensional reduction

Next, we use the UMAP technique to generate two- and three-dimensional projections.

sce <- runUMAP(
  sce,
  name = 'UMAP',
  ncomponents = 2,
)

sce <- runUMAP(
  sce,
  name = 'UMAP_3D',
  ncomponents = 3,
)

Curate meta data

This example data set consists of a single sample so we just add that name to the meta data. Moreover, in order to be able to understand how we did the analysis later, we add some meta data to the misc slot of our SCE object, e.g. an experiment name and the species. The info we store there will later be collected for visualization in Cerebro.

sce$sample <- factor('pbmc', levels = 'pbmc')

sce@metadata$experiment <- list(
  experiment_name = 'pbmc',
  organism = 'hg',
  date_of_analysis = Sys.Date()
)

sce@metadata$parameters <- list(
  gene_nomenclature = 'gene_name',
  discard_genes_expressed_in_fewer_cells_than = 10,
  keep_mitochondrial_genes = TRUE,
  variables_to_regress_out = 'nUMI'
)

sce@metadata$parameters$filtering <- list(
  UMI_min = 50,
  UMI_max = Inf,
  genes_min = 10,
  genes_max = Inf
)

sce@metadata$technical_info$cerebroApp_version <- utils::packageVersion('cerebroApp')
sce@metadata$technical_info$scran <- utils::packageVersion('scran')
sce@metadata$technical_info$scater <- utils::packageVersion('scater')
sce@metadata$technical_info <- list(
  'R' = capture.output(devtools::session_info())
)

Assign cell types

Using the SingleR package, we can get a suggestion of cell type for each cell. Apart from the suggested cell type, we also extract the score that might help us to understand how confident the assignment was.

singler_ref <- BlueprintEncodeData()
#> Warning: 'BlueprintEncodeData' is deprecated.
#> Use 'celldex::BlueprintEncodeData' instead.
#> See help("Deprecated")
#> using temporary cache /var/folders/7z/m3l07m8s6455y1mnykf307_c0000gn/T//RtmpcQU4bH/BiocFileCache
#> snapshotDate(): 2020-10-27
#> see ?celldex and browseVignettes('celldex') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> see ?celldex and browseVignettes('celldex') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache

singler_results_blueprintencode_main <- SingleR(
  test = logcounts(sce),
  ref = singler_ref,
  labels = singler_ref@colData@listData$label.main
)

sce$cell_type_singler_blueprintencode_main <- singler_results_blueprintencode_main@listData$labels

singler_scores <- singler_results_blueprintencode_main@listData$scores %>%
  as_tibble() %>%
  dplyr::mutate(assigned_score = NA)

for ( i in seq_len(nrow(singler_scores)) ) {
  singler_scores$assigned_score[i] <- singler_scores[[singler_results_blueprintencode_main@listData$labels[i]]][i]
}

sce$cell_type_singler_blueprintencode_main_score <- singler_scores$assigned_score

##Identify marker genes

Marker genes are genes which are particularly strongly or weakly expressed in a given cell population, e.g. a cluster. This information can help to distinguish the role of different cell groups in the data set. Here, marker genes are identified for clusters and cell types using the findMarkers() function from the scran package. The results are separate tables for the respective subgroups, which we merge together in order to make them compatible with Cerebro.

Clusters

sce@metadata$marker_genes <- list()
sce@metadata$marker_genes$scran <- list()

marker_genes_cluster <- findMarkers(sce, sce$cluster)

for ( i in seq_along(marker_genes_cluster) ) {
  marker_genes_cluster[[i]] <- as.data.frame(marker_genes_cluster[[i]]) %>%
    mutate(
      cluster = names(marker_genes_cluster)[i],
      gene = rownames(.)
    )
  marker_genes_cluster[[i]][[paste0('logFC.', names(marker_genes_cluster)[i])]] <- 0
}

marker_genes_cluster <- do.call(bind_rows, marker_genes_cluster) %>%
  select(cluster, gene, Top, p.value, FDR, everything()) %>%
  mutate(cluster = factor(cluster, levels = levels(colData(sce)$cluster)))

sce@metadata$marker_genes$scran$cluster <- marker_genes_cluster %>%
  filter(!is.na(Top), FDR <= 0.1)

glimpse(marker_genes_cluster)
#> Rows: 876
#> Columns: 10
#> $ cluster       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ gene          <chr> "HLA-DRA", "CST3", "GPX1", "TYMP", "HLA-DRB1", "PPBP", "CD79B", "LYZ", "IFI…
#> $ Top           <int> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 11…
#> $ p.value       <dbl> 7.673681e-23, 3.683285e-30, 2.667130e-17, 1.174240e-17, 2.632735e-11, 1.904…
#> $ FDR           <dbl> 8.402681e-21, 8.066393e-28, 1.168203e-15, 7.557534e-16, 7.207111e-10, 5.957…
#> $ summary.logFC <dbl> -4.7467317, -3.6811223, -3.5493442, -1.6773267, -3.6868802, -5.6944230, -2.…
#> $ logFC.2       <dbl> -2.60629181, -3.68112226, -1.05420348, -1.67732667, -1.96732725, -0.0458601…
#> $ logFC.3       <dbl> -4.74673175, -0.12080297, -0.05119315, 0.13734308, -3.68688025, -0.05364607…
#> $ logFC.4       <dbl> -0.09773581, -0.79739926, -3.54934421, 0.13734308, 0.28378634, -5.69442304,…
#> $ logFC.1       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

marker_genes_cluster %>% head() %>% knitr::kable()
cluster gene Top p.value FDR summary.logFC logFC.2 logFC.3 logFC.4 logFC.1
1 HLA-DRA 1 0 0 -4.746732 -2.6062918 -4.7467317 -0.0977358 0
1 CST3 1 0 0 -3.681122 -3.6811223 -0.1208030 -0.7973993 0
1 GPX1 1 0 0 -3.549344 -1.0542035 -0.0511932 -3.5493442 0
1 TYMP 2 0 0 -1.677327 -1.6773267 0.1373431 0.1373431 0
1 HLA-DRB1 2 0 0 -3.686880 -1.9673272 -3.6868802 0.2837863 0
1 PPBP 2 0 0 -5.694423 -0.0458601 -0.0536461 -5.6944230 0

Cell types

marker_genes_cell_type <- findMarkers(sce, sce$cell_type_singler_blueprintencode_main)
#> Warning in FUN(...): no within-block comparison between Fibroblasts and B-cells
#> Warning in FUN(...): no within-block comparison between Fibroblasts and CD4+ T-cells
#> Warning in FUN(...): no within-block comparison between Fibroblasts and CD8+ T-cells
#> Warning in FUN(...): no within-block comparison between Fibroblasts and Endothelial cells
#> Warning in FUN(...): no within-block comparison between Fibroblasts and Erythrocytes
#> Warning in FUN(...): no within-block comparison between Monocytes and Fibroblasts
#> Warning in FUN(...): no within-block comparison between NK cells and Fibroblasts

for ( i in seq_along(marker_genes_cell_type) ) {
  marker_genes_cell_type[[i]] <- as.data.frame(marker_genes_cell_type[[i]]) %>%
    mutate(
      cell_type_singler_blueprintencode_main = names(marker_genes_cell_type)[i],
      gene = rownames(.)
    )
  marker_genes_cell_type[[i]][[paste0('logFC.', gsub(names(marker_genes_cell_type)[i], pattern = ' |\\-|\\+', replacement = '.'))]] <- 0
}

marker_genes_cell_type <- do.call(bind_rows, marker_genes_cell_type) %>%
  select(cell_type_singler_blueprintencode_main, gene, Top, p.value, FDR, everything()) %>%
  mutate(cell_type_singler_blueprintencode_main = factor(
      cell_type_singler_blueprintencode_main,
      levels = unique(cell_type_singler_blueprintencode_main)
    )
  )

sce@metadata$marker_genes$scran$cell_type_singler_blueprintencode_main <- marker_genes_cell_type %>%
  filter(!is.na(Top), FDR <= 0.1)

glimpse(marker_genes_cell_type)
#> Rows: 1,752
#> Columns: 14
#> $ cell_type_singler_blueprintencode_main <fct> B-cells, B-cells, B-cells, B-cells, B-cells, B-cel…
#> $ gene                                   <chr> "HLA-DRA", "HLA-DQB1", "TYROBP", "HLA-DRB1", "PPBP…
#> $ Top                                    <int> 1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 6,…
#> $ p.value                                <dbl> 3.945181e-12, 4.288910e-08, 7.668231e-19, 2.381471…
#> $ FDR                                    <dbl> 1.439991e-10, 6.709081e-07, 1.679343e-16, 6.519278…
#> $ summary.logFC                          <dbl> 4.786161, 1.870992, -2.981126, 3.538704, -5.670599…
#> $ logFC.CD4..T.cells                     <dbl> 4.53437330, 2.57924662, -0.53466751, 3.97066659, -…
#> $ logFC.CD8..T.cells                     <dbl> 4.9264915, 2.8249252, -0.5007120, 3.5898855, 0.120…
#> $ logFC.Endothelial.cells                <dbl> 4.8046017, 2.8249252, 0.0000000, 3.9706666, -6.073…
#> $ logFC.Erythrocytes                     <dbl> 4.56757886, 2.82492521, 0.00000000, 3.97066659, -5…
#> $ logFC.Fibroblasts                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ logFC.Monocytes                        <dbl> 2.140439941, 1.870991975, -2.981126027, 1.71955299…
#> $ logFC.NK.cells                         <dbl> 4.78616075, 2.82492521, -2.34779876, 3.53870363, 0…
#> $ logFC.B.cells                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

marker_genes_cell_type %>% select(1:6) %>% head() %>% knitr::kable(booktabs = T)
cell_type_singler_blueprintencode_main gene Top p.value FDR summary.logFC
B-cells HLA-DRA 1 0 0e+00 4.786161
B-cells HLA-DQB1 1 0 7e-07 1.870992
B-cells TYROBP 1 0 0e+00 -2.981126
B-cells HLA-DRB1 1 0 1e-07 3.538704
B-cells PPBP 1 0 0e+00 -5.670599
B-cells CD79B 2 0 1e-07 2.716659

Export to Cerebro format

Finally, we use the exportFromSeurat() function of cerebroApp to export our Seurat object to a .crb file which can be loaded into Cerebro.

exportFromSCE(
  sce,
  assay = 'logcounts',
  file = paste0('cerebro_pbmc_SCE_', Sys.Date(), '.crb'),
  experiment_name = 'pbmc',
  organism = 'hg',
  groups = c('sample','cluster','cell_type_singler_blueprintencode_main'),
  nUMI = 'sum',
  nGene = 'detected',
  add_all_meta_data = TRUE,
  verbose = FALSE
)
#> [17:58:05] Start collecting data...
#> Warning in if (class(expression_data) == "try-error") {: the condition has length > 1 and only the
#> first element will be used
#> [17:58:05] Overview of Cerebro object:
#> class: Cerebro_v1.3
#> cerebroApp version: 1.3.1
#> experiment name: pbmc
#> organism: hg
#> date of analysis: 2021-03-12
#> date of export: 2021-03-12
#> number of cells: 79
#> number of genes: 219
#> grouping variables (3): sample, cluster, cell_type_singler_blueprintencode_main
#> cell cycle variables (0): 
#> projections (2): UMAP, UMAP_3D
#> trees (0): 
#> most expressed genes: 
#> marker genes:
#>   - scran (2): cluster, cell_type_singler_blueprintencode_main
#> enriched pathways:
#> trajectories:
#> extra material:
#> [17:58:05] Saving Cerebro object to: cerebro_pbmc_SCE_2021-03-12.crb
#> [17:58:05] Done!

The Cerebro use interface can be launched using the launchCerebro() function.

Session info

sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] celldex_1.0.0               cerebroApp_1.3.1            SingleR_1.4.1              
#>  [4] scater_1.18.6               ggplot2_3.3.2               scran_1.18.5               
#>  [7] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0 Biobase_2.50.0             
#> [10] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2         IRanges_2.24.1             
#> [13] S4Vectors_0.28.1            BiocGenerics_0.36.0         MatrixGenerics_1.2.0       
#> [16] matrixStats_0.57.0          dplyr_1.0.2                 BiocManager_1.30.10        
#> 
#> loaded via a namespace (and not attached):
#>   [1] AnnotationHub_2.22.0          BiocFileCache_1.14.0          systemfonts_1.0.1            
#>   [4] plyr_1.8.6                    igraph_1.2.6                  lazyeval_0.2.2               
#>   [7] splines_4.0.3                 GSEABase_1.52.1               shinydashboard_0.7.1         
#>  [10] BiocParallel_1.24.1           listenv_0.8.0                 usethis_2.0.0                
#>  [13] digest_0.6.27                 htmltools_0.5.1.1             viridis_0.5.1                
#>  [16] fansi_0.4.1                   magrittr_2.0.1                memoise_1.1.0                
#>  [19] remotes_2.2.0                 limma_3.46.0                  shinyFiles_0.9.0             
#>  [22] readr_1.4.0                   globals_0.14.0                annotate_1.68.0              
#>  [25] askpass_1.1                   pkgdown_1.6.1                 prettyunits_1.1.1            
#>  [28] colorspace_2.0-0              blob_1.2.1                    rappdirs_0.3.1               
#>  [31] textshaping_0.3.0             xfun_0.19                     callr_3.5.1                  
#>  [34] crayon_1.3.4                  RCurl_1.98-1.2                jsonlite_1.7.2               
#>  [37] graph_1.68.0                  ape_5.4-1                     glue_1.4.2                   
#>  [40] gtable_0.3.0                  zlibbioc_1.36.0               XVector_0.30.0               
#>  [43] DelayedArray_0.16.1           pkgbuild_1.2.0                BiocSingular_1.6.0           
#>  [46] future.apply_1.6.0            scales_1.1.1                  msigdbr_7.2.1                
#>  [49] DBI_1.1.0                     edgeR_3.32.1                  miniUI_0.1.1.1               
#>  [52] Rcpp_1.0.5                    viridisLite_0.3.0             xtable_1.8-4                 
#>  [55] progress_1.2.2                dqrng_0.2.1                   bit_4.0.4                    
#>  [58] rsvd_1.0.3                    GSVA_1.38.0                   DT_0.16                      
#>  [61] htmlwidgets_1.5.3             httr_1.4.2                    FNN_1.1.3                    
#>  [64] ellipsis_0.3.1                pkgconfig_2.0.3               XML_3.99-0.5                 
#>  [67] scuttle_1.0.4                 uwot_0.1.10                   dbplyr_2.0.0                 
#>  [70] utf8_1.1.4                    locfit_1.5-9.4                reshape2_1.4.4               
#>  [73] tidyselect_1.1.0              rlang_0.4.9                   later_1.1.0.1                
#>  [76] AnnotationDbi_1.52.0          BiocVersion_3.12.0            munsell_0.5.0                
#>  [79] tools_4.0.3                   cli_2.2.0                     ExperimentHub_1.16.0         
#>  [82] generics_0.1.0                RSQLite_2.2.1                 devtools_2.3.2               
#>  [85] evaluate_0.14                 stringr_1.4.0                 fastmap_1.0.1                
#>  [88] yaml_2.2.1                    ragg_1.1.0                    processx_3.4.5               
#>  [91] knitr_1.30                    bit64_4.0.5                   fs_1.5.0                     
#>  [94] shinycssloaders_1.0.0         purrr_0.3.4                   pbapply_1.4-3                
#>  [97] future_1.21.0                 nlme_3.1-151                  sparseMatrixStats_1.2.1      
#> [100] mime_0.9                      xml2_1.3.2                    biomaRt_2.46.0               
#> [103] compiler_4.0.3                interactiveDisplayBase_1.28.0 beeswarm_0.3.1               
#> [106] plotly_4.9.2.2                curl_4.3                      testthat_3.0.1               
#> [109] tibble_3.0.4                  statmod_1.4.35                stringi_1.5.3                
#> [112] highr_0.8                     ps_1.5.0                      RSpectra_0.16-0              
#> [115] desc_1.2.0                    lattice_0.20-41               bluster_1.0.0                
#> [118] Matrix_1.3-0                  shinyjs_2.0.0                 vctrs_0.3.6                  
#> [121] pillar_1.4.7                  lifecycle_0.2.0               BiocNeighbors_1.8.2          
#> [124] data.table_1.13.4             bitops_1.0-6                  irlba_2.3.3                  
#> [127] qvalue_2.22.0                 httpuv_1.5.4                  R6_2.5.0                     
#> [130] promises_1.1.1                gridExtra_2.3                 vipor_0.4.5                  
#> [133] parallelly_1.22.0             sessioninfo_1.1.1             codetools_0.2-18             
#> [136] pkgload_1.1.0                 colourpicker_1.1.0            assertthat_0.2.1             
#> [139] openssl_1.4.3                 rprojroot_2.0.2               shinyWidgets_0.5.4           
#> [142] withr_2.3.0                   GenomeInfoDbData_1.2.4        hms_0.5.3                    
#> [145] grid_4.0.3                    beachmat_2.6.4                tidyr_1.1.2                  
#> [148] rmarkdown_2.6                 DelayedMatrixStats_1.12.3     shiny_1.5.0                  
#> [151] ggbeeswarm_0.6.0