Note: if you use BayesPrism in published research, please cite:

Chu, T., Wang, Z., Pe’er, D. Danko, C. G. Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology. Nat Cancer 3, 505–517 (2022). https://doi.org/10.1038/s43018-022-00356-3

Introduction

BayesPrism performs cell type and gene expression deconvolution for bulk RNA-seq (and spatial transcriptomics) using scRNA-seq of samples collected from matched or similar tissue type. It treats the scRNA-seq as prior information, and estimates \(\mathrm{P}(\theta, Z | X, \phi)\) , i.e. the joint posterior distribution of cell type fraction \(\theta\) and cell type-specific gene expression \(Z\) in each bulk conditional on the reference \(\phi\) and each observed bulk \(X\).

We will demonstrate how to use BayesPrism to deconvolve the bulk RNA-seq dataset TCGA-GBM, using the scRNA-seq dataset collected from 8 hold-out cohorts by Yuan et al.. (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-018-0567-9) .

Load the BayesPrism package

suppressWarnings(library(BayesPrism))
#> Loading required package: snowfall
#> Loading required package: snow
#> Loading required package: NMF
#> Loading required package: pkgmaker
#> Loading required package: registry
#> Loading required package: rngtools
#> Loading required package: cluster
#> NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 71/72
#>   To enable shared memory capabilities, try: install.extras('
#> NMF
#> ')

Load the dataset

The rdata file is stored at "your_git_repository/tutorial.dat/tutorial.gbm.rdata"

load("../tutorial.dat/tutorial.gbm.rdata")
ls()
#> [1] "bk.dat"            "cell.state.labels" "cell.type.labels"  "sc.dat"
dim(bk.dat)
#> [1]   169 60483
head(rownames(bk.dat))
#> [1] "TCGA-06-2563-01A-01R-1849-01" "TCGA-06-0749-01A-01R-1849-01" "TCGA-06-5418-01A-01R-1849-01" "TCGA-06-0211-01B-01R-1849-01" "TCGA-19-2625-01A-01R-1850-01" "TCGA-19-4065-02A-11R-2005-01"
head(colnames(bk.dat))
#> [1] "ENSG00000000003.13" "ENSG00000000005.5"  "ENSG00000000419.11" "ENSG00000000457.12" "ENSG00000000460.15" "ENSG00000000938.11"

The rdata file contains four objects required for running BayesPrism.

  • bk.dat: The sample-by-gene raw count matrix of bulk RNA-seq expression. rownames are bulk sample IDs, while colnames are gene names/IDs.
dim(bk.dat)
#> [1]   169 60483
head(rownames(bk.dat))
#> [1] "TCGA-06-2563-01A-01R-1849-01" "TCGA-06-0749-01A-01R-1849-01" "TCGA-06-5418-01A-01R-1849-01" "TCGA-06-0211-01B-01R-1849-01" "TCGA-19-2625-01A-01R-1850-01" "TCGA-19-4065-02A-11R-2005-01"
head(colnames(bk.dat))
#> [1] "ENSG00000000003.13" "ENSG00000000005.5"  "ENSG00000000419.11" "ENSG00000000457.12" "ENSG00000000460.15" "ENSG00000000938.11"
  • sc.dat: The cell-by-gene raw count matrix of bulk RNA-seq expression. rownames are bulk cell IDs, while colnames are gene names/IDs.
dim(sc.dat)
#> [1] 23793 60294
head(rownames(sc.dat))
#> [1] "PJ016.V3" "PJ016.V4" "PJ016.V5" "PJ016.V6" "PJ016.V7" "PJ016.V8"
head(colnames(sc.dat))
#> [1] "ENSG00000130876.10" "ENSG00000134438.9"  "ENSG00000269696.1"  "ENSG00000230393.1"  "ENSG00000266744.1"  "ENSG00000202281.1"
  • Note that BayesPrism will perform deconvolution over genes shared between the mixture and scRNA-seq reference, i.e., by intersecting colnames(mixture) and colnames(reference). Therefore, please make sure to use consistent gene annotations (TCGA RNA-seq uses GENCODE v22).

  • We recommend the use of unnormalized and untransformed count data. When raw count is not available, linear normalization, such as TPM, RPM, RPKM, FPKM, is also acceptable, as BayesPrism is robust to linear multiplicative difference between the reference and mixture. Ideally, if using normalized data, it is the best to supply reference and bulk transformed by the same method. log transformation should be avoided.

  • cell.type.labels is a character vector of the same length as nrow(sc.dat) to denote the cell type of each cell in the reference.

sort(table(cell.type.labels))
#> cell.type.labels
#>       tcell       oligo    pericyte endothelial     myeloid       tumor 
#>          67         160         489         492        2505       20080
  • cell.state.labels is a character vector of the same length as nrow(sc.dat) to denote the cell state of each cell in the reference. In our example, cell states of malignant cells were obtained by sub-clustering the malignant cells from each patient, and cell states of myeloid cells were obtained by clustering myeloid cells from all patients. We define multiple cell states for these two cell types, as they contain substantial heterogeneity while also having sufficient number of cells for sub-clustering.
sort(table(cell.state.labels))
#> cell.state.labels
#> PJ017-tumor-6 PJ032-tumor-5     myeloid_8 PJ032-tumor-4 PJ032-tumor-3 PJ017-tumor-5         tcell PJ032-tumor-2 PJ030-tumor-5     myeloid_7 PJ035-tumor-8 PJ017-tumor-4 PJ017-tumor-3 PJ017-tumor-2 PJ017-tumor-0 PJ017-tumor-1 PJ018-tumor-5     myeloid_6     myeloid_5 PJ035-tumor-7         oligo PJ048-tumor-8 PJ032-tumor-1 PJ032-tumor-0 PJ048-tumor-7 PJ025-tumor-9 PJ035-tumor-6 PJ048-tumor-6 PJ016-tumor-6 PJ018-tumor-4     myeloid_4 PJ048-tumor-5 PJ048-tumor-4 PJ016-tumor-5 PJ025-tumor-8 PJ048-tumor-3 PJ035-tumor-5 PJ030-tumor-4 PJ018-tumor-3 PJ016-tumor-4 PJ025-tumor-7     myeloid_3 PJ035-tumor-4     myeloid_2 PJ025-tumor-6 PJ018-tumor-2 PJ030-tumor-3 PJ016-tumor-3 PJ025-tumor-5 PJ030-tumor-2 PJ018-tumor-1 PJ035-tumor-3 PJ048-tumor-2 PJ018-tumor-0 PJ048-tumor-1 PJ035-tumor-2 PJ035-tumor-1 PJ016-tumor-2 PJ030-tumor-1      pericyte   endothelial PJ035-tumor-0 PJ025-tumor-4     myeloid_1 PJ048-tumor-0     myeloid_0 PJ030-tumor-0 PJ025-tumor-3 PJ016-tumor-1 PJ016-tumor-0 PJ025-tumor-2 PJ025-tumor-1 PJ025-tumor-0 
#>            22            41            49            57            62            64            67            72            73            75            81            83            89           101           107           107           113           130           141           150           160           169           171           195           228           236           241           244           261           262           266           277           303           308           319           333           334           348           361           375           381           382           385           386           397           403           419           420           421           425           429           435           437           444           463           471           474           481           482           489           492           512           523           526           545           550           563           601           619           621           630           941           971
table(cbind.data.frame(cell.state.labels, cell.type.labels))
#>                  cell.type.labels
#> cell.state.labels endothelial myeloid oligo pericyte tcell tumor
#>     endothelial           492       0     0        0     0     0
#>     myeloid_0               0     550     0        0     0     0
#>     myeloid_1               0     526     0        0     0     0
#>     myeloid_2               0     386     0        0     0     0
#>     myeloid_3               0     382     0        0     0     0
#>     myeloid_4               0     266     0        0     0     0
#>     myeloid_5               0     141     0        0     0     0
#>     myeloid_6               0     130     0        0     0     0
#>     myeloid_7               0      75     0        0     0     0
#>     myeloid_8               0      49     0        0     0     0
#>     oligo                   0       0   160        0     0     0
#>     pericyte                0       0     0      489     0     0
#>     PJ016-tumor-0           0       0     0        0     0   621
#>     PJ016-tumor-1           0       0     0        0     0   619
#>     PJ016-tumor-2           0       0     0        0     0   481
#>     PJ016-tumor-3           0       0     0        0     0   420
#>     PJ016-tumor-4           0       0     0        0     0   375
#>     PJ016-tumor-5           0       0     0        0     0   308
#>     PJ016-tumor-6           0       0     0        0     0   261
#>     PJ017-tumor-0           0       0     0        0     0   107
#>     PJ017-tumor-1           0       0     0        0     0   107
#>     PJ017-tumor-2           0       0     0        0     0   101
#>     PJ017-tumor-3           0       0     0        0     0    89
#>     PJ017-tumor-4           0       0     0        0     0    83
#>     PJ017-tumor-5           0       0     0        0     0    64
#>     PJ017-tumor-6           0       0     0        0     0    22
#>     PJ018-tumor-0           0       0     0        0     0   444
#>     PJ018-tumor-1           0       0     0        0     0   429
#>     PJ018-tumor-2           0       0     0        0     0   403
#>     PJ018-tumor-3           0       0     0        0     0   361
#>     PJ018-tumor-4           0       0     0        0     0   262
#>     PJ018-tumor-5           0       0     0        0     0   113
#>     PJ025-tumor-0           0       0     0        0     0   971
#>     PJ025-tumor-1           0       0     0        0     0   941
#>     PJ025-tumor-2           0       0     0        0     0   630
#>     PJ025-tumor-3           0       0     0        0     0   601
#>     PJ025-tumor-4           0       0     0        0     0   523
#>     PJ025-tumor-5           0       0     0        0     0   421
#>     PJ025-tumor-6           0       0     0        0     0   397
#>     PJ025-tumor-7           0       0     0        0     0   381
#>     PJ025-tumor-8           0       0     0        0     0   319
#>     PJ025-tumor-9           0       0     0        0     0   236
#>     PJ030-tumor-0           0       0     0        0     0   563
#>     PJ030-tumor-1           0       0     0        0     0   482
#>     PJ030-tumor-2           0       0     0        0     0   425
#>     PJ030-tumor-3           0       0     0        0     0   419
#>     PJ030-tumor-4           0       0     0        0     0   348
#>     PJ030-tumor-5           0       0     0        0     0    73
#>     PJ032-tumor-0           0       0     0        0     0   195
#>     PJ032-tumor-1           0       0     0        0     0   171
#>     PJ032-tumor-2           0       0     0        0     0    72
#>     PJ032-tumor-3           0       0     0        0     0    62
#>     PJ032-tumor-4           0       0     0        0     0    57
#>     PJ032-tumor-5           0       0     0        0     0    41
#>     PJ035-tumor-0           0       0     0        0     0   512
#>     PJ035-tumor-1           0       0     0        0     0   474
#>     PJ035-tumor-2           0       0     0        0     0   471
#>     PJ035-tumor-3           0       0     0        0     0   435
#>     PJ035-tumor-4           0       0     0        0     0   385
#>     PJ035-tumor-5           0       0     0        0     0   334
#>     PJ035-tumor-6           0       0     0        0     0   241
#>     PJ035-tumor-7           0       0     0        0     0   150
#>     PJ035-tumor-8           0       0     0        0     0    81
#>     PJ048-tumor-0           0       0     0        0     0   545
#>     PJ048-tumor-1           0       0     0        0     0   463
#>     PJ048-tumor-2           0       0     0        0     0   437
#>     PJ048-tumor-3           0       0     0        0     0   333
#>     PJ048-tumor-4           0       0     0        0     0   303
#>     PJ048-tumor-5           0       0     0        0     0   277
#>     PJ048-tumor-6           0       0     0        0     0   244
#>     PJ048-tumor-7           0       0     0        0     0   228
#>     PJ048-tumor-8           0       0     0        0     0   169
#>     tcell                   0       0     0        0    67     0
  • Please make sure that all cell states contain a reasonable number of cells, e.g. >20 or >50, so that their profile can be represented accurately.

  • What to supply for cell.state.labels and cell.type.labels? The definition of cell type and cell state can be somewhat arbitrary (similar to the issue of assigning cell types for scRNA-seq) and depends on the question of interest. Their definitions depend on the granularity we aim at and the confidence of the cell.type.labels in scRNA-seq data. Usually, a good rule of thumb is as follows. 1) Define cell types as the cluster of cells having a sufficient number of significantly differentially expressed genes than other cell types, e.g., greater than 50 or even 100. For clusters that are too similar in transcription, we recommend treating them as cell states, which will be summed up before the final Gibbs sampling. Therefore, cell states are often suitable for cells that form a continuum on the phenotypic manifold rather than distinct clusters. 2) Define multiple cell states for cell types of significant heterogeneity, such as malignant cells, and of interest to deconvolve their transcription.

QC of cell type and state labels

We recommend first plotting the pairwise correlation matrix between cell states and between cell types. This will give us a sense of their quality. In cases where cell types/states are not represented by sufficient amount of information (low cell count and/or low library size), the low-quality cell types/states tend to cluster together. Users may re-cluster the data at higher granularity, or merge those cell types/states with the most similar cell types/states, or remove them (if re-clustering and merging is not appropriate).

plot.cor.phi (input=sc.dat,
                         input.labels=cell.state.labels,
                         title="cell state correlation",
                         #specify pdf.prefix if need to output to pdf
                         #pdf.prefix="gbm.cor.cs", 
                         cexRow=0.2, cexCol=0.2,
                         margins=c(2,2))

plot.cor.phi (input=sc.dat, 
                         input.labels=cell.type.labels, 
                         title="cell type correlation",
                         #specify pdf.prefix if need to output to pdf
                         #pdf.prefix="gbm.cor.ct",
                         cexRow=0.5, cexCol=0.5,
                         )