Note: if you use BayesPrism in published research, please cite:
Chu, T., Wang, Z., Pe’er, D. Danko, C. G. Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology. Nat Cancer 3, 505–517 (2022). https://doi.org/10.1038/s43018-022-00356-3
Introduction
BayesPrism performs cell type and gene expression deconvolution for bulk RNA-seq (and spatial transcriptomics) using scRNA-seq of samples collected from matched or similar tissue type. It treats the scRNA-seq as prior information, and estimates \(\mathrm{P}(\theta, Z | X, \phi)\) , i.e. the joint posterior distribution of cell type fraction \(\theta\) and cell type-specific gene expression \(Z\) in each bulk conditional on the reference \(\phi\) and each observed bulk \(X\).
We will demonstrate how to use BayesPrism to deconvolve the bulk RNA-seq dataset TCGA-GBM, using the scRNA-seq dataset collected from 8 hold-out cohorts by Yuan et al.. (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-018-0567-9) .
Load the BayesPrism package
suppressWarnings(library(BayesPrism))
#> Loading required package: snowfall
#> Loading required package: snow
#> Loading required package: NMF
#> Loading required package: pkgmaker
#> Loading required package: registry
#> Loading required package: rngtools
#> Loading required package: cluster
#> NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 71/72
#> To enable shared memory capabilities, try: install.extras('
#> NMF
#> ')
Load the dataset
The rdata file is stored at
"your_git_repository/tutorial.dat/tutorial.gbm.rdata"
load("../tutorial.dat/tutorial.gbm.rdata")
ls()
#> [1] "bk.dat" "cell.state.labels" "cell.type.labels" "sc.dat"
dim(bk.dat)
#> [1] 169 60483
head(rownames(bk.dat))
#> [1] "TCGA-06-2563-01A-01R-1849-01" "TCGA-06-0749-01A-01R-1849-01" "TCGA-06-5418-01A-01R-1849-01" "TCGA-06-0211-01B-01R-1849-01" "TCGA-19-2625-01A-01R-1850-01" "TCGA-19-4065-02A-11R-2005-01"
head(colnames(bk.dat))
#> [1] "ENSG00000000003.13" "ENSG00000000005.5" "ENSG00000000419.11" "ENSG00000000457.12" "ENSG00000000460.15" "ENSG00000000938.11"
The rdata file contains four objects required for running BayesPrism.
bk.dat
: The sample-by-gene raw count matrix of bulk RNA-seq expression.rownames
are bulk sample IDs, whilecolnames
are gene names/IDs.
dim(bk.dat)
#> [1] 169 60483
head(rownames(bk.dat))
#> [1] "TCGA-06-2563-01A-01R-1849-01" "TCGA-06-0749-01A-01R-1849-01" "TCGA-06-5418-01A-01R-1849-01" "TCGA-06-0211-01B-01R-1849-01" "TCGA-19-2625-01A-01R-1850-01" "TCGA-19-4065-02A-11R-2005-01"
head(colnames(bk.dat))
#> [1] "ENSG00000000003.13" "ENSG00000000005.5" "ENSG00000000419.11" "ENSG00000000457.12" "ENSG00000000460.15" "ENSG00000000938.11"
sc.dat
: The cell-by-gene raw count matrix of bulk RNA-seq expression.rownames
are bulk cell IDs, whilecolnames
are gene names/IDs.
dim(sc.dat)
#> [1] 23793 60294
head(rownames(sc.dat))
#> [1] "PJ016.V3" "PJ016.V4" "PJ016.V5" "PJ016.V6" "PJ016.V7" "PJ016.V8"
head(colnames(sc.dat))
#> [1] "ENSG00000130876.10" "ENSG00000134438.9" "ENSG00000269696.1" "ENSG00000230393.1" "ENSG00000266744.1" "ENSG00000202281.1"
Note that BayesPrism will perform deconvolution over genes shared between the mixture and scRNA-seq reference, i.e., by intersecting colnames(mixture) and colnames(reference). Therefore, please make sure to use consistent gene annotations (TCGA RNA-seq uses GENCODE v22).
We recommend the use of unnormalized and untransformed count data. When raw count is not available, linear normalization, such as TPM, RPM, RPKM, FPKM, is also acceptable, as BayesPrism is robust to linear multiplicative difference between the reference and mixture. Ideally, if using normalized data, it is the best to supply reference and bulk transformed by the same method. log transformation should be avoided.
cell.type.labels
is a character vector of the same length asnrow(sc.dat)
to denote the cell type of each cell in the reference.
sort(table(cell.type.labels))
#> cell.type.labels
#> tcell oligo pericyte endothelial myeloid tumor
#> 67 160 489 492 2505 20080
cell.state.labels
is a character vector of the same length asnrow(sc.dat)
to denote the cell state of each cell in the reference. In our example, cell states of malignant cells were obtained by sub-clustering the malignant cells from each patient, and cell states of myeloid cells were obtained by clustering myeloid cells from all patients. We define multiple cell states for these two cell types, as they contain substantial heterogeneity while also having sufficient number of cells for sub-clustering.
sort(table(cell.state.labels))
#> cell.state.labels
#> PJ017-tumor-6 PJ032-tumor-5 myeloid_8 PJ032-tumor-4 PJ032-tumor-3 PJ017-tumor-5 tcell PJ032-tumor-2 PJ030-tumor-5 myeloid_7 PJ035-tumor-8 PJ017-tumor-4 PJ017-tumor-3 PJ017-tumor-2 PJ017-tumor-0 PJ017-tumor-1 PJ018-tumor-5 myeloid_6 myeloid_5 PJ035-tumor-7 oligo PJ048-tumor-8 PJ032-tumor-1 PJ032-tumor-0 PJ048-tumor-7 PJ025-tumor-9 PJ035-tumor-6 PJ048-tumor-6 PJ016-tumor-6 PJ018-tumor-4 myeloid_4 PJ048-tumor-5 PJ048-tumor-4 PJ016-tumor-5 PJ025-tumor-8 PJ048-tumor-3 PJ035-tumor-5 PJ030-tumor-4 PJ018-tumor-3 PJ016-tumor-4 PJ025-tumor-7 myeloid_3 PJ035-tumor-4 myeloid_2 PJ025-tumor-6 PJ018-tumor-2 PJ030-tumor-3 PJ016-tumor-3 PJ025-tumor-5 PJ030-tumor-2 PJ018-tumor-1 PJ035-tumor-3 PJ048-tumor-2 PJ018-tumor-0 PJ048-tumor-1 PJ035-tumor-2 PJ035-tumor-1 PJ016-tumor-2 PJ030-tumor-1 pericyte endothelial PJ035-tumor-0 PJ025-tumor-4 myeloid_1 PJ048-tumor-0 myeloid_0 PJ030-tumor-0 PJ025-tumor-3 PJ016-tumor-1 PJ016-tumor-0 PJ025-tumor-2 PJ025-tumor-1 PJ025-tumor-0
#> 22 41 49 57 62 64 67 72 73 75 81 83 89 101 107 107 113 130 141 150 160 169 171 195 228 236 241 244 261 262 266 277 303 308 319 333 334 348 361 375 381 382 385 386 397 403 419 420 421 425 429 435 437 444 463 471 474 481 482 489 492 512 523 526 545 550 563 601 619 621 630 941 971
table(cbind.data.frame(cell.state.labels, cell.type.labels))
#> cell.type.labels
#> cell.state.labels endothelial myeloid oligo pericyte tcell tumor
#> endothelial 492 0 0 0 0 0
#> myeloid_0 0 550 0 0 0 0
#> myeloid_1 0 526 0 0 0 0
#> myeloid_2 0 386 0 0 0 0
#> myeloid_3 0 382 0 0 0 0
#> myeloid_4 0 266 0 0 0 0
#> myeloid_5 0 141 0 0 0 0
#> myeloid_6 0 130 0 0 0 0
#> myeloid_7 0 75 0 0 0 0
#> myeloid_8 0 49 0 0 0 0
#> oligo 0 0 160 0 0 0
#> pericyte 0 0 0 489 0 0
#> PJ016-tumor-0 0 0 0 0 0 621
#> PJ016-tumor-1 0 0 0 0 0 619
#> PJ016-tumor-2 0 0 0 0 0 481
#> PJ016-tumor-3 0 0 0 0 0 420
#> PJ016-tumor-4 0 0 0 0 0 375
#> PJ016-tumor-5 0 0 0 0 0 308
#> PJ016-tumor-6 0 0 0 0 0 261
#> PJ017-tumor-0 0 0 0 0 0 107
#> PJ017-tumor-1 0 0 0 0 0 107
#> PJ017-tumor-2 0 0 0 0 0 101
#> PJ017-tumor-3 0 0 0 0 0 89
#> PJ017-tumor-4 0 0 0 0 0 83
#> PJ017-tumor-5 0 0 0 0 0 64
#> PJ017-tumor-6 0 0 0 0 0 22
#> PJ018-tumor-0 0 0 0 0 0 444
#> PJ018-tumor-1 0 0 0 0 0 429
#> PJ018-tumor-2 0 0 0 0 0 403
#> PJ018-tumor-3 0 0 0 0 0 361
#> PJ018-tumor-4 0 0 0 0 0 262
#> PJ018-tumor-5 0 0 0 0 0 113
#> PJ025-tumor-0 0 0 0 0 0 971
#> PJ025-tumor-1 0 0 0 0 0 941
#> PJ025-tumor-2 0 0 0 0 0 630
#> PJ025-tumor-3 0 0 0 0 0 601
#> PJ025-tumor-4 0 0 0 0 0 523
#> PJ025-tumor-5 0 0 0 0 0 421
#> PJ025-tumor-6 0 0 0 0 0 397
#> PJ025-tumor-7 0 0 0 0 0 381
#> PJ025-tumor-8 0 0 0 0 0 319
#> PJ025-tumor-9 0 0 0 0 0 236
#> PJ030-tumor-0 0 0 0 0 0 563
#> PJ030-tumor-1 0 0 0 0 0 482
#> PJ030-tumor-2 0 0 0 0 0 425
#> PJ030-tumor-3 0 0 0 0 0 419
#> PJ030-tumor-4 0 0 0 0 0 348
#> PJ030-tumor-5 0 0 0 0 0 73
#> PJ032-tumor-0 0 0 0 0 0 195
#> PJ032-tumor-1 0 0 0 0 0 171
#> PJ032-tumor-2 0 0 0 0 0 72
#> PJ032-tumor-3 0 0 0 0 0 62
#> PJ032-tumor-4 0 0 0 0 0 57
#> PJ032-tumor-5 0 0 0 0 0 41
#> PJ035-tumor-0 0 0 0 0 0 512
#> PJ035-tumor-1 0 0 0 0 0 474
#> PJ035-tumor-2 0 0 0 0 0 471
#> PJ035-tumor-3 0 0 0 0 0 435
#> PJ035-tumor-4 0 0 0 0 0 385
#> PJ035-tumor-5 0 0 0 0 0 334
#> PJ035-tumor-6 0 0 0 0 0 241
#> PJ035-tumor-7 0 0 0 0 0 150
#> PJ035-tumor-8 0 0 0 0 0 81
#> PJ048-tumor-0 0 0 0 0 0 545
#> PJ048-tumor-1 0 0 0 0 0 463
#> PJ048-tumor-2 0 0 0 0 0 437
#> PJ048-tumor-3 0 0 0 0 0 333
#> PJ048-tumor-4 0 0 0 0 0 303
#> PJ048-tumor-5 0 0 0 0 0 277
#> PJ048-tumor-6 0 0 0 0 0 244
#> PJ048-tumor-7 0 0 0 0 0 228
#> PJ048-tumor-8 0 0 0 0 0 169
#> tcell 0 0 0 0 67 0
Please make sure that all cell states contain a reasonable number of cells, e.g. >20 or >50, so that their profile can be represented accurately.
What to supply for cell.state.labels and cell.type.labels? The definition of cell type and cell state can be somewhat arbitrary (similar to the issue of assigning cell types for scRNA-seq) and depends on the question of interest. Their definitions depend on the granularity we aim at and the confidence of the cell.type.labels in scRNA-seq data. Usually, a good rule of thumb is as follows. 1) Define cell types as the cluster of cells having a sufficient number of significantly differentially expressed genes than other cell types, e.g., greater than 50 or even 100. For clusters that are too similar in transcription, we recommend treating them as cell states, which will be summed up before the final Gibbs sampling. Therefore, cell states are often suitable for cells that form a continuum on the phenotypic manifold rather than distinct clusters. 2) Define multiple cell states for cell types of significant heterogeneity, such as malignant cells, and of interest to deconvolve their transcription.
QC of cell type and state labels
We recommend first plotting the pairwise correlation matrix between cell states and between cell types. This will give us a sense of their quality. In cases where cell types/states are not represented by sufficient amount of information (low cell count and/or low library size), the low-quality cell types/states tend to cluster together. Users may re-cluster the data at higher granularity, or merge those cell types/states with the most similar cell types/states, or remove them (if re-clustering and merging is not appropriate).
plot.cor.phi (input=sc.dat,
input.labels=cell.state.labels,
title="cell state correlation",
#specify pdf.prefix if need to output to pdf
#pdf.prefix="gbm.cor.cs",
cexRow=0.2, cexCol=0.2,
margins=c(2,2))
plot.cor.phi (input=sc.dat,
input.labels=cell.type.labels,
title="cell type correlation",
#specify pdf.prefix if need to output to pdf
#pdf.prefix="gbm.cor.ct",
cexRow=0.5, cexCol=0.5,
)