This article mainly follows the official vignette: https://xlucpu.github.io/MOVICS/MOVICS-VIGNETTE.html
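Introduction
Installation
GET Module
Preparing the data
Filtering features (dimensionality reduction)
Determining the optimal number of subtypes
Subtyping with a single algorithm
Running multiple subtyping algorithms at once
Integrating multiple subtyping results
Checking the quality of the subtyping
Multi-omics subtype heatmap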
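Introduction
Molecular subtyping has long been a popular technique in bioinformatics data mining, and there are many algorithms for it, such as the commonly used non-negative matrix factorization, consensus clustering, and PCA. We covered consensus clustering in an earlier post: "Molecular subtyping based on immune infiltration results".
Today we introduce a one-stop R package for molecular subtyping: MOVICS.
What sets it apart from other subtyping R packages is that it can use multi-omics data at the same time. Ordinary subtyping packages analyze a single omics layer, for example only an mRNA expression matrix, whereas MOVICS can subtype jointly on, say, mRNA, lncRNA, methylation, and mutation data.
Beyond that, it also supports exploring each subtype after clustering and running analyses within each subtype, which is why it can be called a one-stop package. Its functionality falls into three parts, as shown in the schematic below:
[Figure: schematic of the three MOVICS modules]
The first part performs subtyping on different omics data. The second part compares the resulting subtypes. The third part explores each subtype and identifies subtype-specific molecules.
The main functions in each part are listed below and are introduced later: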
GET Module: get subtypes through multi-omics integrative clustering
getElites(): get elites which are those features that pass the filtering procedure and are used for analyses
getClustNum(): get optimal cluster number by calculating clustering prediction index (CPI) and Gap-statistics
get%algorithm_name%(), for example getiClusterBayes(): get results from one specific multi-omics integrative clustering algorithm with detailed parameters
getMOIC(): get a list of results from multiple multi-omics integrative clustering algorithms with parameters by default
getConsensusMOIC(): get a consensus matrix that indicates the clustering robustness across different clustering algorithms and generate a consensus heatmap
getSilhouette(): get quantification of sample similarity using the silhouette score approach
getStdiz(): get standardized data for generating a comprehensive multi-omics heatmap
getMoHeatmap(): get a comprehensive multi-omics heatmap based on clustering results
COMP Module: compare subtypes from multiple perspectives
compSurv(): compare survival outcome and generate a Kaplan-Meier curve with pairwise comparison if possible
compClinvar(): compare and summarize clinical features among different identified subtypes
compMut(): compare mutational frequency and generate an OncoPrint with significant mutations
compTMB(): compare total mutation burden among subtypes and generate distribution of Transitions and Transversions
compFGA(): compare fraction genome altered among subtypes and generate a barplot for distribution comparison
compDrugsen(): compare estimated half maximal inhibitory concentration (IC50) for drug sensitivity and generate a boxviolin for distribution comparison
compAgree(): compare agreement of current subtypes with other pre-existing classifications and generate an alluvial diagram and an agreement barplot
RUN Module: run marker identification and verify subtypes
runDEA(): run differential expression analysis with a choice of three popular methods: edgeR, DESeq2, and limma
runMarker(): run biomarker identification to determine uniquely and significantly differentially expressed genes for each subtype
runGSEA(): run gene set enrichment analysis (GSEA), calculate activity of functional pathways and generate a pathway-specific heatmap
runGSVA(): run gene set variation analysis to calculate the enrichment score of each sample based on a given gene set list of interest
runNTP(): run nearest template prediction based on identified biomarkers to evaluate subtypes in external cohorts
runPAM(): run partition around medoids classifier based on discovery cohort to predict subtypes in external cohorts
runKappa(): run consistency evaluation using Kappa statistics between two appraisements that identify or predict current subtypes
The package has been published; remember to cite it when you use it:
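Lu, X., Meng, J., Zhou, Y., Jiang, L., and Yan, F. (2020). MOVICS: an R package for multi-omics integration and visualization in cancer subtyping. bioRxiv, 2020.09.15.297820. [doi.org/10.1101/2020.09.15.297820]
Installation
The package is currently hosted only on GitHub and can be installed only in the ways shown below. It is best to install the dependencies first, because the package has a great many of them and the installation fails very easily. For beginners, installing this package is not particularly friendly.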
# install online from GitHub
devtools::install_github("xlucpu/MOVICS")
# or download the archive and install it locally
devtools::install_local("E:/R/R包/MOVICS-master.zip")
GET Module
Preparing the data
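Let's first look at the example data.
library(MOVICS)
We use the data bundled with the package for the demonstration; it has already been cleaned. A separate post on how to prepare such data will follow in a few days.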
# TCGA breast cancer data
load(system.file("extdata", "brca.tcga.RData", package = "MOVICS", mustWork = TRUE))
load(system.file("extdata", "brca.yau.RData", package = "MOVICS", mustWork = TRUE))
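brca.tcga holds data from several omics layers, such as mRNA, lncRNA, methylation, and mutation data, together with clinical information such as survival time, survival status, and the PAM50 classification of breast cancer.
For the demonstration, the data were pre-filtered (by MAD) down to a subset of features:
500 mRNAs, 500 lncRNAs, and 1,000 promoter CGI probes/genes with high variation
30 genes that mutated in at least 3% of the entire cohort
Note that the most important point here is that the number, names, and order of the samples must be exactly the same in every omics dataset. You can inspect these data yourself: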
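The consistency requirement above can be checked with a few lines of base R before any clustering is attempted. The following is a minimal sketch on toy matrices; `check_sample_consistency()` is a hypothetical helper written for this illustration and is not part of MOVICS.

```r
# Hypothetical helper (not part of MOVICS): TRUE for each omics matrix whose
# column names (samples) are identical, in the same order, to the first matrix.
check_sample_consistency <- function(omics_list) {
  ref <- colnames(omics_list[[1]])
  vapply(omics_list,
         function(m) identical(colnames(m), ref),
         logical(1))
}

# Toy data: two matrices in matching order, one with the samples reversed
samples <- c("S1", "S2", "S3")
toy <- list(
  mRNA = matrix(1:6, nrow = 2, dimnames = list(c("g1", "g2"), samples)),
  meth = matrix(runif(6), nrow = 2, dimnames = list(c("p1", "p2"), samples)),
  mut  = matrix(0L, nrow = 2, ncol = 3,
                dimnames = list(c("m1", "m2"), rev(samples)))
)
consistency <- check_sample_consistency(toy)
print(consistency)

# Repair: reorder the columns of every matrix to the reference order
toy_fixed <- lapply(toy, function(m) m[, samples, drop = FALSE])
print(check_sample_consistency(toy_fixed))
```

With real data, the same check could be run on `mo.data` before it is passed to the clustering functions.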
names(brca.tcga)
## [1] "mRNA.expr"   "lncRNA.expr" "meth.beta"   "mut.status"  "count"
## [6] "fpkm"        "maf"         "segment"     "clin.info"
names(brca.yau)
## [1] "mRNA.expr" "clin.info"
# extract "mRNA.expr", "lncRNA.expr", "meth.beta", and "mut.status"
mo.data <- brca.tcga[1:4]
# extract raw count data
count <- brca.tcga$count
# extract FPKM data
fpkm <- brca.tcga$fpkm
# extract MAF
maf <- brca.tcga$maf
# extract segmented copy number
segment <- brca.tcga$segment
# extract survival information
surv.info <- brca.tcga$clin.info
Filtering features (dimensionality reduction)
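As the name suggests, getElites picks out the "elites", that is, the strongest features. In other words, this function handles preprocessing and filtering and helps you prepare the data.
It mainly supports the following preprocessing:
Missing-value handling: remove the affected features directly or impute with kNN
Feature filtering: by mad, sd, pca, cox, or freq (for binary data)
Strictly speaking, this is not the first step. You should first clean the data yourself, for example by log-transforming the expression matrix.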
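As an illustration of the cleaning step mentioned above, here is a minimal sketch of a log2(x + 1) transform on a toy expression matrix. The pseudocount of 1 is an assumption chosen for the example; the right transform depends on how your data were quantified.

```r
# Toy expression matrix (genes in rows, samples in columns)
expr <- matrix(c(0, 1, 3, 7, 15, 1023), nrow = 2,
               dimnames = list(c("geneA", "geneB"), c("S1", "S2", "S3")))

# log2(x + 1) keeps zeros at zero and compresses large values
expr_log <- log2(expr + 1)
print(expr_log)
```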
Below are some demonstrations of its features, which are quite powerful.
Missing-value imputation:
# scenario 1: handling missing values
tmp <- brca.tcga$mRNA.expr # get expression data
dim(tmp) # check data dimension
## [1] 500 643
tmp[1,1] <- tmp[2,2] <- NA # add a few NAs
tmp[1:3,1:3] # check data
##         BRCA-A03L-01A BRCA-A04R-01A BRCA-A075-01A
## SCGB2A2            NA          1.42          7.24
## SCGB1D2         10.11            NA          5.88
## PIP              4.54          2.59          4.35
elite.tmp <- getElites(dat       = tmp,
                       method    = "mad",
                       na.action = "rm", # remove features with NA directly
                       elite.pct = 1) # keep 100% of the data
## --2 features with NA values are removed.
## missing elite.num then use elite.pct
dim(elite.tmp$elite.dat)
## [1] 498 643
elite.tmp <- getElites(dat       = tmp,
                       method    = "mad",
                       na.action = "impute", # impute with kNN
                       elite.pct = 1)
## missing elite.num then use elite.pct
dim(elite.tmp$elite.dat)
## [1] 500 643
elite.tmp$elite.dat[1:3,1:3] # NA values have been imputed
##         BRCA-A03L-01A BRCA-A04R-01A BRCA-A075-01A
## SCGB2A2         6.867         1.420          7.24
## SCGB1D2        10.110         4.739          5.88
## PIP             4.540         2.590          4.35
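To make the two strategies concrete, the sketch below reproduces them in base R on a small toy matrix. Note the assumption: the imputation shown is a plain row-mean fill, used only as a simple stand-in; it is not the kNN imputation that getElites performs.

```r
# Toy matrix with two missing values (features in rows, samples in columns)
mat <- matrix(c(NA, 10.11, 4.54,
                1.42, NA, 2.59,
                7.24, 5.88, 4.35),
              nrow = 3,
              dimnames = list(c("SCGB2A2", "SCGB1D2", "PIP"),
                              c("S1", "S2", "S3")))

# Option 1: drop every feature (row) that has at least one NA
mat_rm <- mat[stats::complete.cases(mat), , drop = FALSE]

# Option 2: row-mean imputation (a simple stand-in, NOT kNN)
mat_imp <- mat
for (i in seq_len(nrow(mat_imp))) {
  missing <- is.na(mat_imp[i, ])
  mat_imp[i, missing] <- mean(mat_imp[i, !missing])
}
print(mat_rm)
print(mat_imp)
```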
Filtering features by MAD:
# scenario 2: filter by MAD (median absolute deviation)
tmp <- brca.tcga$mRNA.expr
elite.tmp <- getElites(dat       = tmp,
                       method    = "mad",
                       elite.pct = 0.1) # keep the top 10% of genes by MAD
## missing elite.num then use elite.pct
dim(elite.tmp$elite.dat) # 10% of 500 is 50
## [1] 50 643
elite.tmp <- getElites(dat       = tmp,
                       method    = "sd",
                       elite.num = 100, # keep the top 100 genes by SD
                       elite.pct = 0.1) # ignored because elite.num is given
## elite.num has been provided then discards elite.pct.
dim(elite.tmp$elite.dat)
## [1] 100 643
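For readers who want to see what a MAD-based filter amounts to, here is a minimal base-R sketch on simulated data. It illustrates the general idea (rank features by median absolute deviation and keep the top fraction) and is not a copy of the getElites internals; the matrix and the 10% cutoff are made up for the example.

```r
set.seed(1)
# Toy matrix: 20 features x 10 samples; feature i is simulated with sd = i
toy <- t(sapply(1:20, function(i) rnorm(10, sd = i)))
rownames(toy) <- paste0("gene", 1:20)

# Median absolute deviation of each feature across samples
mad_score <- apply(toy, 1, stats::mad)

# Keep the top 10% of features ranked by MAD (at least one feature)
n_keep <- max(1, round(0.1 * nrow(toy)))
keep_idx <- order(mad_score, decreasing = TRUE)[seq_len(n_keep)]
elite <- toy[keep_idx, , drop = FALSE]
dim(elite)
```

Replacing `stats::mad` with `stats::sd` gives the analogous standard-deviation filter.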
Filtering with PCA, which requires some basic knowledge of PCA (see the earlier post "Principal component analysis in R"):
# scenario 3: filter with PCA
tmp <- brca.tcga$mRNA.expr # get expression data with 500 features
elite.tmp <- getElites(dat       = tmp,
                       method    = "pca",
                       pca.ratio = 0.95) # ratio used to select principal components
## --the ratio used to select principal component is set as 0.95
dim(elite.tmp$elite.dat) # get 204 elite (PCs) left
## [1] 204 643
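The idea behind a ratio such as 0.95 can be sketched with `prcomp` from base R: keep the smallest number of principal components whose cumulative explained variance reaches the ratio. This is an illustration on simulated data under that assumption; how MOVICS computes and scales the components internally may differ.

```r
set.seed(42)
toy <- matrix(rnorm(50 * 30), nrow = 50)  # 50 features x 30 samples

# prcomp expects samples in rows, so transpose the feature-by-sample matrix
pca <- stats::prcomp(t(toy), center = TRUE, scale. = FALSE)

# Cumulative proportion of variance explained by the components
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components reaching 95% cumulative variance
n_pc <- which(cum_var >= 0.95)[1]

# Components-by-samples matrix, analogous in shape to an omics matrix
pcs <- t(pca$x[, seq_len(n_pc), drop = FALSE])
dim(pcs)
```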
Filtering by univariate Cox regression: a univariate Cox analysis is run for each feature, and the significant ones are kept. Survival information must be provided:
# scenario 4: filter with Cox regression
tmp <- brca.tcga$mRNA.expr # get expression data
elite.tmp <- getElites(dat       = tmp,
                       method    = "cox",
                       surv.info = surv.info, # survival information; must contain columns 'futime' and 'fustat'
                       p.cutoff  = 0.05,
                       elite.num = 100) # this parameter has no effect here either
## --all sample matched between omics matrix and survival data.
## 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
dim(elite.tmp$elite.dat) # get 125 elites
## [1] 125 643
table(elite.tmp$unicox$pvalue < 0.05) # 125 genes have nominal pvalue < 0.05
##
## FALSE  TRUE
##   375   125
tmp <- brca.tcga$mut.status # get mutation data
elite.tmp <- getElites(dat       = tmp,
                       method    = "cox",
                       surv.info = surv.info,
                       p.cutoff  = 0.05,
                       elite.num = 100)
## --all sample matched between omics matrix and survival data.
## 7% 13% 20% 27% 33% 40% 47% 53% 60% 67% 73% 80% 87% 93% 100%
dim(elite.tmp$elite.dat) # get 3 elites
## [1] 3 643
table(elite.tmp$unicox$pvalue < 0.05) # 3 mutations have nominal pvalue < 0.05
##
## FALSE  TRUE
##    27     3
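The per-feature Cox screening described above can be sketched with the `survival` package. Everything here is simulated: `surv_toy` and `expr_toy` are hypothetical stand-ins that merely reuse the `futime` and `fustat` column names, and the loop illustrates the general technique rather than the getElites implementation.

```r
library(survival)

set.seed(7)
n <- 100
# Hypothetical survival table with the column names required in the text
surv_toy <- data.frame(futime = rexp(n, rate = 0.1),
                       fustat = rbinom(n, 1, 0.7))
# Toy expression matrix: 5 features x n samples
expr_toy <- matrix(rnorm(5 * n), nrow = 5,
                   dimnames = list(paste0("gene", 1:5), NULL))

# Fit one univariate Cox model per feature and collect the Wald p-value
pvals <- apply(expr_toy, 1, function(x) {
  fit <- coxph(Surv(futime, fustat) ~ x, data = cbind(surv_toy, x = x))
  summary(fit)$coefficients[1, "Pr(>|z|)"]
})

# Features passing a nominal cutoff of 0.05
elite_names <- names(pvals)[pvals < 0.05]
print(pvals)
```

Because the toy features are pure noise, few or none are expected to pass the cutoff; with real data you would usually also consider multiple-testing correction.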
Filtering by mutation frequency, which is meant specifically for binary data such as a 0/1 matrix:
# scenario 5: filter by mutation frequency
tmp <- brca.tcga$mut.status # get mutation data
rowSums(tmp)
## PIK3CA   TP53    TTN   CDH1  GATA3   MLL3  MUC16 MAP3K1  SYNE1  MUC12    DMD
##    208    186    111     83     58     49     48     38     33     32     31
##  NCOR1    FLG   PTEN   RYR2  USH2A  SPTA1 MAP2K4  MUC5B    NEB   SPEN  MACF1
##     31     30     29     27     27     25     25     24     24     23     23
##   RYR3    DST  HUWE1  HMCN1  CSMD1  OBSCN   APOB  SYNE2
##     23     22     22     22     21     21     21     21
elite.tmp <- getElites(dat       = tmp,
                       method    = "freq", # must set as 'freq'
                       elite.num = 80, # here this is a cutoff on the number of mutated samples
                       elite.pct = 0.1) # ignored because elite.num is given
## --method of 'freq' only supports binary omics data (e.g., somatic mutation matrix), and in this manner, elite.pct and elite.num are used to cut frequency.
## elite.num has been provided then discards elite.pct.
rowSums(elite.tmp$elite.dat) # keep only genes mutated in at least 80 samples
## PIK3CA   TP53    TTN   CDH1
##    208    186    111     83
elite.tmp <- getElites(dat       = tmp,
                       method    = "freq",
                       elite.pct = 0.2)
## --method of 'freq' only supports binary omics data (e.g., somatic mutation matrix), and in this manner, elite.pct and elite.num are used to cut frequency.
## missing elite.num then use elite.pct
rowSums(elite.tmp$elite.dat) # only genes mutated in more than 0.2*643 = 128.6 samples
## PIK3CA   TP53
##    208    186
Determining the optimal number of subtypes
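Samples are subtyped according to their molecular profiles, that is, the mRNA, lncRNA, miRNA, and methylation matrices and so on obtained in the previous step.
First use the CPI and Gap statistics to decide how many subtypes to use: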
optk.brca <- getClustNum(data        = mo.data, # 4 omics datasets
                         is.binary   = c(F,F,F,T), # the first three are not binary, the last one is
                         try.N.clust = 2:8, # candidate numbers of subtypes, from 2 to 8
                         fig.name    = "CLUSTER NUMBER OF TCGA-BRCA") # name of the saved file
## calculating Cluster Prediction Index...
## 5% complete
## ...
## 100% complete
## calculating Gap-statistics...
## visualization done...
## --the imputed optimal cluster number is 3 arbitrarily, but it would be better referring to other priori knowledge.
[Figure: CLUSTER NUMBER OF TCGA-BRCA]
A figure in PDF format is generated automatically in the current working directory.
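The function suggests 3, but considering the PAM50 classification of breast cancer, we choose k = 5, that is, 5 subtypes.
So the optimal number of subtypes ultimately depends on your own needs; adjust it flexibly.
Subtyping with a single algorithm
Once the number of subtypes is decided, the clustering itself can be run. Many methods are provided, including the commonly used non-negative matrix factorization and consensus clustering.
For example, subtyping with the Bayesian method (iClusterBayes):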
# perform iClusterBayes (may take a while)
iClusterBayes.res <- getiClusterBayes(data        = mo.data,
                                      N.clust     = 5,
                                      type        = c("gaussian","gaussian","gaussian","binomial"),
                                      n.burnin    = 1800,
                                      n.draw      = 1200,
                                      prior.gamma = c(0.5, 0.5, 0.5, 0.5),
                                      sdev        = 0.05,
                                      thin        = 3)
## clustering done...
## feature selection done...
Alternatively, use the unified function and choose the method yourself; both approaches give exactly the same result:
iClusterBayes.res <- getMOIC(data        = mo.data,
                             N.clust     = 5,
                             methodslist = "iClusterBayes", # specify the algorithm
                             type        = c("gaussian","gaussian","gaussian","binomial"), # data type corresponding to the list
                             n.burnin    = 1800,
                             n.draw      = 1200,
                             prior.gamma = c(0.5, 0.5, 0.5, 0.5),
                             sdev        = 0.05,
                             thin        = 3)
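The returned result contains a clust.res object with two columns: clust indicates the subtype each sample belongs to, and samID records the corresponding sample name. For algorithms that include a feature-selection procedure (such as iClusterBayes, CIMLR, and MoCluster), the result also contains a feat.res object that stores the information from that procedure. For algorithms that involve hierarchical clustering (for example COCA and ConsensusClustering), the corresponding sample dendrogram is also returned as clust.dend, which is useful if you want to place it on the heatmap.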
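A quick way to get a feel for such a result is to tabulate the subtype sizes and pull out the samples of one subtype. The sketch below uses `mock_res`, a hypothetical object built by hand to mimic the clust.res structure described above; its values are invented for illustration.

```r
# Mock object mimicking the described structure (hypothetical values)
mock_res <- list(
  clust.res = data.frame(samID = paste0("S", 1:6),
                         clust = c(1, 1, 2, 3, 3, 3),
                         stringsAsFactors = FALSE)
)

# Number of samples per subtype
subtype_sizes <- table(mock_res$clust.res$clust)
print(subtype_sizes)

# Sample names assigned to subtype 3
cs3 <- mock_res$clust.res$samID[mock_res$clust.res$clust == 3]
print(cs3)
```

On a real run, the same two lines could be applied to `iClusterBayes.res$clust.res`.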
Running multiple subtyping algorithms at once
You can run several algorithms at once and then integrate their results into a final result, which is remarkably powerful:
# perform multi-omics integrative clustering with the rest of 9 algorithms
moic.res.list <- getMOIC(data        = mo.data,
                         methodslist = list("SNF", "PINSPlus", "NEMO", "COCA", "LRAcluster", "ConsensusClustering", "IntNMF", "CIMLR", "MoCluster"), # 9 algorithms
                         N.clust     = 5,
                         type        = c("gaussian", "gaussian", "gaussian", "binomial"))
## --you choose more than 1 algorithm and all of them shall be run with parameters by default.
## SNF done...
## Clustering method: kmeans
## Perturbation method: noise
## PINSPlus done...
## NEMO done...
## COCA done...
## LRAcluster done...
## end fraction
## clustered
## ConsensusClustering done...
## IntNMF done...
## clustering done...
## feature selection done...
## CIMLR done...
## clustering done...
## feature selection done...
## MoCluster done...
Then add the Bayesian result as well, which gives 10 algorithms in total:
moic.res.list <- append(moic.res.list,
                        list("iClusterBayes" = iClusterBayes.res))
# save the result
save(moic.res.list, file = "moic.res.list.rda")
Integrating multiple subtyping results
Borrowing the idea of consensus ensembles, the package integrates the results of multiple subtyping algorithms.
A consensus heatmap can be drawn:
load(file = "moic.res.list.rda")
cmoic.brca <- getConsensusMOIC(moic.res.list = moic.res.list,
                               fig.name      = "CONSENSUS HEATMAP",
                               distance      = "euclidean",
                               linkage       = "average")
[Figure: CONSENSUS HEATMAP]
The result is saved in the current working directory.
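The consensus-ensemble idea can be sketched in a few lines of base R: build a co-membership matrix for each algorithm, average them, and cluster the samples on the result. The labels below are invented, and using 1 - consensus as the dissimilarity is a simplification made for this sketch; it is not claimed to match the distance computation inside getConsensusMOIC. Only the average linkage mirrors the `linkage` argument used above.

```r
# Toy labels for 5 samples from 3 hypothetical algorithms
labels <- list(
  algoA = c(1, 1, 2, 2, 2),
  algoB = c(2, 2, 1, 1, 1),  # same partition as algoA, labels swapped
  algoC = c(1, 1, 1, 2, 2)
)

# Co-membership matrix per algorithm: 1 if two samples share a cluster
co_membership <- lapply(labels, function(cl) outer(cl, cl, "==") * 1)

# Consensus: fraction of algorithms that put each pair of samples together
consensus <- Reduce(`+`, co_membership) / length(co_membership)
print(round(consensus, 2))

# Cluster samples on 1 - consensus with average linkage, then cut into 2 groups
hc <- stats::hclust(stats::as.dist(1 - consensus), method = "average")
final <- stats::cutree(hc, k = 2)
print(final)
```

Because co-membership ignores the label values themselves, algoA and algoB contribute identical matrices even though their numbering differs, which is what makes the ensemble robust to label switching.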
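Checking the quality of the subtyping
Besides inspecting the heatmap above, you can also judge the quality of the subtyping with the Silhouette criterion.
The following explanation comes from the internet:
The Silhouette criterion is an evaluation method for cluster analysis. It measures clustering quality by comparing, for each data point, its distance to the other points in its own cluster with its distance to the points in other clusters. It can help determine the best number of clusters and so improve the reliability and accuracy of the analysis.
It is computed as follows: for each data point i, compute the mean distance a_i to the other points in the same cluster and the mean distance b_i to the points in the nearest other cluster. The Silhouette coefficient of the point is then defined as:
s(i) = (b_i - a_i) / max(a_i, b_i)
The coefficient ranges from -1 to 1. Negative values indicate that a point is more likely to have been assigned to the wrong cluster, and positive values indicate that it is more likely to have been assigned correctly. The mean coefficient can be used to assess the clustering as a whole, so the goal is to maximize the mean Silhouette coefficient and thereby find the best number of clusters.
As the number of clusters increases, the mean coefficient usually first rises and then falls, so we look for the number of clusters at which it peaks. This is usually done with a Silhouette plot, which shows the Silhouette coefficient on the vertical axis against the number of clusters on the horizontal axis and gives an intuitive view of clustering quality.
When using the Silhouette criterion, keep the following points in mind:
The Silhouette coefficient depends on the distance measure used (typically Euclidean or correlation-based), so its values may not be comparable under other measures.
Computing the Silhouette coefficient is slow, so pay attention to computational efficiency on large datasets.
The Silhouette coefficient is not the only evaluation metric; some clustering problems may call for other metrics.
The result is saved in the current working directory: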
getSilhouette(sil      = cmoic.brca$sil, # a sil object returned by getConsensusMOIC()
              fig.path = getwd(),
              fig.name = "SILHOUETTE",
              height   = 5.5,
              width    = 5)
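The formula s(i) = (b_i - a_i) / max(a_i, b_i) is easy to verify by hand. Below is a minimal base-R sketch on two well-separated toy clusters; `silhouette_width()` is a hypothetical helper written for this illustration (it assumes every cluster has at least two points) and is unrelated to how MOVICS derives its sil object.

```r
# Two well-separated toy clusters in 2D
x <- rbind(c(0, 0), c(0, 1), c(1, 0),
           c(10, 10), c(10, 11), c(11, 10))
cl <- c(1, 1, 1, 2, 2, 2)
d <- as.matrix(dist(x))  # pairwise Euclidean distances

# Hypothetical helper: silhouette width of each point
silhouette_width <- function(d, cl) {
  sapply(seq_along(cl), function(i) {
    same <- cl == cl[i]
    same[i] <- FALSE                 # exclude the point itself
    a <- mean(d[i, same])            # mean distance within its own cluster
    other <- cl != cl[i]
    # mean distance to each other cluster, then take the nearest cluster
    b <- min(tapply(d[i, other], cl[other], mean))
    (b - a) / max(a, b)
  })
}

s <- silhouette_width(d, cl)
print(round(s, 3))
mean(s)  # average silhouette width of the whole clustering
```

For these toy clusters every width is close to 1, which is what a clean separation should produce.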
[Figure: SILHOUETTE]
## png
##   2
Multi-omics subtype heatmap
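After subtyping, you will certainly want heatmaps showing how each omics layer differs across the subtypes.
Some preparation is needed first, though.
Convert the methylation β-value matrix to an M-value matrix; the package author recommends this because it displays better
Standardize the data, which is usually done before drawing a heatmap. This is done through scaling, for example compressing all values into [-2, 2], so that values above 2 are shown as 2 and values below -2 are shown as -2
# convert the β-value matrix to an M-value matrix
indata <- mo.data
indata$meth.beta <- log2(indata$meth.beta / (1 - indata$meth.beta))
# standardize the data
plotdata <- getStdiz(data       = indata,
                     halfwidth  = c(2,2,2,NA), # no truncation for mutation
                     centerFlag = c(T,T,T,F), # no center for mutation
                     scaleFlag  = c(T,T,T,F)) # no scale for mutation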
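The two preparation steps can be illustrated on toy numbers. The sketch assumes that standardization means a per-feature (row-wise) z-score followed by truncation at the half-width; `standardize_rows()` is a hypothetical helper for this illustration and is not the getStdiz implementation.

```r
# Beta-to-M transform: M = log2(beta / (1 - beta))
beta <- c(0.2, 0.5, 0.8)
m_value <- log2(beta / (1 - beta))  # approximately -2, 0, 2
print(m_value)

# Hypothetical helper: row-wise z-score, then truncate to [-halfwidth, halfwidth]
standardize_rows <- function(mat, halfwidth = 2) {
  z <- t(scale(t(mat), center = TRUE, scale = TRUE))
  z[z > halfwidth] <- halfwidth
  z[z < -halfwidth] <- -halfwidth
  z
}

# Toy matrix: the first feature has one extreme value
toy <- matrix(c(1, 2, 3, 4, 100,
                5, 5, 6, 6, 5), nrow = 2, byrow = TRUE)
z <- standardize_rows(toy, halfwidth = 1.5)
range(z)
```

Truncation keeps a single extreme value from dominating the color scale, which is why heatmaps are usually drawn on clipped z-scores.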
Here we use the Bayesian subtyping result for the display. First extract the feature-selection result for each omics layer, then pick the top 10 features from each layer for annotation:
feat   <- iClusterBayes.res$feat.res
feat1  <- feat[which(feat$dataset == "mRNA.expr"),][1:10,"feature"]
feat2  <- feat[which(feat$dataset == "lncRNA.expr"),][1:10,"feature"]
feat3  <- feat[which(feat$dataset == "meth.beta"),][1:10,"feature"]
feat4  <- feat[which(feat$dataset == "mut.status"),][1:10,"feature"]
annRow <- list(feat1, feat2, feat3, feat4)
Then simply draw the plot. It is implemented on top of ComplexHeatmap but simplifies many steps for you, and the result is saved automatically in the current working directory. The default MOVICS figures are quite attractive, possibly nicer than what you would draw yourself.
# custom colors for the heatmap of each omics layer (optional)
mRNA.col   <- c("#00FF00", "#008000", "#000000", "#800000", "#FF0000")
lncRNA.col <- c("#6699CC", "white"  , "#FF3C38")
meth.col   <- c("#0074FE", "#96EBF9", "#FEE900", "#F00003")
mut.col    <- c("grey90" , "black")
col.list   <- list(mRNA.col, lncRNA.col, meth.col, mut.col)
# comprehensive heatmap (may take a while)
getMoHeatmap(data          = plotdata,
             row.title     = c("mRNA","lncRNA","Methylation","Mutation"),
             is.binary     = c(F,F,F,T), # the 4th data is mutation which is binary
             legend.name   = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
             clust.res     = iClusterBayes.res$clust.res, # cluster results
             clust.dend    = NULL, # no dendrogram
             show.rownames = c(F,F,F,F), # specify for each omics data
             show.colnames = FALSE, # show no sample names
             annRow        = annRow, # mark selected features
             color         = col.list,
             annCol        = NULL, # no annotation for samples
             annColors     = NULL, # no annotation color
             width         = 10, # width of each subheatmap
             height        = 5, # height of each subheatmap
             fig.name      = "COMPREHENSIVE HEATMAP OF ICLUSTERBAYES")
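[Figure: COMPREHENSIVE HEATMAP OF ICLUSTERBAYES]
The above shows the Bayesian subtyping result; you can pick any other result as well, since we have 10 algorithms.
For example, to display the COCA result, the usage is exactly the same and the result is saved automatically: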
# comprehensive heatmap (may take a while)
getMoHeatmap(data        = plotdata,
             row.title   = c("mRNA","lncRNA","Methylation","Mutation"),
             is.binary   = c(F,F,F,T), # the 4th data is mutation which is binary
             legend.name = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
             clust.res   = moic.res.list$COCA$clust.res, # cluster results
             clust.dend  = moic.res.list$COCA$clust.dend, # show dendrogram for samples
             color       = col.list,
             width       = 10, # width of each subheatmap
             height      = 5, # height of each subheatmap
             fig.name    = "COMPREHENSIVE HEATMAP OF COCA")
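[Figure: COMPREHENSIVE HEATMAP OF COCA]
To display several clinical annotations, simply add them as well. Note that the custom colors (here the continuous age scale) are built with circlize: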
# extract PAM50, pathologic stage and age for sample annotation
annCol    <- surv.info[,c("PAM50", "pstage", "age"), drop = FALSE]
# generate corresponding colors for sample annotation
annColors <- list(age    = circlize::colorRamp2(breaks = c(min(annCol$age),
                                                           median(annCol$age),
                                                           max(annCol$age)),
                                                colors = c("#0000AA", "#555555", "#AAAA00")),
                  PAM50  = c("Basal" = "blue",
                             "Her2"   = "red",
                             "LumA"   = "yellow",
                             "LumB"   = "green",
                             "Normal" = "black"),
                  pstage = c("T1"    = "green",
                             "T2"    = "blue",
                             "T3"    = "red",
                             "T4"    = "yellow",
                             "TX"    = "black"))
# comprehensive heatmap (may take a while)
getMoHeatmap(data          = plotdata,
             row.title     = c("mRNA","lncRNA","Methylation","Mutation"),
             is.binary     = c(F,F,F,T), # the 4th data is mutation which is binary
             legend.name   = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
             clust.res     = cmoic.brca$clust.res, # consensusMOIC results
             clust.dend    = NULL, # show no dendrogram for samples
             show.rownames = c(F,F,F,F), # specify for each omics data
             show.colnames = FALSE, # show no sample names
             show.row.dend = c(F,F,F,F), # show no dendrogram for features
             annRow        = NULL, # no selected features
             color         = col.list,
             annCol        = annCol, # annotation for samples
             annColors     = annColors, # annotation color
             width         = 10, # width of each subheatmap
             height        = 5, # height of each subheatmap
             fig.name      = "COMPREHENSIVE HEATMAP OF CONSENSUSMOIC")
[Figure: COMPREHENSIVE HEATMAP OF CONSENSUSMOIC]
Pretty powerful, isn't it?
That concludes the first part. Next comes exploring and comparing the different subtypes.