通过R中的聚类为PCA图着色 [英] Colouring a PCA plot by clusters in R

查看:209
本文介绍了通过R中的聚类为PCA图着色的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一些生物学数据,看起来像这样,有2种不同类型的簇(A和B):

I have some biological data that looks like this, with 2 different types of clusters (A and B):

                Cluster_ID       A1      A2      A3       B1       B2      B3
 5  chr5:100947454..100947489,+   3.31322  7.52365  3.67255  21.15730  8.732710 17.42640
12 chr5:101227760..101227782,+   1.48223  3.76182  5.11534  15.71680  4.426170 13.43560
29 chr5:102236093..102236457,+  15.60700 10.38260 12.46040   6.85094 15.551400  7.18341

我清理数据:

CAGE<-read.table("CAGE_expression_matrix.txt", header=T)
CAGE_data <- as.data.frame(CAGE)

#Remove clusters with 0 expression for all 6 samples
CAGE_filter <- CAGE[rowSums(abs(CAGE[,2:7]))>0,]

#Filter whole file to keep only clusters with at least 5 TPM in at least 3 files
CAGE_filter_more <- CAGE_filter[apply(CAGE_filter[,2:7] >= 5,1,sum) >= 3,]
CAGE_data <- as.data.frame(CAGE_filter_more)

此后,数据大小从6981个群集减少到599个.

The data size is reduced from 6981 clusters to 599 after this.

然后我继续申请PCA:

I then go on to apply PCA:

#Get data dimensions

dim(CAGE_data)
PCA.CAGE<-prcomp(CAGE_data[,2:7], scale.=TRUE) 
summary(PCA.CAGE)

我想创建一个数据的PCA图,标记每个样本,并根据样本的类型(A或B)为样本着色.因此,对于每个带有文本标签的图,该图应为两种颜色.

I want to create a PCA plot of the data, marking each sample and coloring the samples depending on their type (A or B.) So it should be two colors for the plot with text labels for each sample.

这是我尝试的错误结果:

This is what I have tried, to erroneous results:

qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data, data=as.data.frame(PCA.CAGE$x))

ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more, label=CAGE_filter_more)) + geom_point() + geom_text()

qplot(PCA.CAGE[1:3], PCA.CAGE[4:6], label=colnames(PC1, PC2, PC3), geom=c("point", "text"))

错误显示如下:

  > qplot(PCA.CAGE$x[,1:3],PCA.CAGE$x[4:6,], xlab="Data 1", ylab="Data 2")

  Error: Aesthetics must either be length one, or the same length as the dataProblems:PCA.CAGE$x[4:6, ]

  > qplot(PC1, PC2, colour = CAGE_data, geom=c("point"), label=CAGE_data,    data=as.data.frame(PCA.CAGE$x))

  Don't know how to automatically pick scale for object of type data.frame.   Defaulting to continuous
  Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous
  Error: Aesthetics must either be length one, or the same length as the dataProblems:CAGE_data, CAGE_data

 > ggplot(data=PCA.CAGE, aes(x=PCA1, y=PCA2, colour=CAGE_filter_more,      label=CAGE_filter_more)) + geom_point() + geom_text()

 Error: ggplot2 doesn't know how to deal with data of class 

推荐答案

您的问题没有任何意义(至少对我而言).您似乎有两组3个变量(A组和B组).在这6个变量上运行PCA时,您将获得6个主要组件,每个组件都是所有6个变量的(不同)线性组合.聚类基于案例(行).如果要基于前两台PC来对数据进行聚类(一种常见方法),则需要明确地进行此操作.这是一个使用内置iris数据集的示例.

Your question doesn't make sense (to me at least). You seem to have two groups of 3 variables (the A group and the B group). When you run PCA on these 6 variables, you'll get 6 principle components, each of which is a (different) linear combination of all 6 variables. Clustering is based on the cases (rows). If you want to cluster the data based on the first two PCs (a common approach), then you need to do that explicitly. Here's an example using the built-in iris data-set.

pca   <- prcomp(iris[,1:4], scale.=TRUE)
clust <- kmeans(pca$x[,1:2], centers=3)$cluster
library(ggbiplot)
ggbiplot(pca, groups=factor(clust)) + xlim(-3,3)

因此,这里我们在iris的前4列上运行PCA.然后,pca$x是在各列中包含主要成分的矩阵.因此,我们基于前两台PC运行k-means聚类,并将聚类号提取到clust中.然后,我们使用ggibplot(...)进行绘制.

So here we run PCA on the first 4 columns of iris. Then, pca$x is a matrix containing the principle components in the columns. So then we run k-means clustering based on the first 2 PCs, and extract the cluster numbers into clust. Then we use ggibplot(...) to make the plot.

这篇关于通过R中的聚类为PCA图着色的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆