Match and add the cluster number to the original data


Question

I am using the standard approach for a hierarchical clustering project.

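library(tm)  # provides TermDocumentMatrix() and removeSparseTerms()

mydata.dtm <- TermDocumentMatrix(mydata.corpus)
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.98)
mydata.df <- as.data.frame(as.matrix(mydata.dtm2))  # rows are terms, columns are documents
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix between terms
fit <- hclust(d, method="ward.D") # "ward" was renamed to "ward.D" in R 3.1.0
groups <- cutree(fit, k=10) # cut the tree into 10 clusters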

groups
              congestion        cough          ear          eye        fever          flu   fluzonenon     medicare painpressure     physical         pink          ppd     pressure 
                       1            2            3            4            5            6            5            5            5            7            4            8            5 
                    rash    screening         shot        sinus         sore       sports     symptoms       throat          uti 
                       5            5            6            1            9            7            5            9           10 

What I want is to put the group number back into a new column of the original data. I have looked at "approximate string matching within single list - r". Because df here is a document matrix, what I get after df <- t(data.frame(mydata.df.scale,cutree(hc,k=10))) is a matrix like

df[1:5,1:5]
     congestion cough ear eye fever
[1,]          0     0   0   0     0
[2,]          0     0   0   0     0
[3,]          0     0   0   0     0
[4,]          0     0   0   1     0
[5,]          0     0   0   0     0

Since eye has group number 4 in the groups output, I want to add the number 4 to the new column in the 4th row.

Note that in this case a single document can map to two terms in the same group.

df[23,17:21]
   sinus     sore   sports symptoms   throat 
       0        1        0        0        1 

Recommended answer

Rather than putting the number back directly, I work with the 0-1 matrix:

# Transpose so that documents are rows and terms are columns
label_back <- t(data.frame(mydata.df, cutree(fit, k=10)))
row.names(label_back) <- NULL

# The last row comes from the appended cutree() column, not from a document; drop it if needed
# label_back <- label_back[1:(nrow(label_back)-1), ]
groups.df <- as.data.frame(groups)
groups.df$label <- rownames(groups.df)

for (i in seq_len(ncol(label_back))) {
  ind <- which(colnames(label_back)[i] == groups.df$label)  # match term names and return the index
  label_back[, i] <- groups.df$groups[ind] * label_back[, i]  # multiply the 0-1 column by its group number
}

Then take the maximum value in each row, because some rows contain more than one non-zero value.

# data is the original data frame, with one row per document
data_group <- rep(0, nrow(data))

for (i in seq_len(nrow(data))) {
  data_group[i] <- max(label_back[i, ])
}
data$group <- data_group

I am looking for a more elegant way.
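One possibly tidier variant of the two loops above is to let sweep() do the column-wise multiplication and apply() take the row maximum. This is only a minimal sketch on made-up toy data, assuming a 0-1 document-by-term matrix and a named cluster vector like the one cutree() returns; doc_term, labelled and doc_group are hypothetical names that do not appear in the original code.

```r
# Toy 0-1 document-by-term matrix (rows = documents, columns = terms); illustrative data only
doc_term <- matrix(
  c(0, 0, 1, 0,
    0, 0, 0, 0,
    0, 1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("cough", "sore", "eye", "throat"))
)

# Named cluster vector in the shape cutree() returns (term name -> group number)
groups <- c(cough = 2, eye = 4, sore = 9, throat = 9)

# Reorder the group numbers to match the matrix columns,
# then multiply every column by the group number of its term
labelled <- sweep(doc_term, 2, groups[colnames(doc_term)], `*`)

# Largest group number per document; documents containing none of the terms get 0
doc_group <- apply(labelled, 1, max)
doc_group
```

As in the loop version, taking the maximum is an arbitrary tie-break when a document contains terms from different groups, so it is worth checking whether that choice suits the data before writing doc_group back with something like data$group <- doc_group.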
