Hierarchical clustering using cosine distance in R


Question

I want to do hierarchical clustering of a corpus of documents using cosine similarity in R, but I got the following error:

Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") : missing value where TRUE/FALSE needed

What should I do?

To reproduce it, here's an example:

library(tm)
doc <- c( "The sky is blue.", "The sun is bright today.", "The sun in the sky is bright.", "We can see the shining sun, the bright sun." )
doc_corpus <- Corpus( VectorSource(doc) )
control_list <- list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)
tdm <- TermDocumentMatrix(doc_corpus, control = control_list)



tf <- as.matrix(tdm)
( idf <- log( ncol(tf) / ( 1 + rowSums(tf != 0) ) ) )
( idf <- diag(idf) )
tf_idf <- crossprod(tf, idf)
colnames(tf_idf) <- rownames(tf)

tf_idf

cosine_dist = 1-crossprod(tf_idf) /(sqrt(colSums(tf_idf^2)%*%t(colSums(tf_idf^2))))
cluster1 <- hclust(cosine_dist, method = "ward.D")

Then I got the error:

Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") : missing value where TRUE/FALSE needed

Recommended answer

There are 2 problems:

1: cosine_dist = 1-crossprod(tf_idf) /(sqrt(colSums(tf_idf^2)%*%t(colSums(tf_idf^2)))) creates NaNs because you divide by 0: any column of tf_idf that is all zeros has norm 0, so its entries become 0/0.

2: hclust needs a dist object, not just a matrix. See ?hclust for more details.
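As a rough illustration, the two problems can be seen in isolation on a tiny made-up matrix (the matrix `m` and all variable names here are hypothetical, not taken from the question). This is only a sketch of the symptoms, not a fix:

```r
# Hypothetical 2 x 3 matrix; the second column is all zeros, so its norm is 0.
m <- matrix(c(1, 2,
              0, 0,
              3, 1), nrow = 2)

norms <- sqrt(colSums(m^2))

# Problem 1: dividing by the product of norms gives 0/0 = NaN for the zero column.
sim <- crossprod(m) / outer(norms, norms)
has_nan <- any(is.nan(sim))

# Problem 2: the result is a plain matrix, not a "dist" object,
# so it lacks the "Size" attribute that hclust() reads.
d <- 1 - sim
is_dist <- inherits(d, "dist")
size_attr <- attr(d, "Size")

print(has_nan)    # TRUE
print(is_dist)    # FALSE
print(size_attr)  # NULL
```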

Both can be solved with the following code:

.....
cosine_dist = 1-crossprod(tf_idf) /(sqrt(colSums(tf_idf^2)%*%t(colSums(tf_idf^2))))

# replace NaN's with 0
cosine_dist[is.na(cosine_dist)] <- 0

# create dist object
cosine_dist <- as.dist(cosine_dist)

cluster1 <- hclust(cosine_dist, method = "ward.D")

plot(cluster1)
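Note that replacing NaN with 0 makes a zero-norm vector look identical (distance 0) to everything else. One possible variation, sketched below under assumptions, is to drop zero-norm columns before dividing. The helper `cosine_dist_cols` and the `toy` count matrix are hypothetical names and made-up numbers for illustration; they are not part of the original answer:

```r
# Hypothetical helper: cosine distance between the columns of a numeric matrix.
cosine_dist_cols <- function(x) {
  norms <- sqrt(colSums(x^2))
  keep <- norms > 0                   # drop zero-norm columns to avoid 0/0
  x <- x[, keep, drop = FALSE]
  norms <- norms[keep]
  sim <- crossprod(x) / outer(norms, norms)
  as.dist(1 - sim)                    # hclust() expects a "dist" object
}

# Toy term-by-document counts (made-up numbers, not the tf-idf from the question).
toy <- matrix(c(1, 0, 1, 0,
                0, 1, 1, 0,
                0, 1, 1, 1,
                0, 2, 1, 0),
              nrow = 4,
              dimnames = list(c("blue", "sun", "sky", "bright"),
                              paste0("doc", 1:4)))

d <- cosine_dist_cols(toy)
cl <- hclust(d, method = "ward.D")
plot(cl)
```

Because the columns of `toy` are documents, this clusters documents; passing `t(toy)` instead would cluster terms.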
