Clustering: how to extract the most distinguishing features?

Views: 411 | Published: 2020/10/3 2:03:04 | Tags: r, cluster-analysis, text-mining


Question

I have a set of documents that I am trying to cluster based on their vocabulary (that is, first making a corpus, then a sparse matrix with the DocumentTermMatrix command, and so on). To improve the clusters and to better understand which features/words make a particular document fall into a particular cluster, I would like to know what the most distinguishing features of each cluster are.

There is an example of this in the book Machine Learning with R by Lantz, if you happen to know it: he clusters teen social media profiles by the interests they have pegged, and ends up with a table like this that shows "each cluster ... with the features that most distinguish it from the other clusters":

cluster 1  | cluster 2 | cluster 3 | ...
swimming   | band      | sports    | ...
dance      | music     | kissed    | ...

Now, my features aren't quite as informative, but I'd still like to be able to build something like that.

However, the book does not explain how the table was constructed. I have tried my best to google creatively, and perhaps the answer is some obvious calculation on the cluster means, but being a newbie to R as well as to statistics, I could not figure it out. Any help is much appreciated, including links to previous questions or other resources I may have missed!

Thanks.

Recommended answer

I had a similar problem some time ago.

Here is what I did:

library("tm")
library("skmeans")
library("slam")

# Arguments
# clus:   an skmeans object
# dtm:    a DocumentTermMatrix
# first:  number of most frequent words to return per cluster (e.g. 10)
# unique: if FALSE, all words of the DTM are used
#         if TRUE, only cluster-specific words are used

# Result
# A list with the words of each cluster and their frequencies.
# If unique = TRUE, only cluster-specific words are considered:
# words which occur in more than one cluster are ignored.

mfrq_words_per_cluster <- function(clus, dtm, first = 10, unique = TRUE) {
  if (!inherits(clus, "skmeans")) stop("clus must be an skmeans object")

  dtm <- as.simple_triplet_matrix(dtm)
  ids <- clus$cluster # cluster id of each document, in the row order of dtm
  if (length(ids) != nrow(dtm)) stop("clus and dtm must cover the same documents")

  # sum up the word counts per cluster: one row per word, one column per cluster;
  # documents are selected by position so that they stay aligned with the rows of dtm
  frqM <- sapply(seq_len(max(ids)), function(k) col_sums(dtm[ids == k, ]))

  if (unique) {
    # eliminate words which occur in several clusters
    frqM <- frqM[rowSums(frqM > 0) == 1, , drop = FALSE]
  }
  # export to list, order by frequency and take the first `first` elements
  res <- lapply(seq_len(ncol(frqM)), function(i)
                head(sort(frqM[, i], decreasing = TRUE), first))

  names(res) <- paste0("CLUSTER_", seq_len(ncol(frqM)))
  res
}
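To make the `unique = TRUE` step in the function above easier to follow, here is a minimal base R sketch of the filter on its own. The matrix `frq` and its counts are made up purely for illustration; they are not taken from any real corpus.

```r
# Toy word-by-cluster frequency matrix (invented numbers, illustration only)
frq <- matrix(c(5, 0, 0,
                2, 3, 0,
                0, 0, 4,
                0, 7, 0),
              ncol = 3, byrow = TRUE,
              dimnames = list(c("oil", "price", "opec", "barrel"),
                              paste0("CLUSTER_", 1:3)))

# A word counts as cluster specific if it has a non-zero
# frequency in exactly one column (i.e. in exactly one cluster)
specific <- frq[rowSums(frq > 0) == 1, , drop = FALSE]

# "price" occurs in two clusters, so it is the only word dropped
rownames(specific)
```

The `drop = FALSE` keeps the result a matrix even if only a single word survives the filter.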

A small example:

data("crude")
dtm <- DocumentTermMatrix(crude, control =
                          list(removePunctuation = TRUE,
                               removeNumbers = TRUE,
                               stopwords = TRUE))

rownames(dtm) <- paste0("Doc_", 1:20)
clus <- skmeans(dtm, 3)


mfrq_words_per_cluster(clus, dtm)
mfrq_words_per_cluster(clus, dtm, unique = FALSE)
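The question also wonders whether the table could come from "some obvious calculation on the cluster means". One possible reading of that idea, sketched below in base R, ranks the words of each cluster by how far the cluster mean lies above the overall mean. This is only an assumption about what such a calculation could look like; it is not claimed to be how the table in the book was built. The helper `distinguishing_terms` and the toy data are hypothetical.

```r
# m:       dense documents-by-words matrix
# cluster: cluster id of each document (one entry per row of m)
# n:       number of words to return per cluster
distinguishing_terms <- function(m, cluster, n = 5) {
  overall <- colMeans(m) # mean frequency of each word over all documents
  lapply(split(seq_len(nrow(m)), cluster), function(rows) {
    # difference between the cluster mean and the overall mean, per word
    diff <- colMeans(m[rows, , drop = FALSE]) - overall
    head(sort(diff, decreasing = TRUE), n)
  })
}

# Toy data (invented counts, illustration only)
m <- matrix(c(3, 0, 1,
              2, 0, 1,
              0, 4, 1,
              0, 2, 1),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("Doc_", 1:4),
                            c("swimming", "band", "sports")))

res <- distinguishing_terms(m, cluster = c(1, 1, 2, 2), n = 1)
res
```

With the objects from the example above, something like `distinguishing_terms(as.matrix(dtm), clus$cluster)` should work in the same way, assuming the DTM is small enough to be converted to a dense matrix.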

HTH
