Clustering a very large dataset in R
Question
I have a dataset of 70,000 numeric values representing distances from 0 to 50, and I want to cluster these numbers. The classical clustering approach would require building a 70,000 × 70,000 distance matrix holding the distance between every pair of values in my dataset, which won't fit in memory. Is there any smart way to solve this problem without resorting to stratified sampling? I also tried the bigmemory and big analytics libraries in R, but still can't fit the data into memory.
Answer
You can use kmeans, which normally handles this amount of data, to compute a sizable number of centers (1,000, 2,000, ...), and then run a hierarchical clustering approach on the coordinates of those centers. This way the distance matrix is much smaller.
## Example
# Data
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
# Hierarchical clustering (HCPC) directly on the raw data:
# does not necessarily work at this size
library(FactoMineR)
cah.test <- HCPC(x, graph = FALSE, nb.clust = -1)
# HCPC on kmeans centers instead: works quickly
cl <- kmeans(x, 1000, iter.max = 20)
cah <- HCPC(cl$centers, graph = FALSE, nb.clust = -1)
plot.HCPC(cah, choice = "tree")
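Since the question's data is one-dimensional (70,000 distances between 0 and 50), here is a minimal sketch of the same two-step idea using only base R (`hclust` in place of FactoMineR's `HCPC`); the 1,000-center count and the cut at `k = 4` clusters are illustrative assumptions, not values from the original answer:

```r
set.seed(1)
# Simulated stand-in for the question's data: 70,000 distances in [0, 50]
d <- runif(70000, min = 0, max = 50)

# Step 1: k-means reduces 70,000 points to 1,000 centers
cl <- kmeans(d, centers = 1000, iter.max = 20)

# Step 2: hierarchical clustering on the centers only,
# so the distance matrix is 1,000 x 1,000 instead of 70,000 x 70,000
hc <- hclust(dist(cl$centers), method = "ward.D2")

# Cut the tree into k clusters (k = 4 chosen arbitrarily here),
# then map each original point to a cluster via its k-means center
k <- 4
center.cluster <- cutree(hc, k = k)
point.cluster <- center.cluster[cl$cluster]
```

The key memory saving is that `dist()` is only ever called on the 1,000 centers; each original point inherits the final cluster label of the center it was assigned to.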