clustering very large dataset in R


Question


I have a dataset of 70,000 numeric values representing distances ranging from 0 to 50, and I want to cluster these numbers. However, the classical clustering approach would require building a 70,000 × 70,000 distance matrix holding the distance between every pair of values in my dataset, and that matrix won't fit in memory. Is there a smart way to solve this problem without resorting to stratified sampling? I also tried the bigmemory and big analytics libraries in R, but I still can't fit the data into memory.
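For scale, a back-of-envelope check (assuming a dense matrix of 8-byte doubles) shows why the full distance matrix cannot fit in typical RAM:

```r
# memory needed for a dense 70,000 x 70,000 double-precision distance matrix
n <- 70000
bytes <- n^2 * 8            # 8 bytes per double
round(bytes / 1024^3, 1)    # size in GiB: about 36.5
```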

Answer


You can use k-means, which normally copes with this amount of data, to compute a fairly large number of centers (1,000, 2,000, ...), and then run a hierarchical clustering approach on the coordinates of those centers. This way the distance matrix is much smaller.

## Example
# Data: 70,000 two-dimensional points (two blocks of 35,000 rows)
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Hierarchical clustering (CAH) directly on all points: does not
# necessarily work at this size, since it needs the full distance matrix
library(FactoMineR)
cah.test <- HCPC(x, graph = FALSE, nb.clust = -1)

# CAH on k-means centers: works quickly
cl <- kmeans(x, 1000, iter.max = 20)
cah <- HCPC(cl$centers, graph = FALSE, nb.clust = -1)
plot(cah, choice = "tree")
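The same two-stage idea can be sketched in base R alone (no FactoMineR), using `hclust`/`cutree` on the k-means centers and then mapping every original point back to its final cluster through its center. The data, the number of centers (200), and the final cluster count (2) are illustrative choices, not values from the question:

```r
set.seed(1)
# one-dimensional toy data in the 0-50 range, as in the question
x <- c(rnorm(35000, mean = 10, sd = 3), rnorm(35000, mean = 40, sd = 3))

# stage 1: compress 70,000 points into 200 k-means centers
# (no pairwise distance matrix is ever built here)
cl <- kmeans(x, centers = 200, iter.max = 20)

# stage 2: hierarchical clustering on the centers only (200 x 200 distances)
hc <- hclust(dist(cl$centers), method = "ward.D2")
center.clust <- cutree(hc, k = 2)

# map each original point to its final cluster via its k-means center
point.clust <- center.clust[cl$cluster]
table(point.clust)
```

`cl$cluster` gives each point's center index, so indexing `center.clust` with it propagates the hierarchical labels back to all 70,000 points without ever materializing the large matrix.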

