Clustering a very large dataset in R
Question
I have a dataset of 70,000 numeric values representing distances from 0 to 50, and I want to cluster these numbers. The classical clustering approach would require building a 70,000 × 70,000 distance matrix holding the distance between every pair of values in my dataset, which won't fit in memory. Is there any smart way to solve this problem without resorting to stratified sampling? I also tried the bigmemory and big analytics libraries in R, but still can't fit the data into memory.
Answer
You can use kmeans, which normally handles this amount of data, to compute a sizable number of centers (1,000, 2,000, ...), and then run a hierarchical clustering approach on the coordinates of those centers. This way the distance matrix is much smaller.
## Example
# Data
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
# Hierarchical clustering (HCPC) directly on the raw data:
# does not necessarily work at this size
library(FactoMineR)
cah.test <- HCPC(x, graph = FALSE, nb.clust = -1)
# HCPC on kmeans centers instead: works quickly
cl <- kmeans(x, 1000, iter.max = 20)
cah <- HCPC(cl$centers, graph = FALSE, nb.clust = -1)
plot.HCPC(cah, choice = "tree")
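Since the question's data is one-dimensional (70,000 distances between 0 and 50), here is a minimal sketch of the same two-step idea using only base R (`hclust` in place of FactoMineR's `HCPC`); the 1,000-center count and the cut at `k = 4` clusters are illustrative assumptions, not values from the original answer:

```r
set.seed(1)
# Simulated stand-in for the question's data: 70,000 distances in [0, 50]
d <- runif(70000, min = 0, max = 50)

# Step 1: k-means reduces 70,000 points to 1,000 centers
cl <- kmeans(d, centers = 1000, iter.max = 20)

# Step 2: hierarchical clustering on the centers only,
# so the distance matrix is 1,000 x 1,000 instead of 70,000 x 70,000
hc <- hclust(dist(cl$centers), method = "ward.D2")

# Cut the tree into k clusters (k = 4 chosen arbitrarily here),
# then map each original point to a cluster via its k-means center
k <- 4
center.cluster <- cutree(hc, k = k)
point.cluster <- center.cluster[cl$cluster]
```

The key memory saving is that `dist()` is only ever called on the 1,000 centers; each original point inherits the final cluster label of the center it was assigned to.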