Compute dissimilarity matrix for large data

Question

I'm trying to compute a dissimilarity matrix from a large data frame with both numerical and categorical features. When I run the daisy function from the cluster package, I get the error message:

Error: cannot allocate vector of size X.

In my case X is about 800 GB. Any idea how I can deal with this problem? It would also be great if someone could help me run the function in parallel across multiple cores. Below is the code that computes the dissimilarity matrix on the iris dataset:

library(cluster)  # provides daisy()
# With mixed numeric/categorical columns, daisy() uses Gower's dissimilarity
d <- daisy(iris)

Recommended answer

I've had a similar issue before. Running daisy() on even 5k rows of my dataset took a really long time.
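To see why this scales so badly, here is a rough back-of-the-envelope sketch. It assumes the result is stored the way a dist-style object is, as one 8-byte double per unordered pair of rows; the helper names dissim_bytes and dissim_gib are made up for this illustration, and any temporary memory daisy() uses internally is not counted.

```r
# Estimate the memory needed for a full pairwise dissimilarity object.
# Assumption: one double (8 bytes) is stored per unordered pair of rows.
dissim_bytes <- function(n) {
  n_pairs <- n * (n - 1) / 2   # number of unordered row pairs
  n_pairs * 8                  # 8 bytes per double
}

# Same estimate expressed in GiB
dissim_gib <- function(n) dissim_bytes(n) / 1024^3

# iris has 150 rows, so the result is tiny
iris_pairs <- nrow(iris) * (nrow(iris) - 1) / 2   # 11175 pairs

# Memory grows quadratically with the number of rows
sizes <- c(5e3, 5e4, 5e5)
data.frame(rows = sizes, GiB = round(dissim_gib(sizes), 2))
```

Under this estimate, an allocation of about 800 GB corresponds to a data frame with somewhere around 450,000 rows, so no amount of parallelism will make the full matrix fit in ordinary memory.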

I ended up using the k-means algorithm in the h2o package, which is parallelized and one-hot encodes categorical data. Just make sure to center and scale your data (mean 0, standard deviation 1) before plugging it into h2o.kmeans, so that the clustering algorithm doesn't prioritize columns with large numeric ranges (it works by minimizing distances). I used the scale() function.
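As a minimal sketch of the centering and scaling step, using iris as a stand-in for the real data (the names num_cols and df_scaled are just for this example), it might look like this. Only the numeric columns are passed to scale(); the categorical column is left untouched.

```r
# Identify numeric columns; scale() only makes sense for these
num_cols <- sapply(iris, is.numeric)

df_scaled <- iris
# Center each numeric column to mean 0 and rescale it to standard deviation 1
df_scaled[num_cols] <- scale(iris[num_cols], center = TRUE, scale = TRUE)

# Sanity check: means should be ~0 and standard deviations should be 1
round(colMeans(df_scaled[num_cols]), 10)
sapply(df_scaled[num_cols], sd)
```

The factor column (Species here) stays a factor, which is what lets a downstream tool recognize it as categorical.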

After installing h2o:

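library(h2o)
# Start a local H2O cluster; adjust threads and memory to your machine
h2o.init(nthreads = 16, min_mem_size = '150G')
h2o.df <- as.h2o(df)  # df: the centered and scaled data frame
# vars: character vector of the column names to cluster on
h2o_kmeans <- h2o.kmeans(training_frame = h2o.df, x = vars, k = 5, estimate_k = FALSE, seed = 1234)
summary(h2o_kmeans)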
