K-means with really large matrix


Problem description


I have to perform a k-means clustering on a really huge matrix (about 300,000 × 100,000 values, which is more than 100 GB). I want to know if I can use R or Weka to do this. My computer is a multiprocessor machine with 8 GB of RAM and hundreds of GB of free disk space.


I have enough space for the calculations, but loading such a matrix seems to be a problem in R (I don't think the bigmemory package would help me, and a big matrix automatically uses all my RAM and then my swap file if there is not enough space).
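A quick back-of-the-envelope calculation shows why the full matrix cannot fit in 8 GB of RAM (assuming the values are stored as 8-byte doubles, which is how R stores numeric matrices):

```python
rows, cols = 300_000, 100_000
bytes_per_value = 8  # R numeric matrices use 8-byte doubles
total_bytes = rows * cols * bytes_per_value
print(total_bytes / 1e9, "GB")  # 240.0 GB, i.e. 30x the available 8 GB of RAM
```

Even with 4-byte single-precision values the matrix would still need 120 GB, so any in-memory approach is ruled out on this machine.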


So my question is: what software should I use (possibly in combination with other packages or custom settings)?

Thanks for your help.


Note: I use Linux.

Answer


Does it have to be K-means? Another possible approach is to transform your data into a network first, then apply graph clustering. I am the author of MCL, an algorithm used quite often in bioinformatics. The implementation linked to should easily scale up to networks with millions of nodes - your example would have 300K nodes, assuming that you have 100K attributes. With this approach, the data will be naturally pruned in the data transformation step - and that step will quite likely become the bottleneck. How do you compute the distance between two vectors? In the applications that I have dealt with I used the Pearson or Spearman correlation, and MCL is shipped with software to efficiently perform this computation on large scale data (it can utilise multiple CPUs and multiple machines).
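For illustration of the distance step (this is not MCL's own code, and the toy vectors are made up), Pearson correlation between two rows can be computed like this:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly correlated rows score 1.0; anti-correlated rows score -1.0.
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
print(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # -1.0
```

In a network-based workflow you would threshold these correlations (e.g. keep only pairs above some cutoff) to build a sparse graph, which is exactly the pruning the answer refers to.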


There is still an issue with the data size, as most clustering algorithms will require you to perform all pairwise comparisons at least once. Is your data really stored as a giant matrix? Do you have many zeros in the input? Alternatively, do you have a way of discarding smaller elements? Do you have access to more than one machine in order to distribute these computations?
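If it does have to be k-means, one common way around the memory limit is to stream the rows from disk in batches and update the centroids incrementally, in the spirit of mini-batch k-means. A minimal sketch (my own, not from the answer; how you read the batches from disk is an assumption left to the reader):

```python
def assign(point, centroids):
    """Index of the nearest centroid by squared Euclidean distance."""
    best, best_d = 0, float("inf")
    for j, c in enumerate(centroids):
        d = sum((p - q) ** 2 for p, q in zip(point, c))
        if d < best_d:
            best, best_d = j, d
    return best

def minibatch_kmeans(batches, k):
    """Stream batches of rows; never hold the full matrix in memory.

    batches: an iterable yielding lists of points (e.g. rows read
    from disk chunk by chunk). The first k points seen seed the
    centroids.
    """
    centroids, counts = [], []
    for batch in batches:
        for point in batch:
            if len(centroids) < k:
                centroids.append(list(point))
                counts.append(1)
                continue
            j = assign(point, centroids)
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-centroid learning rate
            centroids[j] = [(1 - eta) * c + eta * p
                            for c, p in zip(centroids[j], point)]
    return centroids
```

Only one batch needs to fit in RAM at a time, so a 240 GB matrix can be processed in, say, 1 GB chunks. The scikit-learn library offers a tuned implementation of this idea as `MiniBatchKMeans` (with `partial_fit` for out-of-core use), if you would rather not roll your own.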
