Hierarchical clustering of 1 million objects


Question

Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange.

hcluster had trouble with 18k objects. Orange was able to cluster 18k objects in seconds, but failed with 100k objects (it saturated memory and eventually crashed).

I am running on a 64-bit Xeon CPU (2.53 GHz) with 8 GB of RAM + 3 GB of swap, on Ubuntu 11.10.

Recommended answer

To beat O(n^2), you'll have to first reduce your 1M points (documents) to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or ...
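
To see why O(n^2) is the wall here, a back-of-the-envelope estimate helps. The sketch below assumes the tool keeps every pairwise distance as a 64-bit float in a condensed (upper-triangle) matrix; that is an assumption about how hcluster and Orange behave, not something stated in the question. The helper name `condensed_distance_bytes` is made up for this illustration.

```python
def condensed_distance_bytes(n, bytes_per_value=8):
    """Bytes needed to hold all n*(n-1)/2 pairwise distances (float64 assumed)."""
    return n * (n - 1) // 2 * bytes_per_value


# Sizes taken from the question: 18k worked, 100k crashed, 1M is the goal.
for n in (18_000, 100_000, 1_000_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>9,} objects: {gib:,.1f} GiB of pairwise distances")
```

Under that assumption, 18k objects need a little over 1 GiB, 100k objects need roughly 37 GiB (well past 8 GB of RAM plus 3 GB of swap), and 1M objects need several TiB.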
Two possible approaches:

  • build a hierarchical tree from, say, 15k points, then add the rest one by one: time ~ 1M * tree depth

  • first build 100 or 1000 flat clusters, then build your hierarchical tree of the 100 or 1000 cluster centres.
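
The second approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the points are dense numeric vectors (not raw text), k-means is used for the flat stage, and Ward linkage for the tree; none of these choices comes from the answer itself. `two_stage_tree` is a hypothetical helper name, and the synthetic data is far smaller than 1M points.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.cluster.vq import kmeans2


def two_stage_tree(points, n_flat=100):
    """Flat k-means first, then an agglomerative tree over the centres only."""
    # Stage 1: flat clustering; each iteration costs about n_points * n_flat.
    centres, flat_labels = kmeans2(points, n_flat, minit="++")
    # Stage 2: Ward linkage on n_flat centres, so the O(n^2) step stays tiny.
    tree = linkage(centres, method="ward")
    return centres, flat_labels, tree


# Synthetic stand-in data (assumption): 20k points in 8 dimensions.
rng = np.random.default_rng(0)
points = rng.normal(size=(20_000, 8))

centres, flat_labels, tree = two_stage_tree(points, n_flat=100)

# Cut the tree of centres into at most 5 groups, then push the group id
# of each centre down to the points assigned to that centre.
centre_groups = fcluster(tree, t=5, criterion="maxclust")
point_groups = centre_groups[flat_labels]
print(tree.shape, point_groups.shape)
```

The tree only describes how the flat clusters relate to each other; structure inside each flat cluster is lost unless you cluster its members again separately.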

How well either of these works depends critically on the size and shape of your target tree: how many levels, how many leaves?
What software are you using, and how many hours or days do you have to do the clustering?

For the flat-cluster approach, k-d trees work fine for points in 2d, 3d, 20d, even 128d, but that is not your case. I know hardly anything about clustering text; perhaps locality-sensitive hashing?
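
For readers unfamiliar with the idea, here is a minimal sketch of one locality-sensitive hashing scheme, random-hyperplane hashing for cosine similarity. It assumes the documents have already been turned into numeric vectors; the function name `lsh_buckets` and the parameter `n_bits` are made up for this example, and a real text pipeline would more likely use sparse vectors and several hash tables.

```python
from collections import defaultdict

import numpy as np


def lsh_buckets(vectors, n_bits=8, seed=0):
    """Group row indices by the sign pattern of n_bits random projections."""
    rng = np.random.default_rng(seed)
    # One random hyperplane (through the origin) per hash bit.
    planes = rng.normal(size=(vectors.shape[1], n_bits))
    bits = (vectors @ planes) > 0               # shape (n_rows, n_bits)
    keys = bits @ (1 << np.arange(n_bits))      # pack each row into an integer
    buckets = defaultdict(list)
    for row, key in enumerate(keys):
        buckets[int(key)].append(row)
    return buckets


# Rows 0 and 1 point in the same direction, so they share a bucket.
vectors = np.random.default_rng(1).normal(size=(50, 16))
vectors[1] = 3.0 * vectors[0]
buckets = lsh_buckets(vectors)
print(len(buckets), "non-empty buckets")
```

Vectors separated by a small angle tend to fall into the same bucket, so each bucket can serve as one of the "piles" to be clustered separately, without ever comparing all pairs.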

Take a look at scikit-learn clustering: it has several methods, including DBSCAN.
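
As a small illustration of the scikit-learn interface, the sketch below runs DBSCAN on two synthetic blobs. The data and the values of `eps` and `min_samples` are assumptions chosen for this toy example; they would need tuning for real data, and nothing here shows that DBSCAN copes with 1M documents on the machine described in the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight, well-separated synthetic blobs (assumed data, 2-d for brevity).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(100, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(100, 2))
X = np.vstack([blob_a, blob_b])

# eps is the neighbourhood radius; min_samples is the density threshold.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN labels noise points as -1, so exclude that from the cluster count.
n_clusters = len(set(labels.tolist()) - {-1})
print("clusters found:", n_clusters)
```

Note that DBSCAN produces a flat clustering, not a hierarchy, so it fits the "flat clusters first" step rather than replacing the tree.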

Added: see also
google-all-pairs-similarity-search, "Algorithms for finding all similar pairs of vectors in sparse vector data", Bayardo et al. 2007
SO: hierarchical-clusterization-heuristics
