Python alternate way to find dendrogram

Question

I have data of dimension 8000x100 and need to cluster these 8000 items. I am mostly interested in the ordering of the items. The code below gives the desired result for small data, but for larger input I keep getting the runtime error "RuntimeError: maximum recursion depth exceeded while getting the str of an object". Is there an alternate way to get the reordered columns from "Z"?

from hcluster import pdist, linkage, dendrogram
import numpy
from numpy.random import rand

x = rand(8,100) # rand(8000,100) gives runtime error
Y = pdist(x)
Z = linkage(Y)
reorderedCol = dendrogram(Z)['ivl']


Traceback: 

>>> from hcluster import pdist, linkage, dendrogram
>>> import numpy
>>> from numpy.random import rand
>>> 

>>> x = rand(8000,100)
>>> Y = pdist(x)
>>> Z = linkage(Y)
>>> reorderedCol = dendrogram(Z)['ivl']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2062, in dendrogram
    link_color_func=link_color_func)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2342, in _dendrogram_calculate_info
    link_color_func=link_color_func)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2342, in _dendrogram_calculate_info
    link_color_func=link_color_func)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2342, in _dendrogram_calculate_info

...
...

  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2311, in _dendrogram_calculate_info
    link_color_func=link_color_func)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2209, in _dendrogram_calculate_info
    _append_singleton_leaf_node(Z, p, n, level, lvs, ivl, leaf_label_func, i, labels)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2091, in _append_singleton_leaf_node
    ivl.append(str(int(i)))
RuntimeError: maximum recursion depth exceeded while getting the str of an object
>>> 


Recommended answer

The problem is that a dendrogram is a visualization technique. At 8000 objects it is already pretty much unreadable, which is probably why the implementation was not optimized for inputs of this size.

For larger data sets, I recommend moving away from hierarchical clustering altogether. When implemented with matrix operations it has O(n^3) runtime, although in some cases it can be done in O(n^2). Instead use e.g. OPTICS (Wikipedia), but do not use the OPTICS in Weka or the Python version that is floating around: as far as I can tell, both are incomplete.

I cannot even run dendrogram; I get the error "matplotlib not available. Plot request denied." So it probably does try to visualize the dendrogram, and it may well run out of memory if it puts a lot of effort into optimizing the visualization. By computing the ordering yourself, as I showed in your other question, "Calculate ordering of dendrogram leaves", you should be able to avoid this extra cost.
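The other question is not reproduced here, so the following is only a minimal sketch of what "doing it yourself" could look like, not the code from that answer. It assumes Z uses the usual linkage layout (row i merges the clusters in columns 0 and 1 into a new cluster with id n + i, where ids below n are original items). The function name leaf_order is made up for this example. It walks the tree with an explicit stack instead of recursion, so the depth of the tree does not matter.

```python
def leaf_order(Z):
    """Return the leaf ids of a linkage matrix from left to right.

    Assumes the usual linkage layout: Z has n - 1 rows, and row i merges
    clusters Z[i][0] and Z[i][1] into a new cluster with id n + i.
    Works on a NumPy array or on a plain list of rows.
    """
    n = len(Z) + 1           # n - 1 merges describe n leaves
    order = []
    stack = [2 * n - 2]      # id of the root cluster (the last merge)
    while stack:
        node = stack.pop()
        if node < n:
            # Ids below n are original observations, i.e. leaves.
            order.append(node)
        else:
            left = int(Z[node - n][0])
            right = int(Z[node - n][1])
            # Push right first so that the left child is visited first.
            stack.append(right)
            stack.append(left)
    return order


# Tiny hand-built linkage for 4 items: (2,3) -> 4, (0,4) -> 5, (1,5) -> 6.
Z_small = [[2, 3, 0.1, 2], [0, 4, 0.2, 3], [1, 5, 0.3, 4]]
print(leaf_order(Z_small))
```

With the Z from the question, leaf_order(Z) would take the place of dendrogram(Z)['ivl'], except that it returns integers rather than strings. Visiting the left child before the right one is an assumption about how the dendrogram orders its leaves by default, so it is worth comparing both results on a small input first.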

Is there a reason you are using hcluster instead of scipy.cluster.hierarchy?
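For reference, a sketch of the same pipeline with SciPy, assuming scipy and numpy are installed. scipy.cluster.hierarchy provides leaves_list, which returns the leaf ids in tree order without drawing anything. The array size below is a small stand-in chosen for this example, and the code has not been run on the 8000x100 data from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
x = rng.random((200, 10))    # small stand-in for the 8000x100 data

Y = pdist(x)                 # condensed pairwise distance matrix
Z = linkage(Y)               # default method is single linkage
reordered = leaves_list(Z)   # leaf ids from left to right, no plotting

print(reordered[:10])
```

The result is an integer array rather than the list of string labels that dendrogram(Z)['ivl'] produces, so it can be used directly to index the rows of x.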
