Python alternate way to find dendrogram
Question
I have data of dimension 8000x100. I need to cluster these 8000 items, and I am more interested in the ordering of the items than in the clusters themselves. The code below gives the desired result for small data, but for larger data I keep getting the runtime error "RuntimeError: maximum recursion depth exceeded while getting the str of an object". Is there an alternate way to get the reordered column from "Z"?
from hcluster import pdist, linkage, dendrogram
import numpy
from numpy.random import rand
x = rand(8,100) # rand(8000,100) gives runtime error
Y = pdist(x)
Z = linkage(Y)
reorderedCol = dendrogram(Z)['ivl']
Traceback:
>>> from hcluster import pdist, linkage, dendrogram
>>> import numpy
>>> from numpy.random import rand
>>>
>>> x = rand(8000,100)
>>> Y = pdist(x)
>>> Z = linkage(Y)
>>> reorderedCol = dendrogram(Z)['ivl']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2062, in dendrogram
link_color_func=link_color_func)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2342, in _dendrogram_calculate_info
link_color_func=link_color_func)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2342, in _dendrogram_calculate_info
link_color_func=link_color_func)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2342, in _dendrogram_calculate_info
...
...
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2311, in _dendrogram_calculate_info
link_color_func=link_color_func)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2209, in _dendrogram_calculate_info
_append_singleton_leaf_node(Z, p, n, level, lvs, ivl, leaf_label_func, i, labels)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/hcluster/hierarchy.py", line 2091, in _append_singleton_leaf_node
ivl.append(str(int(i)))
RuntimeError: maximum recursion depth exceeded while getting the str of an object
>>>
Recommended answer
The problem is that a dendrogram is a visualization technique. At 8000 objects it is already pretty much unreadable, which is probably why it wasn't optimized for this size.
For larger data sets, I recommend moving away from any kind of hierarchical clustering (which has O(n^3) runtime when implemented with matrix operations, although in some cases you can do it in O(n^2)) and instead using e.g. OPTICS (Wikipedia). Do not use OPTICS in Weka, or that Python version that is floating around; afaict they are both incomplete!
I cannot even run dendrogram: I get the error "matplotlib not available. Plot request denied". So it probably does actually try to visualize the dendrogram! That may well run out of memory if it puts a lot of effort into optimizing the visualization. By doing it yourself, as I showed you in your other question "Calculate ordering of dendrogram leaves", you should be able to avoid this extra cost.
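To illustrate what "doing it yourself" could look like, here is a minimal sketch that walks the linkage matrix with an explicit stack instead of recursion, so it cannot hit Python's recursion limit. It is not the code from the other question. The function name leaf_order is made up for this example, and it assumes Z follows the usual linkage layout: row i merges the clusters in columns 0 and 1 into a new cluster with id n + i, where n is the number of original items.

```python
def leaf_order(Z):
    """Return leaf ids in left-to-right dendrogram order.

    Assumes Z is a linkage matrix with n - 1 rows, where row i merges
    cluster ids Z[i][0] and Z[i][1] into new cluster id n + i.
    (leaf_order is a hypothetical helper, not part of hcluster or scipy.)
    """
    n = len(Z) + 1              # number of original observations
    order = []
    stack = [2 * n - 2]         # id of the root cluster (the last merge)
    while stack:
        node = stack.pop()
        if node < n:
            # ids below n are original observations, i.e. leaves
            order.append(node)
        else:
            left = int(Z[node - n][0])
            right = int(Z[node - n][1])
            # push right first so that the left child is visited first
            stack.append(right)
            stack.append(left)
    return order


# Small hand-built example: 4 items, merges (2,3) -> 4, (0,4) -> 5, (1,5) -> 6
Z_small = [[2, 3, 0.1, 2], [0, 4, 0.2, 3], [1, 5, 0.3, 4]]
print(leaf_order(Z_small))
```

Because the stack is an ordinary list, the depth of the tree only costs memory, not call-stack frames. That matters for chain-like trees, which single linkage tends to produce on large data.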
Is there a reason you are using hcluster instead of scipy.cluster.hierarchy?
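For comparison, a sketch of the same pipeline with scipy, under the assumption that scipy is installed. It uses scipy.cluster.hierarchy.leaves_list, which returns the leaf ids in left-to-right order without plotting anything. As far as I know it does not recurse in Python, but I have not verified that for every scipy version. The data size here is a small stand-in, not the 8000x100 case from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
x = rng.random((200, 10))    # small stand-in for the 8000x100 data

Y = pdist(x)                 # condensed pairwise distance matrix
Z = linkage(Y)               # default method is 'single', as in the question
order = leaves_list(Z)       # integer leaf ids, left to right; no plotting

reordered = x[order]         # rows of x rearranged into dendrogram order
```

Note that dendrogram(Z)['ivl'] holds the leaf labels as strings, while leaves_list returns integer indices that can be used to index the data directly.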