scikit-learn DBSCAN memory usage


Question

UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implementation to do my clustering rather than scikit-learn's. It can be run from the command line and, with proper indexing, performs this task within a few hours. Use the GUI and small sample datasets to work out the options you want to use and then go to town. Worth looking into. Anyhow, read on for a description of my original problem and some interesting discussion.

I have a dataset with ~2.5 million samples, each with 35 features (floating point values) that I'm trying to cluster. I've been trying to do this with scikit-learn's implementation of DBSCAN, using the Manhattan distance metric and a value of epsilon estimated from some small random samples drawn from the data. So far, so good. (here is the snippet, for reference)

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=40, min_samples=10, metric='cityblock').fit(mydata)

My issue at the moment is that I easily run out of memory. (I'm currently working on a machine with 16 GB of RAM)

My question is, is DBSCAN calculating the pairwise distance matrix on the fly as it runs, and that's what's gobbling up my memory? (2.5 million ^ 2) * 8 bytes is obviously stupidly large, I would understand that. Should I not be using the fit() method? And more generally, is there a way around this issue, or am I generally barking up the wrong tree here?
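For a sense of scale, that back-of-envelope figure works out as follows (illustrative arithmetic only, not scikit-learn code):

```python
# Size of a dense float64 pairwise-distance matrix for ~2.5 million samples
n_samples = 2_500_000
bytes_per_float = 8  # float64

matrix_bytes = n_samples ** 2 * bytes_per_float
print(f"{matrix_bytes / 1e12:.0f} TB")  # 50 TB -- far beyond any machine's RAM
```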

Apologies if the answer winds up being obvious. I've been puzzling over this for a few days. Thanks!

Addendum: Also if anyone could explain the difference between fit(X) and fit_predict(X) to me more explicitly I'd also appreciate that--I'm afraid I just don't quite get it.
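For what it's worth, the two methods run the same clustering; fit() stores the result on the estimator while fit_predict() additionally returns the labels. A minimal illustration on toy data (assuming a reasonably recent scikit-learn):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Six 1-D points forming two obvious groups
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])

# fit() runs the clustering and stores the result on the estimator...
db = DBSCAN(eps=0.5, min_samples=2).fit(X)
labels_via_fit = db.labels_

# ...while fit_predict() does the same clustering and also returns labels_.
labels_via_fit_predict = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

print(labels_via_fit)  # [0 0 0 1 1 1]
print((labels_via_fit == labels_via_fit_predict).all())  # True
```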

Addendum #2: To be sure, I just tried this on a machine with ~550 GB of RAM and it still blew up, so I feel like DBSCAN is likely trying to make a pairwise distance matrix or something I clearly don't want it to do. I guess now the big question is how to stop that behavior, or find other methods that might suit my needs more. Thanks for bearing with me here.

Addendum #3(!): I forgot to attach the traceback, here it is,

Traceback (most recent call last):
  File "tDBSCAN.py", line 34, in <module>
    db = DBSCAN(eps=float(sys.argv[2]), min_samples=10, metric='cityblock').fit(mydata)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/base.py", line 329, in fit_predict
    self.fit(X)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 186, in fit
    **self.get_params())
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 69, in dbscan
    D = pairwise_distances(X, metric=metric)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 651, in pairwise_distances
    return func(X, Y, **kwds)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 237, in manhattan_distances
    D = np.abs(X[:, np.newaxis, :] - Y[np.newaxis, :, :])
MemoryError


Answer

The problem apparently is a non-standard DBSCAN implementation in scikit-learn.

DBSCAN does not need a distance matrix. The algorithm was designed around using a database that can accelerate a regionQuery function, and return the neighbors within the query radius efficiently (a spatial index should support such queries in O(log n)).
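To see what such an accelerated regionQuery looks like in practice, scikit-learn's own NearestNeighbors exposes radius queries backed by a ball tree or KD-tree; the data and radius below are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((1000, 35))

# Build a ball-tree index once; each radius query then avoids the
# full O(n^2) pairwise distance scan.
nn = NearestNeighbors(radius=10.0, metric='cityblock',
                      algorithm='ball_tree').fit(X)

# regionQuery for point 0: indices of all points within the radius
neighbors = nn.radius_neighbors(X[:1], return_distance=False)[0]
print(len(neighbors))
```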

The implementation in scikit however, apparently, computes the full O(n^2) distance matrix, which comes at a cost both memory-wise and runtime-wise.

So I see two options:


1. You may want to try the DBSCAN implementation in ELKI instead, which, when used with an R*-tree index, is usually substantially faster than a naive implementation.

2. Otherwise, you may want to reimplement DBSCAN, as the implementation in scikit apparently isn't too good. Don't be scared of that: DBSCAN is really simple to implement yourself. The trickiest part of a good DBSCAN implementation is actually the regionQuery function. If you can get this query fast, DBSCAN will be fast. And you can actually reuse this function for other algorithms, too.
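A rough sketch of that idea follows: a simplified, unoptimized DBSCAN built on a KD-tree regionQuery (not the reference implementation, and the blob data is invented for the smoke test):

```python
import numpy as np
from scipy.spatial import cKDTree

def dbscan(X, eps, min_samples):
    """Toy DBSCAN: -1 marks noise, 0..k mark clusters.

    The KD-tree plays the role of regionQuery; passing p=1 to
    query_ball_point would give the Manhattan metric from the
    question (Euclidean is used here for brevity).
    """
    tree = cKDTree(X)
    labels = np.full(len(X), -1)
    visited = np.zeros(len(X), dtype=bool)
    cluster = -1
    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = tree.query_ball_point(X[i], eps)
        if len(neighbors) < min_samples:
            continue  # not a core point; may still become a border point later
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbors = tree.query_ball_point(X[j], eps)
                if len(j_neighbors) >= min_samples:
                    seeds.extend(j_neighbors)  # j is a core point: keep expanding
            if labels[j] == -1:
                labels[j] = cluster  # claim unlabeled/noise points as members
    return labels

# Smoke test: two well-separated blobs
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
print(dbscan(X, eps=1.0, min_samples=5))
```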

Update: by now, sklearn no longer computes a distance matrix and can, e.g., use a kd-tree index. However, because of "vectorization" it will still precompute the neighbors of every point, so the memory usage of sklearn for large epsilon is O(n²), whereas to my understanding the version in ELKI will only use O(n) memory. So if you run out of memory, choose a smaller epsilon and/or try ELKI.
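In that spirit, with a recent scikit-learn you can at least make the tree index explicit and keep eps tight; the data and parameter values below are invented for illustration, but algorithm= and metric= are real DBSCAN options:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight, well-separated 35-dimensional blobs (synthetic stand-in data)
rng = np.random.default_rng(0)
mydata = np.concatenate([rng.normal(0, 0.01, (500, 35)),
                         rng.normal(5, 0.01, (500, 35))])

# A tree-based neighbor search avoids the dense distance matrix, but the
# neighborhoods of all points are still materialized up front, so memory
# still grows with eps; keep eps as small as the data allows.
db = DBSCAN(eps=1.0, min_samples=10, metric='cityblock',
            algorithm='ball_tree').fit(mydata)
print(np.unique(db.labels_))  # [0 1]
```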

