In scikit-learn, can DBSCAN use a sparse matrix?


Question

I got a MemoryError when running the DBSCAN algorithm from scikit-learn. My data is a binary matrix of about 20000 x 10000.

(Maybe DBSCAN is not suitable for such a matrix. I'm a beginner in machine learning. I just want a clustering method that doesn't need the number of clusters to be specified up front.)

Anyway, I found sparse matrices and scikit's feature extraction:

http://scikit-learn.org/dev/modules/feature_extraction.html http://docs.scipy.org/doc/scipy/reference/sparse.html

But I still have no idea how to use them. DBSCAN's documentation says nothing about sparse matrices. Are they not allowed?

If anyone knows how to use a sparse matrix with DBSCAN, please tell me. Or you could suggest a more suitable clustering method.

Recommended answer

The scikit implementation of DBSCAN is, unfortunately, very naive. It needs to be rewritten to take indexing (ball trees etc.) into account.

As of now, it will apparently insist on computing a complete distance matrix, which wastes a lot of memory.

May I suggest that you just reimplement DBSCAN yourself? It's fairly easy, and good pseudocode exists, e.g. on Wikipedia and in the original publication. It should be just a few lines, and you can then easily take advantage of your data representation. For example, if you already have a similarity graph in a sparse representation, it's usually fairly trivial to do a "range query" (i.e. use only the edges that satisfy your distance threshold).
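As a rough illustration of that suggestion, here is a minimal sketch of DBSCAN driven purely by precomputed neighbor lists. It is not the scikit-learn implementation. The function names `range_query_lists` and `dbscan_from_neighbors`, the `(i, j, distance)` edge format, and the toy data are all assumptions made up for this example. The "range query" is just a filter over the sparse edge list, so no full distance matrix is ever built.

```python
from collections import deque

NOISE = -1


def range_query_lists(n_points, edges, eps):
    """Keep only edges with distance <= eps; return one neighbor list per point.

    `edges` is a hypothetical sparse format: an iterable of (i, j, distance).
    """
    neighbors = [[] for _ in range(n_points)]
    for i, j, dist in edges:
        if dist <= eps:
            neighbors[i].append(j)
            neighbors[j].append(i)
    return neighbors


def dbscan_from_neighbors(neighbors, min_pts):
    """Label points with cluster ids (0, 1, ...) or NOISE.

    neighbors[p] lists the points within eps of p, excluding p itself,
    so the point counts itself via the "+ 1" below.
    """
    labels = [None] * len(neighbors)  # None means "not visited yet"
    cluster_id = -1
    for p in range(len(neighbors)):
        if labels[p] is not None:
            continue
        if len(neighbors[p]) + 1 < min_pts:
            labels[p] = NOISE  # may later be relabeled as a border point
            continue
        cluster_id += 1
        labels[p] = cluster_id
        queue = deque(neighbors[p])
        while queue:
            q = queue.popleft()
            if labels[q] == NOISE:
                labels[q] = cluster_id  # border point: joins, is not expanded
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            if len(neighbors[q]) + 1 >= min_pts:
                queue.extend(neighbors[q])  # q is a core point: keep growing
    return labels


# Toy sparse graph: two tight groups, one long edge between them, one isolated point.
edges = [(0, 1, 0.2), (1, 2, 0.3), (0, 2, 0.4),
         (3, 4, 0.1), (4, 5, 0.2), (2, 3, 0.9)]
labels = dbscan_from_neighbors(range_query_lists(7, edges, eps=0.5), min_pts=3)
print(labels)
```

Memory use here scales with the number of edges below the threshold rather than with the square of the number of points, which is the point of exploiting the sparse representation.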

There is an issue on the scikit-learn GitHub where they talk about improving the implementation. A user reports that a version using a ball tree is 50x faster (which doesn't surprise me; I've seen similar speedups with indexes before, and the gap will likely become more pronounced as the data set grows).

Update: the DBSCAN version in scikit-learn has received substantial improvements since this answer was written.
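With a recent scikit-learn, a SciPy sparse matrix can be passed to `DBSCAN` directly. The following is a minimal sketch under that assumption. The toy binary data and the values of `eps` and `min_samples` are chosen only for illustration, and the default Euclidean metric may not be the best choice for real binary data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Small binary matrix standing in for the 20000 x 10000 data in the question.
dense = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 1],
])
X = csr_matrix(dense)  # only the nonzero entries are stored

# eps and min_samples are illustrative values, not recommendations.
model = DBSCAN(eps=1.1, min_samples=2).fit(X)
labels = model.labels_  # -1 marks noise points
print(labels)
```

Rows that differ in at most one position end up in the same cluster here, and the last row, which is far from everything else, is labeled as noise.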
