kNN with big sparse matrices in Python


Problem description

I have two large sparse matrices:

In [3]: trainX
Out[3]: 
<6034195x755258 sparse matrix of type '<type 'numpy.float64'>'
        with 286674296 stored elements in Compressed Sparse Row format>

In [4]: testX
Out[4]: 
<2013337x755258 sparse matrix of type '<type 'numpy.float64'>'
        with 95423596 stored elements in Compressed Sparse Row format>

About 5 GB RAM in total to load. Note these matrices are HIGHLY sparse (0.0062% occupied).

For each row in testX, I want to find the Nearest Neighbor in trainX and return its corresponding label, found in trainY. trainY is a list with the same length as trainX and has many many classes. (A class is made up of 1-5 separate labels, each label is one of 20,000, but the number of classes is not relevant to what I am trying to do right now.)

I am using sklearn's KNN algorithm to do this:

from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(trainX, trainY)
clf.predict(testX[0])

Even predicting for 1 item of testX takes a while (i.e. something like 30-60 secs, but if you multiply by 2 million, it becomes pretty much impossible). My laptop with 16GB of RAM starts to swap a bit, but does manage to complete for 1 item in testX.

My question is, how can I do this so it will finish in a reasonable time? Say, one night on a large EC2 instance? Would just having more RAM and preventing the swapping speed it up enough (my guess is no)? Maybe I can somehow make use of the sparsity to speed up the calculation?

Thanks.
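(Editor's note: one cheap mitigation, a hypothetical sketch rather than anything from the original post, is to call predict on whole chunks of testX instead of one row at a time, so sklearn's per-call overhead and the distance computation are amortised over many queries. The shapes below are toy stand-ins for the real matrices, and predict_in_chunks is a name invented here.)

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import KNeighborsClassifier

def predict_in_chunks(clf, X, chunk_size=10000):
    """Run predict on whole slices of X so the per-call overhead is
    paid once per chunk instead of once per row."""
    preds = [clf.predict(X[start:start + chunk_size])
             for start in range(0, X.shape[0], chunk_size)]
    return np.concatenate(preds)

# Toy stand-ins (hypothetical sizes) for trainX/trainY/testX above.
rng = np.random.RandomState(0)
trainX = sp.random(500, 100, density=0.05, format='csr', random_state=rng)
trainY = rng.randint(0, 3, size=500)
testX = sp.random(50, 100, density=0.05, format='csr', random_state=rng)

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(trainX, trainY)
labels = predict_in_chunks(clf, testX, chunk_size=20)
```

This does not change the asymptotic cost, but it replaces two million tiny queries with a few hundred large batched ones.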

Recommended answer

Classic kNN data structures such as the KD tree used in sklearn become very slow when the dimension of the data increases. For very high-dimensional problems it is advisable to switch algorithm class and use approximate nearest neighbour (ANN) methods, which sklearn seems to be lacking, unfortunately. See the links below for papers on the algorithms and the theory of why approximate nearest neighbours are so much faster in these cases.
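(Editor's note: before reaching for an external ANN library, one way to exploit the sparsity within sklearn itself is a brute-force search with a metric such as cosine, where the distance computation reduces to sparse matrix products. A minimal sketch with toy data standing in for the real matrices:)

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins (hypothetical sizes) for the matrices in the question.
rng = np.random.RandomState(0)
trainX = sp.random(1000, 500, density=0.05, format='csr', random_state=rng)
trainY = rng.randint(0, 5, size=1000)
testX = sp.random(20, 500, density=0.05, format='csr', random_state=rng)

# algorithm='brute' keeps the data sparse: with metric='cosine' the
# search becomes sparse matrix products, instead of attempting to
# build a tree in hundreds of thousands of dimensions.
nn = NearestNeighbors(n_neighbors=1, algorithm='brute', metric='cosine')
nn.fit(trainX)

# Query all test rows in one call so the work is one big, very sparse
# matrix product rather than millions of tiny ones.
dist, idx = nn.kneighbors(testX)
pred = trainY[idx.ravel()]
```

This is still exact (not approximate) search, so it remains linear in the training set size, but it never densifies the 755k-dimensional vectors.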

  • A well-known ANN library in the C++ world, widely used in Computer Vision for nearest neighbors in feature descriptor spaces, is FLANN. The homepage says it contains Python bindings (I have never worked with them).

Another popular alternative is the ANN library with Python wrapper here, although the newer FLANN seems to be more popular at the moment.

See also this answer (but some links are dead).

One caveat: your data seems to be very high-dimensional - I don't know how these libraries will perform for you. They should still beat sklearn.
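(Editor's note: if the dimensionality itself is the obstacle for such libraries, one option the answer does not mention is to randomly project the sparse vectors down to a few hundred dense dimensions first, which most ANN libraries can handle. A hedged sketch using sklearn's SparseRandomProjection, with a toy matrix standing in for the real one:)

```python
import numpy as np
import scipy.sparse as sp
from sklearn.random_projection import SparseRandomProjection

# Toy stand-in (hypothetical sizes) for the 6M x 755k training matrix.
rng = np.random.RandomState(0)
trainX = sp.random(1000, 5000, density=0.005, format='csr', random_state=rng)

# Johnson-Lindenstrauss-style random projection: pairwise distances
# are approximately preserved, so nearest neighbours mostly survive
# the reduction from very many sparse dimensions to a few hundred.
proj = SparseRandomProjection(n_components=256, random_state=0)
lowdim = proj.fit_transform(trainX)   # still sparse by default
dense = np.asarray(lowdim.todense())  # densify for ANN libraries
```

The projected matrix is small enough to densify, so it can then be fed to FLANN or a similar ANN index.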
