kNN with big sparse matrices in Python


Problem description

I have two large sparse matrices:

In [3]: trainX
Out[3]: 
<6034195x755258 sparse matrix of type '<type 'numpy.float64'>'
        with 286674296 stored elements in Compressed Sparse Row format>

In [4]: testX
Out[4]: 
<2013337x755258 sparse matrix of type '<type 'numpy.float64'>'
        with 95423596 stored elements in Compressed Sparse Row format>

About 5 GB RAM in total to load. Note these matrices are HIGHLY sparse (0.0062% occupied).

For each row in testX, I want to find the Nearest Neighbor in trainX and return its corresponding label, found in trainY. trainY is a list with the same length as trainX and has many many classes. (A class is made up of 1-5 separate labels, each label is one of 20,000, but the number of classes is not relevant to what I am trying to do right now.)

I am using sklearn's KNN algorithm to do this:

from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(trainX, trainY)
clf.predict(testX[0])

Even predicting for 1 item of testX takes a while (i.e. something like 30-60 secs, but if you multiply by 2 million, it becomes pretty much impossible). My laptop with 16GB of RAM starts to swap a bit, but does manage to complete for 1 item in testX.

My question is, how can I do this so it will finish in a reasonable time? Say, one night on a large EC2 instance? Would just having more RAM and preventing the swapping speed it up enough (my guess is no)? Maybe I can somehow make use of the sparsity to speed up the calculation?

Thanks.
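(Editor's note: one cheap mitigation, a hypothetical sketch rather than anything from the original post, is to call predict on whole chunks of testX instead of one row at a time, so sklearn's per-call overhead and the distance computation are amortised over many queries. The shapes below are toy stand-ins for the real matrices, and predict_in_chunks is a name invented here.)

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import KNeighborsClassifier

def predict_in_chunks(clf, X, chunk_size=10000):
    """Run predict on whole slices of X so the per-call overhead is
    paid once per chunk instead of once per row."""
    preds = [clf.predict(X[start:start + chunk_size])
             for start in range(0, X.shape[0], chunk_size)]
    return np.concatenate(preds)

# Toy stand-ins (hypothetical sizes) for trainX/trainY/testX above.
rng = np.random.RandomState(0)
trainX = sp.random(500, 100, density=0.05, format='csr', random_state=rng)
trainY = rng.randint(0, 3, size=500)
testX = sp.random(50, 100, density=0.05, format='csr', random_state=rng)

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(trainX, trainY)
labels = predict_in_chunks(clf, testX, chunk_size=20)
```

This does not change the asymptotic cost, but it replaces two million tiny queries with a few hundred large batched ones.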

Recommended answer

Classic kNN data structures such as the KD tree used in sklearn become very slow when the dimension of the data increases. For very high-dimensional problems it is advisable to switch algorithm class and use approximate nearest neighbour (ANN) methods, which sklearn seems to be lacking, unfortunately. See the links below for papers on the algorithms and the theory of why approximate nearest neighbours are so much faster in these cases.
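(Editor's note: before reaching for an external ANN library, one way to exploit the sparsity within sklearn itself is a brute-force search with a metric such as cosine, where the distance computation reduces to sparse matrix products. A minimal sketch with toy data standing in for the real matrices:)

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins (hypothetical sizes) for the matrices in the question.
rng = np.random.RandomState(0)
trainX = sp.random(1000, 500, density=0.05, format='csr', random_state=rng)
trainY = rng.randint(0, 5, size=1000)
testX = sp.random(20, 500, density=0.05, format='csr', random_state=rng)

# algorithm='brute' keeps the data sparse: with metric='cosine' the
# search becomes sparse matrix products, instead of attempting to
# build a tree in hundreds of thousands of dimensions.
nn = NearestNeighbors(n_neighbors=1, algorithm='brute', metric='cosine')
nn.fit(trainX)

# Query all test rows in one call so the work is one big, very sparse
# matrix product rather than millions of tiny ones.
dist, idx = nn.kneighbors(testX)
pred = trainY[idx.ravel()]
```

This is still exact (not approximate) search, so it remains linear in the training set size, but it never densifies the 755k-dimensional vectors.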

  • A well-known ANN library in the C++ world, widely used in Computer Vision for nearest neighbors in feature descriptor spaces, is FLANN. The homepage says it contains Python bindings (I have never worked with them).

Another popular alternative is the ANN library with Python wrapper here, although the newer FLANN seems to be more popular at the moment.

See also this answer (but some links are dead).

One caveat: your data seems to be very high-dimensional - I don't know how these libraries will perform for you. They should still beat sklearn.
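(Editor's note: if the dimensionality itself is the obstacle for such libraries, one option the answer does not mention is to randomly project the sparse vectors down to a few hundred dense dimensions first, which most ANN libraries can handle. A hedged sketch using sklearn's SparseRandomProjection, with a toy matrix standing in for the real one:)

```python
import numpy as np
import scipy.sparse as sp
from sklearn.random_projection import SparseRandomProjection

# Toy stand-in (hypothetical sizes) for the 6M x 755k training matrix.
rng = np.random.RandomState(0)
trainX = sp.random(1000, 5000, density=0.005, format='csr', random_state=rng)

# Johnson-Lindenstrauss-style random projection: pairwise distances
# are approximately preserved, so nearest neighbours mostly survive
# the reduction from very many sparse dimensions to a few hundred.
proj = SparseRandomProjection(n_components=256, random_state=0)
lowdim = proj.fit_transform(trainX)   # still sparse by default
dense = np.asarray(lowdim.todense())  # densify for ANN libraries
```

The projected matrix is small enough to densify, so it can then be fed to FLANN or a similar ANN index.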
