Handling large dense matrices in python

Question

Basically, what is the best way to go about storing and using dense matrices in Python?

I have a project that generates similarity metrics between every pair of items in an array.

Each item is a custom class, and stores a pointer to the other class and a number representing its "closeness" to that class.

Right now, it works brilliantly up to about ~8000 items, after which it fails with an out-of-memory error.
Basically, if you assume that each comparison uses ~30 bytes (which seems accurate based on testing) to store the similarity, that means the total required memory is:
numItems^2 * itemSize = Memory
So the memory usage grows quadratically with the number of items.
In my case, the memory size is ~30 bytes per link, so:
8000 * 8000 * 30 = 1,920,000,000 bytes, or 1.9 GB
which is right at the memory limit for a single process.
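
For concreteness, that back-of-the-envelope estimate is easy to reproduce; the ~30 bytes per link is the figure measured above:

def matrix_memory_bytes(num_items, bytes_per_link=30):
    # Full n x n matrix: every ordered pair stores one link.
    return num_items ** 2 * bytes_per_link

print(matrix_memory_bytes(8000))  # 1920000000 bytes, ~1.9 GB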

It seems to me that there has to be a more effective way of doing this. I've looked at memmapping, but just generating the similarity values is already computationally intensive, and bottlenecking it all through a hard drive seems a little ridiculous.
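
For reference, a minimal sketch of what memory-mapping the matrix with numpy.memmap might look like; the file name and dtype are illustrative assumptions, not from the original post:

import numpy as np

# The matrix lives on disk; only the pages actually touched are
# pulled into RAM by the OS. File name and dtype are assumptions.
n = 20000
sim = np.memmap("similarity.dat", dtype=np.uint16, mode="w+", shape=(n, n))

sim[0, 1] = 42  # writes go through the OS page cache
sim.flush()     # force dirty pages out to disk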

Edit
I've looked at numpy and scipy. Unfortunately, they don't support very large arrays either.

>>> np.zeros((20000,20000), dtype=np.uint16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
>>>

Further Edit
Numpy seems to be popular. However, numpy won't really do what I want, at least not without another abstraction layer.

I don't want to store numbers, I want to store references to classes. Numpy supports objects, but that doesn't really address the array size issue. I brought up numpy just as an example of what isn't working.
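
For illustration, a sketch of what an object-dtype numpy array looks like; the Link class here is hypothetical:

import numpy as np

# Hypothetical stand-in for the custom class described above.
class Link:
    def __init__(self, other, closeness):
        self.other = other          # reference to the compared item
        self.closeness = closeness  # similarity score

# dtype=object stores one Python reference per cell, so each cell
# costs a pointer plus the full object behind it; this is why an
# object array does not shrink the overall footprint.
n = 100
links = np.empty((n, n), dtype=object)
links[0, 1] = Link(other="item_1", closeness=0.87)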

Any suggestions?

Edit
Well, I wound up just rewriting all the logic so it no longer stores any redundant values, reducing the memory usage from n^2 stored links to n*(n-1)/2.

Basically, this whole affair is a version of the handshake problem, so I've switched from storing all links to storing only a single copy of each link.
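
As a sketch of the idea, the strict upper triangle of the matrix can be packed into a flat list of n*(n-1)/2 entries; the indexing formula below is a standard one, not taken from the original post:

# Map an unordered pair (i, j), i != j, to a slot in a flat list of
# n*(n-1)/2 entries, so each link is stored exactly once.
def pair_index(i, j, n):
    if i > j:
        i, j = j, i  # (i, j) and (j, i) share the same slot
    return i * n - i * (i + 1) // 2 + (j - i - 1)

n = 8000
links = [None] * (n * (n - 1) // 2)  # one slot per unordered pair
links[pair_index(3, 7, n)] = "closeness of items 3 and 7"
assert pair_index(7, 3, n) == pair_index(3, 7, n)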

It's not a complete solution, but I generally don't have any datasets large enough to overflow it, so I think it will work out. PyTables is really interesting, but I don't know any SQL, and there doesn't appear to be any nice traditional slicing or index-based way to access the table data. I may revisit the issue in the future.

Answer

PyTables can handle tables of arbitrary size (millions of columns!) by using memmap and some clever compression.
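
As a hedged sketch of what that can look like in practice (the file name, node name, and compression settings are illustrative assumptions, not from the answer):

import numpy as np
import tables

# A compressed, chunked on-disk array. Blosc and complevel=5 are
# illustrative choices; PyTables supports several compressors.
h5 = tables.open_file("similarity.h5", mode="w")
filters = tables.Filters(complevel=5, complib="blosc")
sim = h5.create_carray(h5.root, "sim", atom=tables.UInt16Atom(),
                       shape=(20000, 20000), filters=filters)

# Reads and writes use numpy-style slicing; only the chunks that
# are actually touched are decompressed into RAM.
sim[0, :] = np.arange(20000, dtype=np.uint16)
row = sim[0, :100]
h5.close()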

Ostensibly, it provides SQL-like performance to Python. It will, however, require significant code modifications.

I'm not going to accept this answer until I've done more thorough vetting to ensure it can actually do what I want, or until someone provides a better solution.
