Handling large dense matrices in python

Question

Basically, what is the best way to go about storing and using dense matrices in Python?

I have a project that generates similarity metrics between every pair of items in an array.

Each item is a custom class, and stores a pointer to the other class and a number representing its "closeness" to that class.

Right now, it works brilliantly up to about ~8000 items, after which it fails with an out-of-memory error.
Basically, if you assume that each comparison uses ~30 bytes (which seems accurate based on testing) to store the similarity, that means the total required memory is:
numItems^2 * itemSize = Memory
So the memory usage grows quadratically with the number of items.
In my case, the memory size is ~30 bytes per link, so:
8000 * 8000 * 30 = 1,920,000,000 bytes, or 1.9 GB
which is right at the memory limit for a single process.
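
For concreteness, that back-of-the-envelope estimate is easy to reproduce; the ~30 bytes per link is the figure measured above:

def matrix_memory_bytes(num_items, bytes_per_link=30):
    # Full n x n matrix: every ordered pair stores one link.
    return num_items ** 2 * bytes_per_link

print(matrix_memory_bytes(8000))  # 1920000000 bytes, ~1.9 GB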

It seems to me that there has to be a more effective way of doing this. I've looked at memmapping, but just generating the similarity values is already computationally intensive, and bottlenecking it all through a hard drive seems a little ridiculous.
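
For reference, a minimal sketch of what memory-mapping the matrix with numpy.memmap might look like; the file name and dtype are illustrative assumptions, not from the original post:

import numpy as np

# The matrix lives on disk; only the pages actually touched are
# pulled into RAM by the OS. File name and dtype are assumptions.
n = 20000
sim = np.memmap("similarity.dat", dtype=np.uint16, mode="w+", shape=(n, n))

sim[0, 1] = 42  # writes go through the OS page cache
sim.flush()     # force dirty pages out to disk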

Edit
I've looked at numpy and scipy. Unfortunately, they don't support very large arrays either.

>>> np.zeros((20000,20000), dtype=np.uint16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
>>>

Further Edit
Numpy seems to be popular. However, numpy won't really do what I want, at least not without another abstraction layer.

I don't want to store numbers, I want to store references to classes. Numpy supports objects, but that doesn't really address the array size issue. I brought up numpy just as an example of what isn't working.
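
For illustration, a sketch of what an object-dtype numpy array looks like; the Link class here is hypothetical:

import numpy as np

# Hypothetical stand-in for the custom class described above.
class Link:
    def __init__(self, other, closeness):
        self.other = other          # reference to the compared item
        self.closeness = closeness  # similarity score

# dtype=object stores one Python reference per cell, so each cell
# costs a pointer plus the full object behind it; this is why an
# object array does not shrink the overall footprint.
n = 100
links = np.empty((n, n), dtype=object)
links[0, 1] = Link(other="item_1", closeness=0.87)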

Any suggestions?

Edit
Well, I wound up just rewriting all the logic so it no longer stores any redundant values, reducing the memory usage from n^2 stored links to n*(n-1)/2.

Basically, this whole affair is a version of the handshake problem, so I've switched from storing all links to storing only a single copy of each link.
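
As a sketch of the idea, the strict upper triangle of the matrix can be packed into a flat list of n*(n-1)/2 entries; the indexing formula below is a standard one, not taken from the original post:

# Map an unordered pair (i, j), i != j, to a slot in a flat list of
# n*(n-1)/2 entries, so each link is stored exactly once.
def pair_index(i, j, n):
    if i > j:
        i, j = j, i  # (i, j) and (j, i) share the same slot
    return i * n - i * (i + 1) // 2 + (j - i - 1)

n = 8000
links = [None] * (n * (n - 1) // 2)  # one slot per unordered pair
links[pair_index(3, 7, n)] = "closeness of items 3 and 7"
assert pair_index(7, 3, n) == pair_index(3, 7, n)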

It's not a complete solution, but I generally don't have any datasets large enough to overflow it, so I think it will work out. PyTables is really interesting, but I don't know any SQL, and there doesn't appear to be any nice traditional slicing or index-based way to access the table data. I may revisit the issue in the future.

Answer

PyTables can handle tables of arbitrary size (millions of columns!) by using memmap and some clever compression.
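
As a hedged sketch of what that can look like in practice (the file name, node name, and compression settings are illustrative assumptions, not from the answer):

import numpy as np
import tables

# A compressed, chunked on-disk array. Blosc and complevel=5 are
# illustrative choices; PyTables supports several compressors.
h5 = tables.open_file("similarity.h5", mode="w")
filters = tables.Filters(complevel=5, complib="blosc")
sim = h5.create_carray(h5.root, "sim", atom=tables.UInt16Atom(),
                       shape=(20000, 20000), filters=filters)

# Reads and writes use numpy-style slicing; only the chunks that
# are actually touched are decompressed into RAM.
sim[0, :] = np.arange(20000, dtype=np.uint16)
row = sim[0, :100]
h5.close()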

Ostensibly, it provides SQL-like performance to Python. It will, however, require significant code modifications.

I'm not going to accept this answer until I've done more thorough vetting to ensure it can actually do what I want, or until someone provides a better solution.
