Working with big data in python and numpy, not enough ram, how to save partial results on disc?

Problem description

I am trying to implement algorithms for 1000-dimensional data with 200k+ datapoints in python. I want to use numpy, scipy, sklearn, networkx, and other useful libraries. I want to perform operations such as pairwise distance between all of the points and do clustering on all of the points. I have implemented working algorithms that do what I want with reasonable complexity, but when I try to scale them to all of my data I run out of RAM. Of course I do: creating the matrix of pairwise distances for 200k+ data points takes a lot of memory.
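
For a sense of scale, a dense float64 matrix of all pairwise distances would need roughly:

n = 200000
print(n * n * 8 / 1e9)   # ~320 GB for the full pairwise-distance matrix, far beyond typical RAM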

Here comes the catch: I would really like to do this on crappy computers with low amounts of RAM.

Is there a feasible way for me to make this work without the constraints of low RAM? That it will take a much longer time is really not a problem, as long as the time requirements don't go to infinity!

I would like to be able to put my algorithms to work and then come back an hour or five later and not have them stuck because they ran out of RAM! I would like to implement this in python, and be able to use the numpy, scipy, sklearn, and networkx libraries. I would like to be able to calculate the pairwise distances between all my points, etc.

Is this feasible? And how would I go about it, what can I start to read up on?

Recommended answer

Using numpy.memmap you create arrays directly mapped into a file:

import numpy
a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000,1000))
# here you will see a 762MB file created in your working directory    
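
As a quick check of where that number comes from: the file size follows directly from the shape and dtype, 200000 × 1000 float32 values at 4 bytes each.

print(200000 * 1000 * 4 / 2**20)   # ~762.9 MiB, matching the file created above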

You can treat it as a conventional array: a += 1000.
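
Changes made this way live in memory pages backed by the file; to make sure anything still pending is actually written out (for instance before another process reads the file), you can call the array's flush method:

a += 1000    # ordinary ndarray arithmetic, backed by the file
a.flush()    # push pending changes out to test.mymemmap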

It is even possible to map more arrays onto the same file, controlling them from separate sources if needed. But I've experienced some tricky things here. To open the full array you first have to "close" the previous one, using del:

del a    
b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(200000,1000))

But opening only part of the array makes it possible to control both views at the same time:

a = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(200000,1000))  # reopen the full view
b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000))
b[1,5] = 123456.
print(a[1,5])
#123456.0

Great! a was changed together with b. And the changes are already written to disk.

The other important thing worth mentioning is the offset. Suppose you don't want the first 2 rows in b, but rows 150000 and 150001.

b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000),
                 offset=150000*1000*32//8)   # offset is in bytes, so keep it an integer
b[1,2] = 999999.
print(a[150001,2])
#999999.0

Now you can access and update any part of the array in simultaneous operations. Note the byte size going into the offset calculation: for 'float64' this example would be 150000*1000*64/8.
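
If you'd rather not hard-code the byte size, here is a small sketch of the same offset derived from the dtype itself, continuing the snippets above (itemsize is 4 for float32 and 8 for float64):

row_offset = 150000                     # first row we want to map
dtype = numpy.dtype('float32')
b = numpy.memmap('test.mymemmap', dtype=dtype, mode='r+', shape=(2,1000),
                 offset=row_offset * 1000 * dtype.itemsize)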

Other references:

The numpy.memmap documentation.
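
To tie this back to the original question, one possible pattern is to build the pairwise-distance matrix block by block, writing each block into a memmapped output so that only one slab of rows is ever held in RAM. This is only a sketch: the file names are made up, the chunk size needs tuning to the machine, and it assumes the points are already stored on disk and that sklearn's pairwise_distances is an acceptable distance routine.

import numpy
from sklearn.metrics import pairwise_distances

n_points, n_dims = 200000, 1000
chunk = 500    # rows per block; each block is roughly chunk x n_points values in RAM

# hypothetical input file holding the data points, mapped read-only
X = numpy.memmap('points.mymemmap', dtype='float32', mode='r', shape=(n_points, n_dims))

# output distance matrix lives on disk (~160 GB as float32), never fully in RAM
D = numpy.memmap('distances.mymemmap', dtype='float32', mode='w+', shape=(n_points, n_points))

for start in range(0, n_points, chunk):
    stop = min(start + chunk, n_points)
    # distances from one block of rows to all points; only this block is materialized in memory
    D[start:stop] = pairwise_distances(X[start:stop], X)
D.flush()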
