Working with big data in Python and numpy, not enough RAM, how to save partial results on disk?


Problem description

I am trying to implement algorithms for 1000-dimensional data with 200k+ datapoints in Python. I want to use numpy, scipy, sklearn, networkx and other useful libraries. I want to perform operations such as computing the pairwise distances between all of the points and clustering all of the points. I have implemented working algorithms that do what I want with reasonable complexity, but when I try to scale them to all of my data I run out of RAM. Of course I do; creating the matrix of pairwise distances for 200k+ points takes a lot of memory.

Here comes the catch: I would really like to do this on crappy computers with low amounts of RAM.

Is there a feasible way to make this work in spite of the low-RAM constraint? That it will take much longer is really not a problem, as long as the running time doesn't go to infinity!

I would like to be able to put my algorithms to work, come back an hour or five later, and not find them stuck because they ran out of RAM! I would like to implement this in Python and be able to use the numpy, scipy, sklearn and networkx libraries. I would like to be able to calculate the pairwise distances to all my points, etc.

Is this feasible? How would I go about it, and what should I start reading up on?

Best regards // Mesmer

Answer

Using numpy.memmap you create arrays that are mapped directly to a file on disk:

import numpy
a = numpy.memmap('test.mymemmap', dtype='float32', mode='w+', shape=(200000,1000))
# here you will see a 762MB file created in your working directory    

You can treat it as a conventional array: a += 1000.
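A small sketch of what treating it as an ordinary array looks like in practice, reusing the a created above; the values here are only illustrative, and flush() forces any pending changes out to the file:

a[0, :] = numpy.arange(1000)       # write one row
row_means = a[:100].mean(axis=1)   # a reduction over a slice reads only that slice
a.flush()                          # push pending changes out to the file on disk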

It is even possible to map more arrays onto the same file and control it from different places if needed. But I've experienced some tricky things here. To open the full array again you have to "close" the previous one first, using del:

del a    
b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(200000,1000))

But opening only part of the array makes it possible to control both at the same time:

# re-open the full array as a so it can be used alongside the small window b
a = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(200000,1000))
b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000))
b[1,5] = 123456.
print(a[1,5])
#123456.0

Great! a was changed together with b, and the changes are already written to the file on disk.

The other important thing worth commenting on is the offset. Suppose you want to take not the first 2 rows of b, but rows 150000 and 150001.

b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+', shape=(2,1000),
                 offset=150000*1000*32//8)  # offset is given in bytes and must be an integer
b[1,2] = 999999.
print(a[150001,2])
#999999.0

Now you can access and update any part of the array through simultaneous operations. Note the byte size that goes into the offset calculation: for 'float64' this example would use 150000*1000*64/8.
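If you prefer not to hard-code the byte size, you can derive the offset from the dtype's itemsize. A minimal sketch, where row_start and n_cols simply name the numbers from the example above:

n_cols = 1000
row_start = 150000
itemsize = numpy.dtype('float32').itemsize   # 4 bytes (8 for 'float64')
b = numpy.memmap('test.mymemmap', dtype='float32', mode='r+',
                 shape=(2, n_cols), offset=row_start * n_cols * itemsize)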

Additional references:

The numpy.memmap documentation.
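To connect this back to the original problem: the pairwise-distance matrix itself can live in a memmap and be filled block by block, so only a small chunk of points is ever held in RAM. Below is a minimal sketch, not the poster's actual code; it assumes Euclidean distance via scipy.spatial.distance.cdist, an arbitrary chunk size, and a hypothetical points.mymemmap file holding the input points. Note that the full 200000 x 200000 float32 matrix takes roughly 149 GiB on disk.

import numpy
from scipy.spatial.distance import cdist

n_points, n_dims = 200000, 1000
chunk = 1000  # rows/columns of the distance matrix handled per block

# hypothetical memmap with the input points, created beforehand
X = numpy.memmap('points.mymemmap', dtype='float32', mode='r',
                 shape=(n_points, n_dims))
# the full distance matrix lives on disk, never in RAM
D = numpy.memmap('distances.mymemmap', dtype='float32', mode='w+',
                 shape=(n_points, n_points))

for i in range(0, n_points, chunk):
    xi = numpy.asarray(X[i:i + chunk], dtype='float64')    # small block held in RAM
    for j in range(0, n_points, chunk):
        xj = numpy.asarray(X[j:j + chunk], dtype='float64')
        D[i:i + chunk, j:j + chunk] = cdist(xi, xj)         # Euclidean by default
D.flush()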

