Using memmap files for batch processing


Problem description

I have a huge dataset on which I wish to perform PCA. I am limited by RAM and the computational efficiency of PCA, so I shifted to using iterative PCA.

Dataset size: (140000, 3504)

The documentation states: "This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory."

This is really good, but I am unsure how to take advantage of it.

I tried loading one memmap, hoping it would be accessed in chunks, but my RAM blew up. My code below ends up using a lot of RAM:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(copy=False)
X_train = clf.fit_transform(ut)

When I say "my RAM blew up", the traceback I see is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

How can I improve this without compromising on accuracy by reducing the batch size?

My ideas for diagnosing this:

I looked at the sklearn source code, and in the fit() function I can see the following. This makes sense to me, but I am still unsure about what is wrong in my case.

for batch in gen_batches(n_samples, self.batch_size_):
    self.partial_fit(X[batch])
return self
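
For context, sklearn.utils.gen_batches simply yields slice objects covering 0..n_samples in steps of batch_size, and partial_fit is then called on each slice of X. A quick check (my illustration, not from the original post):

from sklearn.utils import gen_batches

# gen_batches(n_samples, batch_size) yields slice objects that index X batch by batch
print(list(gen_batches(12, 5)))
# -> [slice(0, 5, None), slice(5, 10, None), slice(10, 12, None)]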

Worst case, I will have to write my own code for iterative PCA which batch-processes by reading and closing .npy files. But that would defeat the purpose of taking advantage of the hack that is already there.

Edit 2: If somehow I could delete a batch of the memmap file once it has been processed, that would make much more sense.

Edit 3: Ideally, if IncrementalPCA.fit() is just using batches, it should not crash my RAM. I am posting the whole code, just to make sure I am not making a mistake in flushing the memmap completely to disk beforehand.

import numpy as np
import pywt
from sklearn.decomposition import IncrementalPCA

# X_train and y come from earlier in the pipeline
temp_train_data = X_train[1000:]
temp_labels = y[1000:]
# out must be a np.memmap (not np.empty) for the flush() call below to exist;
# the file name here is illustrative
out = np.memmap('out_array.mmap', dtype=np.int64, mode='w+', shape=(200001, 3504))
for index, row in enumerate(temp_train_data):
    actual_index = index + 1000
    # wavelet-transform a flattened window of the preceding 1000 rows plus the current one
    data = X_train[actual_index - 1000:actual_index + 1].ravel()
    __, cd_i = pywt.dwt(data, 'haar')
    out[index] = cd_i          # store the detail coefficients for this window
out.flush()                    # write the buffered memmap contents to disk
pca_obj = IncrementalPCA()
clf = pca_obj.fit(out)

Surprisingly, I note that out.flush doesn't free my memory. Is there a way to use del out to free my memory completely, and then pass just a pointer to the file to IncrementalPCA.fit()?
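
What the edit is asking for would look roughly like the sketch below (my illustration, reusing the illustrative 'out_array.mmap' name from above): flush and delete the writable memmap, then reopen the same file read-only and hand that to fit().

out.flush()
del out                      # drop the only reference to the writable memmap
# reopen the same file read-only; nothing is loaded into RAM at this point
out_ro = np.memmap('out_array.mmap', dtype=np.int64, mode='r', shape=(200001, 3504))
clf = IncrementalPCA().fit(out_ro)

As the answer below explains, this alone does not necessarily prevent the crash, because the dtype conversion inside fit() can still force a full in-memory copy.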

Recommended answer

You have hit a problem with sklearn in a 32-bit environment. I presume you are using np.float16 because you're in a 32-bit environment and you need that to be able to create the memmap object without numpy throwing errors.
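
As a rough check of why (my illustration, not part of the original answer): at the question's shape, float64 needs about 3.7 GiB, which does not fit in the 2-4 GB address space of a 32-bit process, while float16 stays under 1 GiB.

import numpy as np

n_rows, n_cols = 140000, 3504
for dt in (np.float16, np.float32, np.float64):
    size_gib = n_rows * n_cols * np.dtype(dt).itemsize / 2**30
    print(np.dtype(dt).name, round(size_gib, 2), "GiB")
# float16 ~0.91 GiB, float32 ~1.83 GiB, float64 ~3.65 GiB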

In a 64-bit environment (tested with Python 3.3 64-bit on Windows), your code just works out of the box. So, if you have a 64-bit computer available, install 64-bit Python along with 64-bit numpy, scipy and scikit-learn, and you are good to go.
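
If in doubt about which build you are running, a quick check using only the standard library is:

import platform
import struct

print(platform.python_version(), platform.architecture()[0])
print(struct.calcsize("P") * 8, "bit interpreter")  # pointer size: 32 or 64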

Unfortunately, if you cannot do this, there is no easy fix. I have raised an issue on github, but it is not easy to patch. The fundamental problem is that, within the library, if your type is float16, a copy of the array into memory is triggered. The details are below.

So, I hope you have access to a 64-bit environment with plenty of RAM. If not, you will have to split up your array yourself and batch-process it, a rather larger task...
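
If splitting it up yourself does become necessary, a rough sketch of that workaround (my illustration, not code from the original answer; the batch size and n_components are arbitrary here) is to walk over the memmap in row chunks, upcast each chunk yourself, and feed it to partial_fit, so only one batch at a time is ever converted in RAM:

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_rows, n_cols, batch = 140000, 3504, 1000
ut = np.memmap('my_array.mmap', dtype=np.float16, mode='r', shape=(n_rows, n_cols))

ipca = IncrementalPCA(n_components=50)   # n_components chosen only for illustration
for start in range(0, n_rows, batch):
    # only this batch is copied and converted to float32 in memory
    chunk = np.asarray(ut[start:start + batch], dtype=np.float32)
    ipca.partial_fit(chunk)

# transform can be chunked the same way, writing the results wherever needed
X_first_batch = ipca.transform(np.asarray(ut[:batch], dtype=np.float32))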

NB: It was really good to see you go to the source to diagnose the problem :) However, if you look at the line where the code fails (from the Traceback), you will see that the for batch in gen_batches code you found is never reached.

The actual error generated by the OP's code:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000,3504))
clf=IncrementalPCA(copy=False)
X_train=clf.fit_transform(ut)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

The call to check_array uses dtype=np.float, but the original array has dtype=np.float16. Even though the check_array() function defaults to copy=False and passes this on to np.array(), this is ignored (as per the docs) because the requested dtype is different; a copy is therefore made by np.array.
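
The effect being described is easy to see with numpy itself (a small illustration; 'demo.mmap' is just a scratch file, and np.float64 stands in for the old np.float alias): asking for a different dtype forces a fresh in-memory array even when you try to avoid a copy.

import numpy as np

a = np.memmap('demo.mmap', dtype=np.float16, mode='w+', shape=(10, 4))
b = np.asarray(a, dtype=np.float64)   # asarray copies only when a conversion is required
print(np.shares_memory(a, b))         # False: the float64 data was allocated in ordinary RAM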

This could be solved in the IncrementalPCA code by ensuring that the dtype is preserved for arrays with dtype in (np.float16, np.float32, np.float64). However, when I tried that patch, it only pushed the MemoryError further along the chain of execution.
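
For reference, check_array accepts a list of dtypes, in which case no conversion (and hence no converting copy) is performed when the input dtype is already in the list; the patch idea was roughly along these lines (my sketch of the idea, not the actual change, and as noted above it only moves the MemoryError into scipy's SVD):

import numpy as np
from sklearn.utils import check_array

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='r', shape=(140000, 3504))
# with a list of dtypes, check_array keeps float16 as-is instead of converting to float64
X = check_array(ut, dtype=[np.float64, np.float32, np.float16], copy=False)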

The same copying problem occurs when the code calls linalg.svd() from the main scipy code, and this time the error occurs during the call to gesdd(), a wrapped native function from LAPACK. Thus, I do not think there is a way to patch this (at least not an easy way; it would at a minimum require altering code in core scipy).
