Using memmap files for batch processing


Problem description


I have a huge dataset on which I wish to perform PCA. I am limited by RAM and the computational efficiency of PCA, so I shifted to using iterative PCA.

Dataset size: (140000, 3504)

The documentation states that "This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory."

This is really good, but I am unsure how to take advantage of it.

I tried loading one memmap, hoping it would be accessed in chunks, but my RAM blew up. My code below ends up using a lot of RAM:

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000,3504))
clf=IncrementalPCA(copy=False)
X_train=clf.fit_transform(ut)

When I say "my RAM blew", the Traceback I see is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

How can I improve this without compromising on accuracy by reducing the batch size?


My ideas for diagnosing this:

I looked at the sklearn source code, and in the fit() function (source code) I can see the following. This makes sense to me, but I am still unsure what is wrong in my case.

for batch in gen_batches(n_samples, self.batch_size_):
    self.partial_fit(X[batch])
return self
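
For reference, gen_batches just yields slice objects covering the sample range, so X[batch] indexes the (memmapped) array one chunk at a time. A minimal sketch, assuming the gen_batches helper importable from sklearn.utils is the same one used in fit() above:

from sklearn.utils import gen_batches

# gen_batches(n_samples, batch_size) yields slices covering 0..n_samples
for batch in gen_batches(10, 4):
    print(batch)
# slice(0, 4, None)
# slice(4, 8, None)
# slice(8, 10, None)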

Edit: In the worst case I will have to write my own code for iterative PCA, which batch-processes by reading and closing .npy files. But that would defeat the purpose of taking advantage of the hack that is already present.

Edit 2: If I could somehow delete a batch of the memmap file once it has been processed, that would make much more sense.

Edit 3: Ideally, if IncrementalPCA.fit() is just using batches, it should not crash my RAM. I am posting the whole code, just to make sure I am not making a mistake in flushing the memmap completely to disk beforehand.

temp_train_data=X_train[1000:]
temp_labels=y[1000:] 
out = np.empty((200001, 3504), np.int64)
for index,row in enumerate(temp_train_data):
    actual_index=index+1000
    data=X_train[actual_index-1000:actual_index+1].ravel()
    __,cd_i=pywt.dwt(data,'haar')
    out[index] = cd_i
out.flush()
pca_obj=IncrementalPCA()
clf = pca_obj.fit(out)

Surprisingly, I note that out.flush() doesn't free my memory. Is there a way to use del out to free my memory completely, and then pass a pointer to the file to IncrementalPCA.fit()?
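
If out is meant to be a memmap rather than the np.empty above, a hedged sketch of the flush / del / reopen pattern being asked about might look like this ('out.mmap' is a hypothetical filename, not from the original post):

import numpy as np
from sklearn.decomposition import IncrementalPCA

# hypothetical filename; the original snippet used np.empty rather than a memmap
out = np.memmap('out.mmap', dtype=np.int64, mode='w+', shape=(200001, 3504))
# ... fill `out` row by row as in the loop above ...
out.flush()   # push dirty pages to disk; this does not by itself release the mapping
del out       # drop the Python reference so the write-mode mapping can be closed
# reopen read-only and hand the memmap to fit(); note that fit() will still
# upcast and copy internally, which is what the answer below explains
out = np.memmap('out.mmap', dtype=np.int64, mode='r', shape=(200001, 3504))
clf = IncrementalPCA().fit(out)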

Solution

You have hit a problem with sklearn in a 32-bit environment. I presume you are using np.float16 because you're in a 32-bit environment and you need it to be able to create the memmap object without numpy throwing errors.

In a 64-bit environment (tested with Python 3.3 64-bit on Windows), your code just works out of the box. So, if you have a 64-bit computer available, install 64-bit Python and 64-bit numpy, scipy and scikit-learn, and you are good to go.
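
In case it is not obvious which interpreter is installed, one quick way to check whether the running Python is 32-bit or 64-bit (a small sketch using only the standard library):

import struct
print(struct.calcsize("P") * 8)  # size of a pointer in bits: prints 32 or 64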

Unfortunately, if you cannot do this, there is no easy fix. I have raised an issue on github here, but it is not easy to patch. The fundamental problem is that, within the library, if your type is float16, a copy of the array into memory is triggered. The details of this are below.

So, I hope you have access to a 64-bit environment with plenty of RAM. If not, you will have to split up your array yourself and batch process it, a rather large task...
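
A rough sketch of what that splitting could look like, assuming read access to the same my_array.mmap file; the chunk size and n_components are arbitrary choices, and this is untested against the OP's data. partial_fit still upcasts each chunk, but only one small chunk is in RAM at a time:

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_samples, n_features = 140000, 3504
chunk = 1000                                   # must be >= n_components
ut = np.memmap('my_array.mmap', dtype=np.float16, mode='r',
               shape=(n_samples, n_features))

ipca = IncrementalPCA(n_components=50)         # n_components chosen arbitrarily
for start in range(0, n_samples, chunk):
    # upcast only this chunk to float64; the rest of the memmap stays on disk
    batch = np.asarray(ut[start:start + chunk], dtype=np.float64)
    ipca.partial_fit(batch)

Transforming can then be done chunk by chunk with ipca.transform in the same way.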

N.B. It's really good to see you going to the source to diagnose your problem :) However, if you look at the line where the code fails (from the Traceback), you will see that the for batch in gen_batches loop that you found is never reached.


Detailed diagnosis:

The actual error generated by the OP's code:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000,3504))
clf=IncrementalPCA(copy=False)
X_train=clf.fit_transform(ut)

is

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError

The call to check_array (code link) uses dtype=np.float, but the original array has dtype=np.float16. Even though the check_array() function defaults to copy=False and passes this on to np.array(), the flag is ignored (as per the docs) because the requested dtype is different; therefore a copy is made by np.array.
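
A tiny demonstration of that behaviour, assuming a numpy version contemporary with the question (pre-2.0, where copy=False is only a request; newer numpy raises an error instead of copying silently):

import numpy as np

a = np.zeros(5, dtype=np.float16)
# np.float64 is what dtype=np.float resolves to; the dtype differs from float16,
# so a copy is made despite copy=False
b = np.array(a, dtype=np.float64, copy=False)
print(b.dtype)                     # float64
print(np.may_share_memory(a, b))   # False: a new array was allocated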

This could be solved in the IncrementalPCA code by ensuring that the dtype was preserved for arrays with dtype in (np.float16, np.float32, np.float64). However, when I tried that patch, it only pushed the MemoryError further along the chain of execution.

The same copying problem occurs when the code calls linalg.svd() from the main scipy code, and this time the error occurs during the call to gesdd(), a wrapped native function from LAPACK. Thus, I do not think there is a way to patch this (at least not an easy way; it would at minimum require altering code in core scipy).
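
A back-of-the-envelope calculation shows why any full-array upcast is fatal here: a 32-bit process on Windows is limited to roughly 2 GB of addressable memory by default, and the upcast array alone exceeds that.

# size of a 140000 x 3504 matrix at different dtypes
rows, cols = 140000, 3504
for name, bytes_per_item in [("float16", 2), ("float32", 4), ("float64", 8)]:
    gib = rows * cols * bytes_per_item / 2.0**30
    print(name, round(gib, 2), "GiB")
# float16 0.91 GiB
# float32 1.83 GiB
# float64 3.65 GiB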
