Incremental PCA on big data

Question

I just tried using the IncrementalPCA from sklearn.decomposition, but it threw a MemoryError, just like the PCA and RandomizedPCA before it. My problem is that the matrix I am trying to load is too big to fit into RAM. Right now it is stored in an hdf5 database as a dataset of shape ~(1000000, 1000), so I have 1,000,000,000 float32 values. I thought IncrementalPCA loads the data in batches, but apparently it tries to load the entire dataset, which does not help. How is this library meant to be used? Is the hdf5 format the problem?

from sklearn.decomposition import IncrementalPCA
import h5py

db = h5py.File("db.h5","r")
data = db["data"]
IncrementalPCA(n_components=10, batch_size=1).fit(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
    X = check_array(X, dtype=np.float)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
    array = np.atleast_2d(array)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
    ary = asanyarray(ary)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
    arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
MemoryError

Any help is appreciated.

Answer

Your program is probably failing while trying to load the entire dataset into RAM. 32 bits per float32 × 1,000,000 × 1,000 is 3.7 GiB. That can be a problem on machines with only 4 GiB of RAM. To check that this is actually the problem, try creating an array of this size alone:

>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)

If you see a MemoryError, you either need more RAM, or you need to process your dataset one chunk at a time.
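
For reference, the 3.7 GiB figure is just rows × columns × bytes per element (a quick back-of-the-envelope check, using the shape and dtype from your question):

>>> 1000000 * 1000 * 4 / 2.0**30   # rows x cols x bytes per float32, in GiB
3.725290298461914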

With h5py datasets, we should just avoid passing the entire dataset to our methods, and instead pass slices of the dataset, one at a time.
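
The key property we rely on is that indexing an h5py dataset with a slice reads only that part of the file into a regular NumPy array (a small illustration, reusing the 'db.h5' and 'data' names from your question):

import h5py

db = h5py.File("db.h5", "r")
data = db["data"]        # an h5py Dataset object; nothing is read yet
chunk = data[0:1000]     # slicing loads only these 1000 rows from disk
print(chunk.shape)       # (1000, 1000), a plain in-memory numpy array
db.close()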

As I don't have your data, let me start by creating a random dataset of the same size:

import h5py
import numpy as np

# allocate a (1000000, 1000) float32 dataset on disk, then fill it
# 1000 rows at a time so we never hold the whole array in memory
h5 = h5py.File('rand-1Mx1K.h5', 'w')
h5.create_dataset('data', shape=(1000000, 1000), dtype=np.float32)
for i in range(1000):
    h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
h5.close()

It creates a nice 3.8 GiB file.

Now, if we are on Linux, we can limit how much memory is available to our program:

$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152

Now if we try to run your code, we'll get the MemoryError. (press Ctrl-D to quit the new bash session and reset the limit later)
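
If you would rather impose the limit from inside Python instead of the shell, the standard resource module can do something similar (a sketch, Linux-only; note that RLIMIT_AS caps the virtual address space, which is stricter than ulimit -m):

import resource

# limit the address space of the current process to about 2 GiB,
# keeping whatever hard limit is already in place
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, hard))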

Let's try to solve the problem. We'll create an IncrementalPCA object and call its .partial_fit() method many times, providing a different slice of the dataset each time.

import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data'] # it's ok, the dataset is not fetched to memory yet

n = data.shape[0] # how many rows we have in the dataset
chunk_size = 1000 # how many rows we feed to IPCA at a time; should divide n evenly
ipca = IncrementalPCA(n_components=10, batch_size=16)

for i in range(0, n//chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])

It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.
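
Once partial_fit has seen all the chunks, the projection itself can be done the same way, one slice at a time (a sketch that reuses data, n, chunk_size and ipca from the loop above; the output file name is just an example):

# write the 10-dimensional projection of each chunk into a new HDF5 file
out = h5py.File('rand-1Mx1K-reduced.h5', 'w')
reduced = out.create_dataset('data', shape=(n, ipca.n_components), dtype=np.float32)

for i in range(0, n // chunk_size):
    chunk = data[i*chunk_size : (i+1)*chunk_size]
    reduced[i*chunk_size : (i+1)*chunk_size] = ipca.transform(chunk)

out.close()
h5.close()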
