Incremental PCA on big data
Question
I just tried using the IncrementalPCA from sklearn.decomposition, but it threw a MemoryError just like PCA and RandomizedPCA before it. My problem is that the matrix I am trying to load is too big to fit into RAM. Right now it is stored in an hdf5 database as a dataset of shape ~(1000000, 1000), so I have 1,000,000,000 float32 values. I thought IncrementalPCA loads the data in batches, but apparently it tries to load the entire dataset, which does not help. How is this library meant to be used? Is the hdf5 format the problem?
from sklearn.decomposition import IncrementalPCA
import h5py
db = h5py.File("db.h5","r")
data = db["data"]
IncrementalPCA(n_components=10, batch_size=1).fit(data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
X = check_array(X, dtype=np.float)
File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
array = np.atleast_2d(array)
File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
ary = asanyarray(ary)
File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
return array(a, dtype, copy=False, order=order, subok=True)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
MemoryError
Any help is appreciated.
Answer
Your program is probably failing while trying to load the entire dataset into RAM. 32 bits per float32 × 1,000,000 × 1,000 is 3.7 GiB. That can be a problem on machines with only 4 GiB of RAM. To check that this is actually the problem, try creating an array of this size on its own:
>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)
If you see a MemoryError, you either need more RAM, or you need to process your dataset one chunk at a time.
With h5py datasets, we should avoid passing the entire dataset to our methods and instead pass slices of the dataset, one at a time.
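To see why slicing matters, here is a minimal sketch of my own, reusing the db.h5 handle from the question: indexing an h5py Dataset reads only the requested rows from disk, while converting the whole dataset to an array, which is what check_array does in the traceback above, tries to materialize all 3.7 GiB at once.
import h5py
import numpy as np
db = h5py.File("db.h5", "r")
data = db["data"]      # just a handle; nothing is read from disk yet
chunk = data[0:1000]   # reads only rows 0..999 -> a small (1000, 1000) ndarray
# full = np.asanyarray(data)  # what the traceback shows: allocates the full array -> MemoryError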
Since I don't have your data, let me start by creating a random dataset of the same size:
import h5py
import numpy as np
h5 = h5py.File('rand-1Mx1K.h5', 'w')
h5.create_dataset('data', shape=(1000000, 1000), dtype=np.float32)
for i in range(1000):
    # fill the file 1000 rows at a time, so we never hold more than ~8 MB of random data
    h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
h5.close()
It creates a nice 3.8 GiB file.
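As a quick sanity check (my own addition, not part of the original answer), you can confirm the size from Python:
import os
# 1,000,000 x 1,000 float32 values = 4e9 bytes, plus a little HDF5 overhead
print(os.path.getsize('rand-1Mx1K.h5') / 2**30, "GiB")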
Now, if we are on Linux, we can limit how much memory is available to our program:
$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152
Now, if we try to run your code, we'll get a MemoryError. (Press Ctrl-D to quit the new bash session and reset the limit afterwards.)
Let's try to solve the problem. We'll create an IncrementalPCA object and call its .partial_fit() method many times, providing a different slice of the dataset each time.
import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA
h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data'] # it's ok, the dataset is not fetched to memory yet
n = data.shape[0] # how many rows we have in the dataset
chunk_size = 1000 # how many rows we feed to IPCA at a time, the divisor of n
ipca = IncrementalPCA(n_components=10, batch_size=16)
for i in range(0, n//chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.
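Once the model is fitted you will usually also want the projected data; transform can be applied chunk by chunk in the same way. Here is a hedged sketch continuing from the loop above (the reduced buffer and the reuse of chunk_size are my own choices, not part of the original answer):
# project the dataset onto the first 10 components, again one chunk at a time;
# the result is small (1,000,000 x 10 float64 is ~80 MB), so it fits in RAM
reduced = np.empty((n, ipca.n_components_), dtype=np.float64)
for i in range(0, n // chunk_size):
    sl = slice(i * chunk_size, (i + 1) * chunk_size)
    reduced[sl] = ipca.transform(data[sl])
print(reduced.shape)                    # (1000000, 10)
print(ipca.explained_variance_ratio_)   # variance captured by each component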