Incremental PCA on big data
Problem description
I just tried using IncrementalPCA from sklearn.decomposition, but it threw a MemoryError, just like PCA and RandomizedPCA did before. My problem is that the matrix I am trying to load is too big to fit into RAM. Right now it is stored in an hdf5 database as a dataset of shape ~(1000000, 1000), so I have 1,000,000,000 float32 values. I thought IncrementalPCA loads the data in batches, but apparently it tries to load the entire dataset, which does not help. How is this library meant to be used? Is the hdf5 format the problem?
from sklearn.decomposition import IncrementalPCA
import h5py
db = h5py.File("db.h5","r")
data = db["data"]
IncrementalPCA(n_components=10, batch_size=1).fit(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
    X = check_array(X, dtype=np.float)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
    array = np.atleast_2d(array)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
    ary = asanyarray(ary)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
    arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
MemoryError
Thanks for your help.
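Your program is probably failing while trying to load the entire dataset into RAM. The traceback shows this: fit() calls check_array, which asks h5py to convert the whole dataset into a single NumPy array. At 32 bits per float32, 1,000,000 × 1000 values take 3.7 GiB. That can be a problem on machines with only 4 GiB of RAM. To check that this is actually the problem, try creating an array of this size on its own: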
>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)
If you see a MemoryError, you either need more RAM or need to process your dataset one chunk at a time.
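The size estimate above can be checked with a few lines of arithmetic. This is only a sketch: array_size_gib is a hypothetical helper written for this illustration, not part of NumPy. The float64 figure is included because the traceback shows check_array being called with dtype=np.float, which was an alias for 64-bit floats, so a full conversion would likely need about twice as much memory.

```python
import math

import numpy as np


def array_size_gib(shape, dtype):
    """Return the size in GiB of a dense array with this shape and dtype.

    Hypothetical helper for illustration; it allocates nothing.
    """
    n_bytes = math.prod(shape) * np.dtype(dtype).itemsize
    return n_bytes / 2**30


shape = (1_000_000, 1000)
print(f"float32: {array_size_gib(shape, np.float32):.2f} GiB")  # float32: 3.73 GiB
print(f"float64: {array_size_gib(shape, np.float64):.2f} GiB")  # float64: 7.45 GiB
```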
With h5py datasets, we should simply avoid passing the entire dataset to our methods and pass slices of the dataset instead, one at a time.
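As a rough sketch of what "one slice at a time" can look like, the generator below walks over the rows of anything that has a .shape and supports row slicing. The name iter_chunks is made up for this example, and a small NumPy array stands in for the h5py dataset so the snippet runs without a file. Slicing an h5py dataset the same way reads only the requested rows into memory.

```python
import numpy as np


def iter_chunks(dataset, chunk_size):
    """Yield consecutive row blocks of at most chunk_size rows.

    Hypothetical helper: `dataset` can be an h5py dataset or a NumPy array.
    """
    n_rows = dataset.shape[0]
    for start in range(0, n_rows, chunk_size):
        # Only this slice is materialised as an in-memory array
        yield dataset[start:start + chunk_size]


# Small in-memory stand-in for an h5py dataset (assumption for the demo)
data = np.arange(50, dtype=np.float32).reshape(10, 5)
shapes = [chunk.shape for chunk in iter_chunks(data, 4)]
print(shapes)  # [(4, 5), (4, 5), (2, 5)]
```

Note that the last block is shorter when chunk_size does not divide the number of rows.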
As I don't have your data, let me start by creating a random dataset of the same size:
import h5py
import numpy as np
h5 = h5py.File('rand-1Mx1K.h5', 'w')
h5.create_dataset('data', shape=(1000000, 1000), dtype=np.float32)
# Fill the dataset 1000 rows at a time, so only one block is in memory at once
for i in range(1000):
    h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
h5.close()
It creates a nice 3.8 GiB file.
Now, if we are in Linux, we can limit how much memory is available to our program:
$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152
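Now if we try to run your code in this shell, we get the MemoryError. (Press Ctrl-D to quit the new bash session and reset the limit afterwards.) Note that recent Linux kernels may not enforce the resident-set limit set by ulimit -m; if the limit seems to have no effect, ulimit -v, which caps virtual memory, is the usual alternative.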
Let's try to solve the problem. We'll create an IncrementalPCA object and call its .partial_fit() method many times, providing a different slice of the dataset each time.
import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA
h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data']  # this is fine: the dataset is not loaded into memory yet
n = data.shape[0]  # number of rows in the dataset
chunk_size = 1000  # rows fed to IPCA at a time; should divide n, or the trailing rows are skipped
# batch_size is only used by fit(); partial_fit() processes each chunk as given
ipca = IncrementalPCA(n_components=10, batch_size=16)
for i in range(0, n//chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
It seems to be working for me, and if I look at what top reports, the memory allocation stays below 200M.
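Once the model is fitted, the data usually still has to be projected onto the components, and the same chunking idea applies. The sketch below is one possible way to do it, not part of the original answer. It uses a small random in-memory array as a stand-in for h5['data'] so that it runs without the 3.8 GiB file, and the sizes are arbitrary. With the real dataset, the reduced result of 1,000,000 × 10 values should fit in memory comfortably.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Small random stand-in for h5['data'] (assumption for this demo)
rng = np.random.default_rng(0)
data = rng.random((2000, 50)).astype(np.float32)

n = data.shape[0]
chunk_size = 500
ipca = IncrementalPCA(n_components=10)

# First pass: update the model one chunk at a time
for start in range(0, n, chunk_size):
    ipca.partial_fit(data[start:start + chunk_size])

# Second pass: project each chunk and collect the reduced rows
reduced = np.empty((n, ipca.n_components_), dtype=np.float64)
for start in range(0, n, chunk_size):
    chunk = data[start:start + chunk_size]
    reduced[start:start + chunk_size] = ipca.transform(chunk)

print(reduced.shape)  # (2000, 10)
```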