Incremental PCA on big data


Problem description


I just tried using the IncrementalPCA from sklearn.decomposition, but it threw a MemoryError, just like the PCA and RandomizedPCA before it. My problem is that the matrix I am trying to load is too big to fit into RAM. Right now it is stored in an HDF5 database as a dataset of shape ~(1000000, 1000), so I have 1,000,000,000 float32 values. I thought IncrementalPCA loads the data in batches, but apparently it tries to load the entire dataset, which does not help. How is this library meant to be used? Is the HDF5 format the problem?

from sklearn.decomposition import IncrementalPCA
import h5py

db = h5py.File("db.h5","r")
data = db["data"]
IncrementalPCA(n_components=10, batch_size=1).fit(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/decomposition/incremental_pca.py", line 165, in fit
    X = check_array(X, dtype=np.float)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 337, in check_array
    array = np.atleast_2d(array)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/shape_base.py", line 99, in atleast_2d
    ary = asanyarray(ary)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/numpy/core/numeric.py", line 514, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2458)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (-------src-dir-------/h5py/_objects.c:2415)
  File "/software/anaconda/2.3.0/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 640, in __array__
    arr = numpy.empty(self.shape, dtype=self.dtype if dtype is None else dtype)
MemoryError

Thanks for the help.

Solution

Your program is probably failing while trying to load the entire dataset into RAM. At 32 bits per float32 value, 1,000,000 × 1000 values come to 3.7 GiB. That can be a problem on machines with only 4 GiB of RAM. To check that this is actually the problem, try creating an array of this size on its own:

>>> import numpy as np
>>> np.zeros((1000000, 1000), dtype=np.float32)

If you see a MemoryError, you either need more RAM, or you need to process your dataset one chunk at a time.

With h5py datasets we should simply avoid passing the entire dataset to our methods and instead pass slices of the dataset, one at a time.
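To illustrate why this works: indexing an h5py dataset with a slice reads only the requested rows from disk and returns them as an ordinary NumPy array, so chunk-sized slices stay small even when the whole dataset does not fit in RAM. A minimal sketch, reusing the db.h5 file and data dataset names from your code (the 1000-row slice is just an example size):

import h5py

h5 = h5py.File("db.h5", "r")
data = h5["data"]      # lazy handle, nothing is read from disk yet

chunk = data[:1000]    # reads only these 1000 rows into an in-memory numpy.ndarray of shape (1000, 1000)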

As I don't have your data, let me start from creating a random dataset of the same size:

import h5py
import numpy as np
h5 = h5py.File('rand-1Mx1K.h5', 'w')
h5.create_dataset('data', shape=(1000000,1000), dtype=np.float32)
for i in range(1000):
    h5['data'][i*1000:(i+1)*1000] = np.random.rand(1000, 1000)
h5.close()

It creates a nice file of roughly 3.7 GiB.

Now, if we are on Linux, we can limit how much memory is available to our program:

$ bash
$ ulimit -m $((1024*1024*2))
$ ulimit -m
2097152

Now, if we try to run your code, we'll get the MemoryError. (Press Ctrl-D to quit the new bash session and restore the limit afterwards.)
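As a side note (not part of the original answer), a similar cap can be set from inside Python with the standard-library resource module, which avoids the separate bash session; a rough, Linux-only sketch:

import resource

# Limit the process's address space to 2 GiB (soft and hard limit, in bytes);
# allocations beyond this fail, and NumPy turns the failure into a MemoryError.
limit_bytes = 2 * 1024 ** 3
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))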

Let's try to solve the problem. We'll create an IncrementalPCA object and call its .partial_fit() method many times, providing a different slice of the dataset each time.

import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

h5 = h5py.File('rand-1Mx1K.h5', 'r')
data = h5['data'] # it's ok, the dataset is not fetched to memory yet

n = data.shape[0] # how many rows we have in the dataset
chunk_size = 1000 # how many rows we feed to IPCA at a time; must divide n evenly
ipca = IncrementalPCA(n_components=10, batch_size=16)

for i in range(0, n//chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])

It seems to work for me, and if I look at what top reports, the memory allocation stays below 200 MB.
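Once the incremental fit has seen the whole dataset, the same chunked pattern can be used to project the data; a possible follow-up sketch, continuing from the variables above and writing the 10-component projection to a hypothetical reduced.h5 file:

import h5py
import numpy as np

# Assumes ipca, data, n and chunk_size still exist from the fitting loop above.
out = h5py.File('reduced.h5', 'w')                        # hypothetical output file
reduced = out.create_dataset('data', shape=(n, 10), dtype=np.float32)

for i in range(0, n // chunk_size):
    sl = slice(i * chunk_size, (i + 1) * chunk_size)
    reduced[sl] = ipca.transform(data[sl])                # transform one chunk at a time

out.close()
print(ipca.explained_variance_ratio_)                     # fraction of variance kept by the 10 components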

