How to use sklearn's IncrementalPCA partial_fit


Problem description

I've got a rather large dataset that I would like to decompose, but it is too big to load into memory. Researching my options, it seems that sklearn's IncrementalPCA is a good choice, but I can't quite figure out how to make it work.

I can load in the data just fine:

import h5py

f = h5py.File('my_big_data.h5', 'r')   # open the HDF5 file read-only
features = f['data']                   # an h5py dataset; rows stay on disk until sliced

And from this example, it seems I need to decide what size chunks I want to read from it:

num_rows = features.shape[0]   # total number of rows in the dataset
chunk_size = 10                # how many rows at a time to feed ipca

Then I can create my IncrementalPCA, stream the data chunk-by-chunk, and partially fit it (also from the example above):

from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2)
for i in range(0, num_rows // chunk_size):   # note: any leftover rows after the last full chunk are skipped
    ipca.partial_fit(features[i*chunk_size : (i+1)*chunk_size])

This all runs without error, but I'm not sure what to do next. How do I actually do the dimensionality reduction and get a new numpy array that I can manipulate further and save?

EDIT
The code above was for testing on a smaller subset of my data – as @ImanolLuengo correctly points out, it would be way better to use a larger number of dimensions and chunk size in the final code.

Recommended answer

As you guessed, the fitting is done properly, although I would suggest increasing the chunk_size to 100 or 1000 (or even higher, depending on the shape of your data).
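For illustration, here is a minimal sketch of such a fitting loop with a larger chunk size; the handling of the final, smaller chunk is my own addition and not part of the original answer:

from sklearn.decomposition import IncrementalPCA

chunk_size = 1000                              # larger batches make each partial_fit call cheaper per row
ipca = IncrementalPCA(n_components=2)
for start in range(0, num_rows, chunk_size):
    batch = features[start:start + chunk_size]
    if batch.shape[0] >= ipca.n_components:    # partial_fit needs at least n_components rows per batch
        ipca.partial_fit(batch)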

What you have to do now is actually transform the data, again chunk by chunk:

out = my_new_features_dataset   # placeholder: a pre-allocated array or HDF5 dataset of shape N x 2
for i in range(0, num_rows // chunk_size):
    out[i*chunk_size : (i+1)*chunk_size] = ipca.transform(features[i*chunk_size : (i+1)*chunk_size])

That should give you your new transformed features. If you still have too many samples to fit in memory, I would suggest using out as another hdf5 dataset.
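If the transformed features also do not fit in memory, a sketch along these lines streams the result into a second HDF5 dataset (the output file and dataset names below are placeholders of my own choosing):

import h5py

with h5py.File('my_big_data_transformed.h5', 'w') as f_out:
    # pre-allocate an on-disk array of shape (num_rows, n_components)
    out = f_out.create_dataset('data_ipca', shape=(num_rows, ipca.n_components_), dtype='float32')
    for start in range(0, num_rows, chunk_size):
        stop = min(start + chunk_size, num_rows)
        out[start:stop] = ipca.transform(features[start:stop])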

Also, I would argue that reducing a huge dataset to 2 components is probably not a very good idea, but it is hard to say without knowing the shape of your features. I would suggest reducing them to sqrt(features.shape[1]), as that is a decent heuristic, or, pro tip: use ipca.explained_variance_ratio_ to determine the best number of components for the amount of information loss you can afford.
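As a rough illustration of that heuristic (my own sketch, not code from the original answer):

import numpy as np

# square-root heuristic: a first guess at a reasonable number of components
n_comp = int(np.sqrt(features.shape[1]))
ipca = IncrementalPCA(n_components=n_comp)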

As for explained_variance_ratio_, it returns a vector of dimension n_components (the n_components that you pass as a parameter to IPCA) where each value i indicates the fraction of the variance of your original data explained by the i-th new component.

You can follow the procedure in this answer to extract how much information is preserved by the first n components:

>>> print(ipca.explained_variance_ratio_.cumsum())
[ 0.32047581  0.59549787  0.80178824  0.932976    1.        ]

Note: the numbers are fictitious, taken from the answer above, and assume that you have reduced the IPCA to 5 components. The i-th number indicates how much of the original variance is explained by components 0 through i, since it is the cumulative sum of the explained variance ratios.

Thus, what is usually done is to fit your PCA with the same number of components as your original data has features:

ipca = IncrementalPCA(n_components=features.shape[1])

Then, after training on your whole data (with iteration + partial_fit), you can plot explained_variance_ratio_.cumsum() and choose how much variance you are willing to lose. Or do it automatically:

k = np.argmax(ipca.explained_variance_ratio_.cumsum() > 0.9) + 1   # +1 turns the index into a component count

np.argmax returns the first index in the cumsum array where the value is > 0.9; adding 1 turns that index into a count, i.e. the number of PCA components that preserve at least 90% of the variance of the original data.

Then you can tweak the transformation to reflect it:

cs = chunk_size
out = my_new_features_dataset # shape N x k
for i in range(0, num_rows//chunk_size):
    out[i*cs:(i+1)*cs] = ipca.transform(features[i*cs:(i+1)*cs])[:, :k]

NOTE the slicing with :k, which selects only the first k components and ignores the rest.
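Putting the pieces together, an end-to-end sketch under the same assumptions (placeholder file and dataset names, and a chunk_size at least as large as the number of columns so every partial_fit batch is valid) could look like this:

import h5py
import numpy as np
from sklearn.decomposition import IncrementalPCA

chunk_size = 1000

with h5py.File('my_big_data.h5', 'r') as f:
    features = f['data']
    num_rows, num_cols = features.shape

    # 1) fit with as many components as the data has columns
    ipca = IncrementalPCA(n_components=num_cols)
    for start in range(0, num_rows, chunk_size):
        batch = features[start:start + chunk_size]
        if batch.shape[0] >= num_cols:         # partial_fit needs at least n_components rows per batch
            ipca.partial_fit(batch)

    # 2) pick k so that at least 90% of the variance is kept
    k = int(np.argmax(ipca.explained_variance_ratio_.cumsum() > 0.9)) + 1

    # 3) transform chunk by chunk, keeping only the first k components
    with h5py.File('my_big_data_ipca.h5', 'w') as f_out:
        out = f_out.create_dataset('data_ipca', shape=(num_rows, k), dtype='float32')
        for start in range(0, num_rows, chunk_size):
            stop = min(start + chunk_size, num_rows)
            out[start:stop] = ipca.transform(features[start:stop])[:, :k]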
