How to use sklearn's IncrementalPCA partial_fit
Problem description
I've got a rather large dataset that I would like to decompose, but it is too big to load into memory. Researching my options, it seems that sklearn's IncrementalPCA is a good choice, but I can't quite figure out how to make it work.
I can load in the data just fine:
import h5py

f = h5py.File('my_big_data.h5', 'r')  # open read-only; the data stays on disk
features = f['data']
And from this example, it seems I need to decide what size chunks I want to read from it:
num_rows = features.shape[0]  # total number of rows in the dataset
chunk_size = 10               # how many rows at a time to feed ipca
Then I can create my IncrementalPCA, stream the data chunk-by-chunk, and partially fit it (also from the example above):
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2)
for i in range(0, num_rows // chunk_size):
    ipca.partial_fit(features[i * chunk_size:(i + 1) * chunk_size])
This all runs without error, but I'm not sure what to do next. How do I actually perform the dimensionality reduction and get a new numpy array that I can manipulate further and save?
EDIT
The code above was for testing on a smaller subset of my data – as @ImanolLuengo correctly points out, it would be way better to use a larger number of dimensions and chunk size in the final code.
Recommended answer
As you guessed, the fitting is done properly, although I would suggest increasing the chunk_size to 100 or 1000 (or even higher, depending on the shape of your data).
What you have to do now is actually transform the data, again chunk by chunk:
import numpy as np

out = np.zeros((num_rows, 2))  # shape N x 2 (rows beyond the last full chunk, if any, are not filled)
for i in range(0, num_rows // chunk_size):
    out[i * chunk_size:(i + 1) * chunk_size] = ipca.transform(features[i * chunk_size:(i + 1) * chunk_size])
And that should give you your new transformed features. If you still have too many samples to fit in memory, I would suggest using out as another hdf5 dataset, as sketched below.
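For instance, a minimal sketch of streaming the transformed chunks straight into a second HDF5 file; the output filename and dataset name here are placeholders, not anything prescribed by the answer:

# Hypothetical output file and dataset names; adjust to your setup
f_out = h5py.File('my_big_data_transformed.h5', 'w')
out = f_out.create_dataset('data', shape=(num_rows, 2), dtype='float32')
for i in range(0, num_rows // chunk_size):
    out[i * chunk_size:(i + 1) * chunk_size] = ipca.transform(features[i * chunk_size:(i + 1) * chunk_size])
f_out.close()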
Also, I would argue that reducing a huge dataset to 2 components is probably not a very good idea, but it's hard to say without knowing the shape of your features. I would suggest reducing them to sqrt(features.shape[1]), as it is a decent heuristic (a rough sketch follows). Or, pro tip: use ipca.explained_variance_ratio_ to determine the best number of components for the amount of information loss you can afford.
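A minimal sketch of that heuristic, reusing features and the imports from above:

# Rule-of-thumb starting point: sqrt of the original dimensionality
n_components = int(np.sqrt(features.shape[1]))
ipca = IncrementalPCA(n_components=n_components)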
As for the explained_variance_ratio_, it returns a vector of dimension n_components (the n_components that you pass as a parameter to IPCA), where each value i indicates the fraction of the variance of your original data explained by the i-th new component.
You can follow the procedure in this answer to extract how much information is preserved by the first n components:
>>> print(ipca.explained_variance_ratio_.cumsum())
[ 0.32047581  0.59549787  0.80178824  0.932976    1.        ]
Note: the numbers are fictitious, taken from the answer above, and assume that you have reduced the IPCA to 5 components. The i-th number indicates how much of the original variance is explained by the first i+1 components (components 0 through i), as it is the cumulative sum of the explained variance ratio.
Thus, what is usually done is to fit your PCA with the same number of components as your original data:
ipca = IncrementalPCA(n_components=features.shape[1])
Then, after training on your whole data (with the iteration + partial_fit loop above), you can plot explained_variance_ratio_.cumsum() and choose how much variance you are willing to lose.
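A minimal plotting sketch, assuming matplotlib (not otherwise used in this answer) is available:

import matplotlib.pyplot as plt

# Cumulative fraction of variance explained by the first k components
plt.plot(np.cumsum(ipca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()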
Or do it automatically:

# np.argmax returns the first index where the condition holds;
# +1 turns that 0-based index into a component count
k = int(np.argmax(ipca.explained_variance_ratio_.cumsum() > 0.9)) + 1
The above finds the first index in the cumsum array whose value is > 0.9 and adds one, which gives the number of PCA components that preserve at least 90% of the original variance.
Then you can tweak the transformation to reflect it:
cs = chunk_size
out = np.zeros((num_rows, k))  # shape N x k
for i in range(0, num_rows // cs):
    out[i * cs:(i + 1) * cs] = ipca.transform(features[i * cs:(i + 1) * cs])[:, :k]
NOTE the slicing to :k, which selects only the first k components while ignoring the rest.
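Finally, once out is a plain numpy array, saving it for further work (as the question asks) is a one-liner; the filename here is just a placeholder:

np.save('features_reduced.npy', out)  # reload later with np.load('features_reduced.npy')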