Sklearn-GMM on large datasets


Problem description

I have a large data set (I can't fit the entire data in memory). I want to fit a GMM on this data set.

Can I use GMM.fit() (sklearn.mixture.GMM) repeatedly on mini-batches of data?

Recommended answer

There is no reason to fit it repeatedly. Just randomly sample as many data points as you think your machine can compute in a reasonable time. If variation is not very high, the random sample will have approximately the same distribution as the full dataset.

import numpy as np
from sklearn.mixture import GaussianMixture  # modern replacement for the removed sklearn.mixture.GMM

# Sample row indices: np.random.choice needs 1-D input. If the data does not
# fit in memory, you can instead sample rows while reading the file.
idx = np.random.choice(len(full_dataset), size=10000, replace=False)
randomly_sampled = full_dataset[idx]

gmm = GaussianMixture(n_components=10)  # n_components value is illustrative
gmm.fit(randomly_sampled)

and use

gmm.predict(full_dataset)
# Again, you can predict one by one or batch by batch if the full dataset
# cannot be read into memory (see the sketch below)

to classify the rest.
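
If even reading the full dataset at once is a problem, prediction can be done chunk by chunk, since predict() only needs one batch at a time. Below is a minimal sketch of that idea, assuming the data is stored on disk as a NumPy array; the file name data.npy, the chunk size, and n_components are illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical on-disk array; mmap_mode reads slices from disk on demand
full_dataset = np.load("data.npy", mmap_mode="r")

# Fit on a random subsample that fits in memory (sizes are illustrative)
idx = np.random.choice(len(full_dataset), size=10000, replace=False)
gmm = GaussianMixture(n_components=10)
gmm.fit(full_dataset[idx])

# Classify the full dataset chunk by chunk so it is never fully in RAM
chunk = 100_000
labels = np.concatenate([
    gmm.predict(full_dataset[i:i + chunk])
    for i in range(0, len(full_dataset), chunk)
])

Chunked prediction is safe here because each sample's label depends only on the fitted model parameters, not on the other samples in the batch.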
