sklearn and large datasets


Question

I have a dataset of 22 GB. I would like to process it on my laptop. Of course I can't load it in memory.

I use sklearn a lot, but for much smaller datasets.

In this situation the classical approach should be something like:

Read only part of the data -> partially train your estimator -> delete the data -> read another part of the data -> continue training your estimator.

I have seen that some sklearn algorithms have a partial_fit method that should allow us to train the estimator on successive subsamples of the data.

Now I am wondering: is there an easy way to do that in sklearn? I am looking for something like:

r = read_part_of_data('data.csv')   # hypothetical chunked reader
m = sk.my_model                     # some sklearn estimator with partial_fit
for i in range(n):
    x = r.read_next_chunk(20)       # read the next 20 lines
    m.partial_fit(x)

m.predict(new_x)

Maybe sklearn is not the right tool for this kind of thing? Let me know.

Answer

I think sklearn is fine for larger data. If your chosen algorithms support partial_fit or an online learning approach then you're on track. One thing to be aware of is that your chunk size may influence your success.
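
As a rough sketch of what that can look like (the file name 'data.csv', the 'target' label column, the binary classes, and the chunk size are all assumptions here), you could stream the CSV in chunks with pandas and feed each chunk to an estimator that supports partial_fit, e.g. SGDClassifier:

import pandas as pd
from sklearn.linear_model import SGDClassifier

# Assumption: 'data.csv' has a 'target' column holding binary 0/1 labels.
clf = SGDClassifier()
classes = [0, 1]  # partial_fit needs the full set of classes on its first call

for chunk in pd.read_csv('data.csv', chunksize=100_000):
    X = chunk.drop(columns='target')
    y = chunk['target']
    clf.partial_fit(X, y, classes=classes)

Other estimators that expose partial_fit (for example MultinomialNB or MiniBatchKMeans) can be swapped in the same way.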

This link may be useful... Working with big data in python and numpy, not enough ram, how to save partial results on disc?

I agree that h5py is useful but you may wish to use tools that are already in your quiver.

Another thing you can do is randomly pick whether or not to keep each row of your csv file... and save the result to a .npy file so it loads more quickly. That way you get a sample of your data that lets you start playing with it using all the algorithms... and deal with the bigger-data issue along the way (or not at all! Sometimes a sample with a good approach is good enough, depending on what you want).
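
A minimal sketch of that sampling idea, assuming a purely numeric CSV and a hypothetical keep rate of 10% (the file names and the rate are placeholders):

import numpy as np
import pandas as pd

# Keep roughly 10% of the rows at random; skiprows can take a callable.
sample = pd.read_csv(
    'data.csv',
    skiprows=lambda i: i > 0 and np.random.rand() > 0.1,  # never skip the header
)
np.save('data_sample.npy', sample.to_numpy())

# Later runs can reload the sample much faster than re-parsing the full CSV:
sample = np.load('data_sample.npy')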
