sklearn and large datasets


Question

I have a dataset of 22 GB. I would like to process it on my laptop. Of course I can't load it in memory.

I use sklearn a lot, but for much smaller datasets.

In this situation the classical approach should be something like:

Read only part of the data -> partially train your estimator -> delete the data -> read another part of the data -> continue training your estimator.

I have seen that some sklearn algorithms have a partial_fit method that should allow us to train the estimator on successive subsamples of the data.

Now I am wondering: is there an easy way to do that in sklearn? I am looking for something like:

r = read_part_of_data('data.csv')   # hypothetical chunked reader
m = sk.my_model                     # some sklearn estimator with partial_fit
for i in range(n):
    x = r.read_next_chunk(20)       # read the next 20 lines
    m.partial_fit(x)

m.predict(new_x)

Maybe sklearn is not the right tool for this kind of thing? Let me know.

Answer

I think sklearn is fine for larger data. If your chosen algorithms support partial_fit or an online learning approach then you're on track. One thing to be aware of is that your chunk size may influence your success.
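
As a rough sketch of what that can look like (the file name 'data.csv', the 'target' label column, the binary classes, and the chunk size are all assumptions here), you could stream the CSV in chunks with pandas and feed each chunk to an estimator that supports partial_fit, e.g. SGDClassifier:

import pandas as pd
from sklearn.linear_model import SGDClassifier

# Assumption: 'data.csv' has a 'target' column holding binary 0/1 labels.
clf = SGDClassifier()
classes = [0, 1]  # partial_fit needs the full set of classes on its first call

for chunk in pd.read_csv('data.csv', chunksize=100_000):
    X = chunk.drop(columns='target')
    y = chunk['target']
    clf.partial_fit(X, y, classes=classes)

Other estimators that expose partial_fit (for example MultinomialNB or MiniBatchKMeans) can be swapped in the same way.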

This link may be useful... Working with big data in python and numpy, not enough ram, how to save partial results on disc?

I agree that h5py is useful but you may wish to use tools that are already in your quiver.

Another thing you can do is randomly pick whether or not to keep each row of your csv file... and save the result to a .npy file so it loads more quickly. That way you get a sample of your data that lets you start playing with it using all the algorithms... and deal with the bigger-data issue along the way (or not at all! Sometimes a sample with a good approach is good enough, depending on what you want).
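
A minimal sketch of that sampling idea, assuming a purely numeric CSV and a hypothetical keep rate of 10% (the file names and the rate are placeholders):

import numpy as np
import pandas as pd

# Keep roughly 10% of the rows at random; skiprows can take a callable.
sample = pd.read_csv(
    'data.csv',
    skiprows=lambda i: i > 0 and np.random.rand() > 0.1,  # never skip the header
)
np.save('data_sample.npy', sample.to_numpy())

# Later runs can reload the sample much faster than re-parsing the full CSV:
sample = np.load('data_sample.npy')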
