Is scikit-learn suitable for big data tasks?


Problem description


I'm working on a TREC task involving use of machine learning techniques, where the dataset consists of more than 5 terabytes of web documents, from which bag-of-words vectors are planned to be extracted. scikit-learn has a nice set of functionalities that seems to fit my need, but I don't know whether it is going to scale well to handle big data. For example, is HashingVectorizer able to handle 5 terabytes of documents, and is it feasible to parallelize it? Moreover, what are some alternatives out there for large-scale machine learning tasks?

Answer


HashingVectorizer will work if you iteratively chunk your data into batches of, for instance, 10k or 100k documents that fit in memory.


You can then pass each batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier), and iterate over new batches.
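A minimal sketch of this out-of-core loop, using a toy in-memory list of batches as a stand-in for the real stream of document chunks read from disk (the texts and labels below are hypothetical):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless: there is no vocabulary to fit, so each
# batch can be transformed independently (and on different machines).
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier(random_state=0)

# Toy stand-in for a stream of document batches; in practice each batch
# would be ~10k-100k documents read from disk.
batches = [
    (["good great fine", "bad awful poor"], [1, 0]),
    (["great good nice", "poor bad terrible"], [1, 0]),
]

classes = [0, 1]  # partial_fit must see the full class list on the first call
for texts, labels in batches:
    X = vectorizer.transform(texts)  # sparse matrix with a fixed width
    clf.partial_fit(X, labels, classes=classes)
```

Because the hashing trick gives every batch the same fixed feature width, no pass over the full 5 TB corpus is needed before training can start.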


You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go, to monitor the accuracy of the partially trained model without waiting to have seen all the samples.


You can also do this in parallel on several machines, each working on a partition of the data, and then average the resulting coef_ and intercept_ attributes to get a final linear model for the full dataset.
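A sketch of that averaging step, simulating the per-partition training sequentially (in practice each model would be fit on its own machine; the data here is hypothetical):

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)

# One toy partition per "machine"; the hashing trick guarantees all
# partitions map into the same feature space without coordination.
partitions = [
    (["good great", "bad awful"], [1, 0]),
    (["nice fine", "poor terrible"], [1, 0]),
]

models = []
for texts, labels in partitions:
    m = SGDClassifier(random_state=0)
    m.partial_fit(vectorizer.transform(texts), labels, classes=[0, 1])
    models.append(m)

# Average the learned parameters into one of the fitted models.
final = models[0]
final.coef_ = np.mean([m.coef_ for m in models], axis=0)
final.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
```

This works because all the models share the same fixed-width hashed feature space, so their weight vectors are directly comparable element-wise.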


I discussed this topic in a talk I gave at PyData in March 2013: http://vimeo.com/63269736
