Is scikit-learn suitable for big data tasks?
Question
I'm working on a TREC task that involves machine learning techniques. The dataset consists of more than 5 terabytes of web documents, from which I plan to extract bag-of-words vectors. scikit-learn has a nice set of features that seems to fit my needs, but I don't know whether it will scale well to big data. For example, can HashingVectorizer handle 5 terabytes of documents, and is it feasible to parallelize it? Also, what alternatives exist for large-scale machine learning tasks?
Recommended answer
HashingVectorizer will work if you iteratively chunk your data into batches that fit in memory, for instance 10k or 100k documents each.
You can then pass each batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and then iterate over new batches.
You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go, to monitor the accuracy of the partially trained model without waiting until it has seen all the samples.
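The batch-wise workflow described above can be sketched roughly as follows. This is a minimal illustration, not a tuned pipeline: the helper names `iter_batches` and `train_out_of_core`, the `(text, label)` stream format, and the default `batch_size` and `n_features` values are all assumptions made for the example.

```python
from itertools import islice

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_batches(docs, batch_size):
    """Yield lists of at most batch_size items from any iterable (hypothetical helper)."""
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def train_out_of_core(doc_stream, classes, val_texts, val_labels,
                      batch_size=10_000, n_features=2**20):
    """Train a linear model batch by batch on a stream of (text, label) pairs."""
    # HashingVectorizer is stateless, so no fitting pass over the corpus is needed.
    vectorizer = HashingVectorizer(n_features=n_features)
    clf = SGDClassifier(random_state=0)
    classes = np.asarray(classes)

    # Vectorize the held-out validation set once, up front.
    X_val = vectorizer.transform(val_texts)
    scores = []

    for batch in iter_batches(doc_stream, batch_size):
        texts, labels = zip(*batch)
        X = vectorizer.transform(texts)
        # The full list of classes must be known on the first partial_fit call.
        clf.partial_fit(X, labels, classes=classes)
        # Monitor accuracy of the partially trained model as training proceeds.
        scores.append(clf.score(X_val, val_labels))

    return clf, scores
```

In practice `doc_stream` would be a generator that reads documents lazily from disk, so that only one batch is held in memory at a time.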
You can also do this in parallel on several machines, each working on a partition of the data, and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
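The averaging step might look like the sketch below. It assumes every partition's model was trained with the same classes and the same feature space (for example, the same HashingVectorizer settings). The function name `average_linear_models` is hypothetical, and the two models are trained in one process on toy random data purely for illustration.

```python
import copy

import numpy as np
from sklearn.linear_model import SGDClassifier


def average_linear_models(models):
    """Return a copy of models[0] with coef_ and intercept_ averaged over all models."""
    averaged = copy.deepcopy(models[0])
    averaged.coef_ = np.mean([m.coef_ for m in models], axis=0)
    averaged.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
    return averaged


# Toy stand-in for two machines, each training on its own partition of the data.
rng = np.random.RandomState(0)
X = rng.randn(40, 5)
y = (X[:, 0] > 0).astype(int)
classes = np.array([0, 1])

m1 = SGDClassifier(random_state=0).partial_fit(X[:20], y[:20], classes=classes)
m2 = SGDClassifier(random_state=1).partial_fit(X[20:], y[20:], classes=classes)

final_model = average_linear_models([m1, m2])
```

On a real cluster, each worker would send back only its `coef_` and `intercept_` arrays, which are small compared with the data partitions themselves.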
I discussed this in a talk I gave at PyData in March 2013: http://vimeo.com/63269736
See also: https://github.com/ogrisel/parallel_ml_tutorial