Multiprocessing scikit-learn

Problem description

I got LinearSVC working against a training set and a test set using the load_file method, and I am now trying to get it working in a multiprocessor environment.

How can I get multiprocessing to work with LinearSVC().fit() and LinearSVC().predict()? I am not really familiar with scikit-learn's data types yet.

I am also thinking about splitting the samples into multiple arrays, but I am not familiar with numpy arrays and scikit-learn data structures.

That would make it easier to feed them into multiprocessing.Pool(): split the samples into chunks, train on them, and combine the trained results afterwards. Would that work?

EDIT: Here is my scenario:

Let's say we have 1 million files in the training sample set. To distribute the TfidfVectorizer processing across several processors, we have to split those samples (in my case there are only two categories, so let's say 500000 samples each to train on). My server has 24 cores and 48 GB of memory, so I want to split each topic into chunks (1000000 / 24) and run TfidfVectorizer on them. I would do the same for the testing sample set, as well as for SVC.fit() and decide(). Does that make sense?

Thanks.

PS: Please do not close this.

Solution

I think using SGDClassifier instead of LinearSVC for this kind of data would be a good idea, as it is much faster. For the vectorization, I suggest you look into the hash transformer PR.
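As a rough illustration of that suggestion: scikit-learn releases published after this answer include `HashingVectorizer`, a stateless text vectorizer. Treating it as the outcome of the PR mentioned above is an assumption, and the corpus, labels, and chunk size below are made up. Because the vectorizer has no vocabulary to fit, each chunk can be transformed on its own and the results stacked.

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy corpus standing in for the 1 million files (hypothetical data).
docs = [
    "the cat sat on the mat", "dogs chase cats",
    "stocks fell sharply today", "the market rallied on earnings",
] * 25
labels = [0, 0, 1, 1] * 25

# HashingVectorizer keeps no vocabulary, so there is nothing to fit and
# every chunk is mapped into the same feature space.
vectorizer = HashingVectorizer(n_features=2**18)

chunk_size = 20  # arbitrary value for this sketch
chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
# Each transform call is independent, so it could run in its own worker.
X = vstack([vectorizer.transform(chunk) for chunk in chunks])

clf = SGDClassifier(loss="hinge", random_state=0)  # hinge loss = linear SVM
clf.fit(X, labels)
train_accuracy = clf.score(X, labels)
```

The chunk loop runs serially here to keep the sketch short. Note that hashing gives raw (normalized) term counts rather than TF-IDF weights.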

For the multiprocessing: you can distribute the data set across cores, call partial_fit on each chunk, collect the weight vectors, average them, distribute the averaged weights back to the estimators, and then call partial_fit again.
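The loop described above could look roughly like the sketch below. It is a heuristic under stated assumptions, not an established recipe: the data is synthetic, the number of workers and rounds is arbitrary, and the per-chunk updates run serially where real code would dispatch them to separate processes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic two-class data standing in for the vectorized documents.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
classes = np.unique(y)

n_workers = 4  # hypothetical; one chunk per core
X_chunks = np.array_split(X, n_workers)
y_chunks = np.array_split(y, n_workers)
models = [SGDClassifier(loss="hinge", random_state=i) for i in range(n_workers)]

for _ in range(5):  # number of averaging rounds, chosen arbitrarily
    # Step 1: each model takes one pass over its own chunk. This loop is
    # serial; in practice each call would run in a separate process.
    for model, X_chunk, y_chunk in zip(models, X_chunks, y_chunks):
        model.partial_fit(X_chunk, y_chunk, classes=classes)
    # Step 2: average the weight vectors and intercepts.
    mean_coef = np.mean([m.coef_ for m in models], axis=0)
    mean_intercept = np.mean([m.intercept_ for m in models], axis=0)
    # Step 3: hand the averaged parameters back to every model, so the
    # next partial_fit call continues from the shared weights.
    for model in models:
        model.coef_ = mean_coef.copy()
        model.intercept_ = mean_intercept.copy()

accuracy = models[0].score(X, y)
```

After the last round all models hold identical parameters, so any one of them can be used for prediction.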

Doing parallel gradient descent is an area of active research, so there is no ready-made solution there.

By the way, how many classes does your data have? For each class, a separate classifier will be trained (automatically). If you have nearly as many classes as cores, it might be better and much easier to just train one class per core by specifying n_jobs in SGDClassifier.
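A minimal sketch of the n_jobs route, using synthetic four-class data as a stand-in for a real corpus (the data set and its sizes are assumptions made for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic 4-class data (hypothetical stand-in for a multi-class corpus).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    n_classes=4, random_state=0,
)

# n_jobs sets how many CPUs run the per-class fits; -1 means all of them.
clf = SGDClassifier(loss="hinge", n_jobs=-1, random_state=0)
clf.fit(X, y)

# One weight vector per class: coef_ has shape (n_classes, n_features).
n_binary_models = clf.coef_.shape[0]
```

With only two classes, as in the question, a single binary classifier is fit, so n_jobs gives no speedup there.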
