Multiprocessing scikit-learn


Question

I got LinearSVC working against a training set and a test set using the load_file method, and now I am trying to get it working in a multiprocessor environment.

How can I make multiprocessing work for LinearSVC().fit() and LinearSVC().predict()? I am not really familiar with scikit-learn's data types yet.

I am also thinking about splitting the samples into multiple arrays, but I am not familiar with NumPy arrays and scikit-learn data structures.

Would it be easier to feed this into multiprocessing.Pool(): split the samples into chunks, train on them, and combine the trained sets back together later? Would that work?

Here is my scenario:

Let's say we have 1 million files in the training sample set. When we want to distribute the processing of TfidfVectorizer across several processors, we have to split those samples (in my case there will only be two categories, so say 500,000 samples each to train on). My server has 24 cores and 48 GB of RAM, so I want to split each topic into 1,000,000 / 24 chunks and run TfidfVectorizer on them. I would do the same for the test sample set, as well as for SVC.fit() and decide(). Does that make sense?
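The chunking step described above can be sketched with NumPy alone. This is a minimal illustration, not the asker's actual pipeline: the sample IDs stand in for the million training files, and `np.array_split` is used because it tolerates sizes that do not divide evenly across the cores.

```python
import numpy as np

# Stand-in for the 1,000,000-file training set.
n_samples, n_cores = 1_000_000, 24
sample_ids = np.arange(n_samples)

# np.array_split (unlike np.split) allows an uneven division, so each
# of the 24 cores receives roughly n_samples / n_cores items.
chunks = np.array_split(sample_ids, n_cores)

print(len(chunks))                   # 24
print(sum(len(c) for c in chunks))   # 1000000
```

Each chunk could then be handed to one worker in a `multiprocessing.Pool` for vectorization.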

Thanks.

PS: Please do not close this.

Answer

I think using SGDClassifier instead of LinearSVC for this kind of data would be a good idea, as it is much faster. For the vectorization, I suggest you look into the hashing transformer PR.
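As a rough sketch of that suggestion: the hashing transformer the answer refers to later landed in scikit-learn as `HashingVectorizer`. The corpus and labels below are made-up placeholders; the point is that the vectorizer is stateless (no vocabulary to fit), which is what makes chunk-by-chunk processing easy.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Tiny stand-in corpus; the real input would be the training files.
docs = ["spam spam ham", "ham eggs", "spam offer now", "eggs and ham"]
labels = [1, 0, 1, 0]

# HashingVectorizer needs no fit step, so separate processes can
# transform their own chunks without sharing a vocabulary.
vec = HashingVectorizer(n_features=2**18)
X = vec.transform(docs)

# loss="hinge" makes this a linear SVM trained by SGD,
# i.e. a faster drop-in for LinearSVC on large data.
clf = SGDClassifier(loss="hinge", random_state=0)
clf.fit(X, labels)
preds = clf.predict(vec.transform(["spam offer"]))
```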

For the multiprocessing: you can distribute the data sets across cores, do partial_fit, get the weight vectors, average them, distribute them back to the estimators, and do partial_fit again.
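A minimal sketch of that averaging scheme, under some assumptions not in the original answer: the data here is synthetic, the per-chunk fits run sequentially (in a real setup each `partial_fit` call would run in its own process, e.g. via `multiprocessing.Pool`), and the averaged weights are written back by assigning `coef_` and `intercept_` directly, which works for SGDClassifier but is an implementation detail.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic, linearly separable data standing in for the real corpus.
rng = np.random.RandomState(0)
X = rng.randn(600, 20)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
classes = np.array([0, 1])

# One clone per chunk; pretend each loop iteration is a worker process.
clones = []
for idx in np.array_split(np.arange(len(X)), 4):
    c = SGDClassifier(loss="hinge", random_state=0)
    c.partial_fit(X[idx], y[idx], classes=classes)
    clones.append(c)

# Average the weight vectors and intercepts, write them into a single
# estimator; the next round of partial_fit can then start from there.
avg = SGDClassifier(loss="hinge", random_state=0)
avg.partial_fit(X[:10], y[:10], classes=classes)  # allocate coef_ shape
avg.coef_ = np.mean([c.coef_ for c in clones], axis=0)
avg.intercept_ = np.mean([c.intercept_ for c in clones], axis=0)

acc = (avg.predict(X) == y).mean()  # averaged model, scored on all data
```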

Parallel gradient descent is an area of active research, so there is no ready-made solution there.

By the way, how many classes does your data have? A separate classifier will be trained (automatically) for each class. If you have nearly as many classes as cores, it might be better and much easier to just do one class per core, by specifying n_jobs in SGDClassifier.
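To illustrate the n_jobs suggestion: SGDClassifier handles multi-class problems one-vs-all, fitting one binary classifier per class, and `n_jobs` fans those per-class fits out across cores. The dataset below is synthetic, standing in for a problem with "nearly as many classes as cores".

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic 8-class problem; each of the 8 one-vs-all fits can run
# on its own core when n_jobs=8.
X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=10, n_classes=8,
                           random_state=0)

clf = SGDClassifier(loss="hinge", n_jobs=8, random_state=0)
clf.fit(X, y)
print(clf.coef_.shape)   # one weight vector per class: (8, 50)
```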
