Multiprocessing scikit-learn
Problem description
I got LinearSVC working against a training set and a test set using the load_file
method, and I am trying to get it working in a multiprocessor environment.
How can I make multiprocessing work for LinearSVC().fit() and
LinearSVC().predict()? I am not really familiar with scikit-learn's data types yet.
I am also thinking about splitting the samples into multiple arrays, but I am not familiar with NumPy arrays or scikit-learn data structures.
That way it would be easier to put things into multiprocessing.Pool(): split the samples into chunks, train on them, and combine the trained sets back together later. Would that work?
EDIT: Here is my scenario:
Let's say we have 1 million files in the training sample set. When we want to distribute the processing of TfidfVectorizer across several processors, we have to split those samples (in my case there are only two categories, so say 500,000 samples each to train on). My server has 24 cores and 48 GB of RAM, so I want to split each topic into 1,000,000 / 24 chunks and run TfidfVectorizer on them. I would do the same for the testing sample set, as well as for SVC.fit() and decide(). Does that make sense?
Thanks.
PS: Please do not close this.
I think using SGDClassifier instead of LinearSVC for this kind of data would be a good idea, as it is much faster. For the vectorization, I suggest you look into the hash transformer PR.
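To illustrate why a hashing-based vectorizer fits the chunked setup, here is a minimal sketch (toy corpus and chunk count are made up; it is shown sequentially, but because each transform is stateless and independent, the per-chunk calls are exactly what you could farm out to a multiprocessing.Pool):

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer

# Stand-in corpus; in the real setup this would be the 1M loaded files.
docs = ["some example text number %d" % i for i in range(1000)]

n_chunks = 24  # one chunk per core
chunks = np.array_split(np.array(docs, dtype=object), n_chunks)

# HashingVectorizer is stateless, so each chunk can be transformed
# independently (e.g. via Pool.map) and the results stacked back
# together -- unlike a TfidfVectorizer fitted per chunk, whose
# vocabulary would differ from chunk to chunk.
vec = HashingVectorizer(n_features=2**18)
X = vstack([vec.transform(chunk) for chunk in chunks])
print(X.shape)  # all chunks share the same feature space
```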
For the multiprocessing: you can distribute the data sets across cores, do partial_fit, get the weight vectors, average them, distribute them back to the estimators, then do partial_fit again.
Doing parallel gradient descent is an area of active research, so there is no ready-made solution there.
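A minimal, sequential sketch of that partial_fit / average cycle (toy data, and the number of rounds is arbitrary; the inner loop over chunks is the part you would distribute across cores):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy data standing in for the vectorized documents.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

n_workers = 4  # one estimator per core in the real setup
classes = np.unique(y)

# One estimator per "core"; each sees a different chunk of the data.
estimators = [SGDClassifier(loss="hinge", random_state=i) for i in range(n_workers)]
X_chunks = np.array_split(X, n_workers)
y_chunks = np.array_split(y, n_workers)

for _ in range(5):  # a few rounds of the partial_fit / average cycle
    # This loop is what you would run in parallel, one chunk per core.
    for est, Xc, yc in zip(estimators, X_chunks, y_chunks):
        est.partial_fit(Xc, yc, classes=classes)
    # Average the weight vectors and intercepts ...
    coef = np.mean([est.coef_ for est in estimators], axis=0)
    intercept = np.mean([est.intercept_ for est in estimators], axis=0)
    # ... and distribute them back before the next round.
    for est in estimators:
        est.coef_ = coef.copy()
        est.intercept_ = intercept.copy()

print(estimators[0].score(X, y))
```

After the final averaging step all estimators carry the same weights, so any one of them can be used for prediction.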
How many classes does your data have, by the way? For each class, a separate classifier will be trained (automatically). If you have nearly as many classes as cores, it might be better and much easier to just do one class per core, by specifying n_jobs in SGDClassifier.
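For example (toy multiclass data; n_jobs here parallelizes the per-class one-vs-all fits):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Multiclass toy data; with the one-vs-all scheme each class gets its
# own binary classifier, and n_jobs trains them in parallel.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           n_classes=8, random_state=0)

clf = SGDClassifier(loss="hinge", n_jobs=4, random_state=0)
clf.fit(X, y)
print(clf.coef_.shape)  # one weight vector per class
```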