Multiprocessing scikit-learn
Problem description
I got LinearSVC working against a training set and a test set using the load_file
method, and I am trying to get it working in a multiprocessor environment.
How can I make multiprocessing work for LinearSVC().fit() and
LinearSVC().predict()? I am not really familiar with scikit-learn's data types yet.
I am also thinking about splitting the samples into multiple arrays, but I am not familiar with NumPy arrays or scikit-learn data structures.
That way it would be easier to put things into multiprocessing.Pool(): split the samples into chunks, train on them, and combine the trained sets back together later. Would that work?
EDIT: Here is my scenario:
Let's say we have 1 million files in the training sample set. When we want to distribute the processing of TfidfVectorizer across several processors, we have to split those samples (in my case there are only two categories, so say 500,000 samples each to train on). My server has 24 cores and 48 GB of RAM, so I want to split each topic into 1,000,000 / 24 chunks and run TfidfVectorizer on them. I would do the same for the testing sample set, as well as for SVC.fit() and decide(). Does that make sense?
Thanks.
PS: Please do not close this.
I think using SGDClassifier instead of LinearSVC for this kind of data would be a good idea, as it is much faster. For the vectorization, I suggest you look into the hash transformer PR.
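To illustrate why a hashing-based vectorizer fits the chunked setup, here is a minimal sketch (toy corpus and chunk count are made up; it is shown sequentially, but because each transform is stateless and independent, the per-chunk calls are exactly what you could farm out to a multiprocessing.Pool):

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer

# Stand-in corpus; in the real setup this would be the 1M loaded files.
docs = ["some example text number %d" % i for i in range(1000)]

n_chunks = 24  # one chunk per core
chunks = np.array_split(np.array(docs, dtype=object), n_chunks)

# HashingVectorizer is stateless, so each chunk can be transformed
# independently (e.g. via Pool.map) and the results stacked back
# together -- unlike a TfidfVectorizer fitted per chunk, whose
# vocabulary would differ from chunk to chunk.
vec = HashingVectorizer(n_features=2**18)
X = vstack([vec.transform(chunk) for chunk in chunks])
print(X.shape)  # all chunks share the same feature space
```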
For the multiprocessing: you can distribute the data sets across cores, do partial_fit, get the weight vectors, average them, distribute them back to the estimators, then do partial_fit again.
Doing parallel gradient descent is an area of active research, so there is no ready-made solution there.
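A minimal, sequential sketch of that partial_fit / average cycle (toy data, and the number of rounds is arbitrary; the inner loop over chunks is the part you would distribute across cores):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy data standing in for the vectorized documents.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

n_workers = 4  # one estimator per core in the real setup
classes = np.unique(y)

# One estimator per "core"; each sees a different chunk of the data.
estimators = [SGDClassifier(loss="hinge", random_state=i) for i in range(n_workers)]
X_chunks = np.array_split(X, n_workers)
y_chunks = np.array_split(y, n_workers)

for _ in range(5):  # a few rounds of the partial_fit / average cycle
    # This loop is what you would run in parallel, one chunk per core.
    for est, Xc, yc in zip(estimators, X_chunks, y_chunks):
        est.partial_fit(Xc, yc, classes=classes)
    # Average the weight vectors and intercepts ...
    coef = np.mean([est.coef_ for est in estimators], axis=0)
    intercept = np.mean([est.intercept_ for est in estimators], axis=0)
    # ... and distribute them back before the next round.
    for est in estimators:
        est.coef_ = coef.copy()
        est.intercept_ = intercept.copy()

print(estimators[0].score(X, y))
```

After the final averaging step all estimators carry the same weights, so any one of them can be used for prediction.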
How many classes does your data have, by the way? For each class, a separate classifier will be trained (automatically). If you have nearly as many classes as cores, it might be better and much easier to just do one class per core, by specifying n_jobs in SGDClassifier.
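For example (toy multiclass data; n_jobs here parallelizes the per-class one-vs-all fits):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Multiclass toy data; with the one-vs-all scheme each class gets its
# own binary classifier, and n_jobs trains them in parallel.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           n_classes=8, random_state=0)

clf = SGDClassifier(loss="hinge", n_jobs=4, random_state=0)
clf.fit(X, y)
print(clf.coef_.shape)  # one weight vector per class
```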