Speeding up sklearn logistic regression


Problem Description


I have a model I'm trying to build using LogisticRegression in sklearn that has a couple thousand features and approximately 60,000 samples. I'm trying to fit the model, and it's been running for about 10 minutes now. The machine I'm running it on has gigabytes of RAM and several cores at its disposal, and I was wondering if there is any way to speed the process up.

EDIT The machine has 24 cores, and here is the output of top to give an idea of memory:

Processes: 94 total, 8 running, 3 stuck, 83 sleeping, 583 threads      20:10:19
Load Avg: 1.49, 1.25, 1.19  CPU usage: 4.34% user, 0.68% sys, 94.96% idle
SharedLibs: 1552K resident, 0B data, 0B linkedit.
MemRegions: 51959 total, 53G resident, 46M private, 676M shared.
PhysMem: 3804M wired, 57G active, 1042M inactive, 62G used, 34G free.
VM: 350G vsize, 1092M framework vsize, 52556024(0) pageins, 85585722(0) pageouts
Networks: packets: 172806918/25G in, 27748484/7668M out.
Disks: 14763149/306G read, 26390627/1017G written.

I'm trying to train the model with the following:

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(C=1.0, class_weight='auto')
classifier.fit(train, response)

train has rows that are approximately 3000 values long (all floating point), and each row in response is either 0 or 1. I have approximately 50,000 observations.

Solution

UPDATE - 2017:

In the current version of scikit-learn, LogisticRegression() now has an n_jobs parameter to utilize multiple cores.

However, the actual text of the user guide suggests that multiple cores are still only being utilized during the second half of the computation. As of this update, the revised user guide for LogisticRegression now says that n_jobs chooses the "Number of CPU cores used during the cross-validation loop", whereas the other two items cited in the original answer, RandomForestClassifier() and RandomForestRegressor(), both state that n_jobs specifies "The number of jobs to run in parallel for both fit and predict". In other words, the deliberate contrast in phrasing here seems to indicate that the n_jobs parameter in LogisticRegression(), while now implemented, is not implemented as completely, or in the same way, as in the other two cases.
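
As a minimal sketch of what enabling it looks like (assuming a scikit-learn version where LogisticRegression() accepts n_jobs, and reusing the question's train and response arrays):

from sklearn.linear_model import LogisticRegression

# n_jobs=-1 requests all available cores; per the user-guide wording above,
# how much of the fit actually runs in parallel depends on the version.
# class_weight='balanced' replaces 'auto', which newer versions deprecate.
classifier = LogisticRegression(C=1.0, class_weight='balanced', n_jobs=-1)
classifier.fit(train, response)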

Thus, while it may now be possible to speed up LogisticRegression() somewhat by using multiple cores, my guess is that the speedup probably won't scale linearly with the number of cores used, as it sounds like the initial "fit" step (the first half of the algorithm) may not lend itself well to parallelization.
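
One way to check the scaling on a given machine is a quick wall-clock comparison. A rough sketch, using synthetic data sized down from the question's so it finishes quickly (the shapes here are illustrative assumptions, not the asker's data):

import time

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in shaped like a smaller version of the question's data.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 500))
y = rng.integers(0, 2, size=10000)

for jobs in (1, -1):
    clf = LogisticRegression(C=1.0, n_jobs=jobs)
    start = time.perf_counter()
    clf.fit(X, y)
    print(f"n_jobs={jobs}: fit took {time.perf_counter() - start:.2f}s")

If the two timings come out nearly identical, that is consistent with the guess above that the fit step itself is not being parallelized.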


Original Answer:

To my eye, it looks like the major issue here isn't memory, it's that you are only using one core. According to top, you are loading the system at 4.34%. If your logistic regression process is monopolizing 1 core out of 24, then that comes out to 100/24 = 4.167%. Presumably the remaining 0.17% accounts for whatever other processes you are also running on the machine, which can take up that extra sliver because the system schedules them to run in parallel on a second core.

If you look through the scikit-learn API documentation, you'll see that some of the ensemble methods such as RandomForestClassifier() or RandomForestRegressor() have an input parameter called n_jobs which directly controls the number of cores on which the package will attempt to run in parallel. The class that you are using, LogisticRegression(), doesn't define this input. The designers of scikit-learn seem to have created an interface that is generally quite consistent between classes, so if a particular input parameter is not defined for a given class, it probably means that the developers simply could not find a way to implement the option meaningfully for that class. It may be the case that the logistic regression algorithm simply doesn't lend itself well to parallelization; i.e., the potential speedup just wasn't large enough to justify implementing it with a parallel architecture.
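
For contrast, this is roughly what that n_jobs input looks like on one of the ensemble classes mentioned above (again reusing the question's train and response):

from sklearn.ensemble import RandomForestClassifier

# Each tree in the forest can be fit independently of the others, so the
# algorithm parallelizes naturally; n_jobs=-1 spreads the trees across cores.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1)
forest.fit(train, response)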

Assuming that this is the case, then no, there's not much you can do to make your code go faster. Having 24 cores doesn't help you if the underlying library functions simply weren't designed to be able to take advantage of them.
