Executing glmnet in parallel in R


Problem Description


My training dataset has about 200,000 records and I have 500 features. (These are sales data from a retail org.) Most of the features are 0/1 and are stored as a sparse matrix.

The goal is to predict the probability to buy for about 200 products. So, I would need to use the same 500 features to predict the probability of purchase for 200 products. Since glmnet is a natural choice for model creation, I thought about implementing glmnet in parallel for the 200 products. (Since all the 200 models are independent) But I am stuck using foreach. The code I executed was:
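A setup like the one described might be sketched as follows. The data here are synthetic stand-ins; in the real problem `x` and `target` come from the sales records, with 200,000 rows, 500 features, and 200 products:

```r
library(Matrix)  # ships with base R; glmnet accepts its sparse matrices

set.seed(1)
n <- 200   # records (200,000 in the real data)
p <- 20    # features (500 in the real data)
k <- 3     # products (200 in the real data)

# 0/1 feature matrix stored in sparse form
x <- Matrix(matrix(rbinom(n * p, 1, 0.1), n, p), sparse = TRUE)

# one 0/1 purchase column per product
target <- matrix(rbinom(n * k, 1, 0.5), n, k,
                 dimnames = list(NULL, paste0("product", 1:k)))
```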

foreach(i = 1:ncol(target)) %dopar%
{
assign(model[i],cv.glmnet(x,target[,i],family="binomial",alpha=0,type.measure="auc",grouped=FALSE,standardize=FALSE,parallel=TRUE))
}

model is a list containing the 200 model names under which I want to store the respective models.

The following code works. But it doesn't exploit the parallel structure and takes about a day to finish!

for(i in 1:ncol(target))
{ assign(model[i],cv.glmnet(x,target[,i],family="binomial",alpha=0,type.measure="auc",grouped=FALSE,standardize=FALSE,parallel=TRUE))
}

Can someone point to me on how to exploit the parallel structure in this case?

Solution

In order to execute "cv.glmnet" in parallel, you have to specify the parallel=TRUE option, and register a foreach parallel backend. This allows you to choose the parallel backend that works best for your computing environment.

Here's the documentation for the "parallel" argument from the cv.glmnet man page:

parallel: If 'TRUE', use parallel 'foreach' to fit each fold. Must register parallel before hand, such as 'doMC' or others. See the example below.

Here's an example using the doParallel package which works on Windows, Mac OS X, and Linux:

library(doParallel)
registerDoParallel(4)
m <- cv.glmnet(x, target[,1], family="binomial", alpha=0, type.measure="auc",
               grouped=FALSE, standardize=FALSE, parallel=TRUE)

This call to cv.glmnet will execute in parallel using four workers. On Linux and Mac OS X, it will execute the tasks using "mclapply", while on Windows it will use "clusterApplyLB".

Nested parallelism gets tricky, and may not help a lot with only 4 workers. I would try using a normal for loop around cv.glmnet (as in your second example) with a parallel backend registered and see what the performance is before adding another level of parallelism.
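A minimal sketch of that suggestion, using synthetic stand-ins for the question's `x` and `target` and storing each fit in a list rather than via `assign`:

```r
library(doParallel)  # also attaches foreach
library(glmnet)

# synthetic stand-ins for the question's data
set.seed(1)
x <- matrix(rnorm(200 * 20), 200, 20)
target <- matrix(rbinom(200 * 3, 1, 0.5), 200, 3)

registerDoParallel(4)  # cross-validation folds run on 4 workers

models <- vector("list", ncol(target))
for (i in seq_len(ncol(target))) {
  models[[i]] <- cv.glmnet(x, target[, i], family = "binomial", alpha = 0,
                           type.measure = "auc", grouped = FALSE,
                           standardize = FALSE, parallel = TRUE)
}
stopImplicitCluster()
```

The sequential outer loop is deliberate: each `cv.glmnet` call already keeps the 4 workers busy with its folds.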

Also note that the assignment to "model" in your first example isn't going to work when you register a parallel backend. When running in parallel, side-effects generally get thrown away, as with most parallel programming packages.
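If the outer loop over products is parallelized instead, the side-effect problem can be avoided by letting `foreach` collect the fitted models as its return value (again with synthetic stand-ins for `x` and `target`; note `parallel = FALSE` inside each task to avoid nested parallelism):

```r
library(doParallel)  # also attaches foreach
library(glmnet)

set.seed(1)
x <- matrix(rnorm(200 * 20), 200, 20)
target <- matrix(rbinom(200 * 3, 1, 0.5), 200, 3)

registerDoParallel(4)  # products run on 4 workers

# each iteration returns its fitted model; foreach gathers them into a list
models <- foreach(i = seq_len(ncol(target)), .packages = "glmnet") %dopar% {
  cv.glmnet(x, target[, i], family = "binomial", alpha = 0,
            type.measure = "auc", grouped = FALSE,
            standardize = FALSE, parallel = FALSE)
}
stopImplicitCluster()
```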
