Modeling a very big data set (1.8 Million rows x 270 Columns) in R

Question

I am working on a Windows 8 OS with 8 GB of RAM. I have a data.frame of 1.8 million rows x 270 columns on which I have to perform a glm (logit/any other classification).

I've tried using the ff and bigglm packages to handle the data.

But I am still running into the error "Error: cannot allocate vector of size 81.5 Gb". So I decreased the number of rows to 10 and tried the bigglm steps on an object of class ffdf, but the error persists.

Can anyone suggest a solution to the problem of building a classification model with this many rows and columns?

**EDITS**:

I am not using any other program while I run the code. Before I run the code, 60% of the system's RAM is free, and the shortfall is because of the R program; when I terminate R, 80% of the RAM is free.
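These figures can also be checked from inside R itself; a minimal sketch, assuming a Windows build of R from this era (memory.limit() and memory.size() were Windows-only helpers and have since been removed in R >= 4.2):

memory.limit()   # address-space cap for this R session, in MB
memory.size()    # memory currently used by this R session, in MB
gc()             # force a garbage collection and print a usage summary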

I am adding some of the columns I am working with now, as suggested by the commenters, for reproducibility. OPEN_FLG is the DV and the others are IDVs.

str(x[1:10,])
'data.frame':   10 obs. of  270 variables:
 $ OPEN_FLG                   : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1    
 $ new_list_id                : Factor w/ 9 levels "0","3","5","6",..: 1 1 1 1 1 1 1 1 1 1    
 $ new_mailing_id             : Factor w/ 85 levels "1398","1407",..: 1 1 1 1 1 1 1 1 1 1    
 $ NUM_OF_ADULTS_IN_HHLD      : num  3 2 6 3 3 3 3 6 4 4    
 $ NUMBER_OF_CHLDRN_18_OR_LESS: Factor w/ 9 levels "","0","1","2",..: 2 2 4 7 3 5 3 4 2 5    
 $ OCCUP_DETAIL               : Factor w/ 49 levels "","00","01","02",..: 2 2 2 2 2 2 2 21 2 2    
 $ OCCUP_MIX_PCT              : num  0 0 0 0 0 0 0 0 0 0    
 $ PCT_CHLDRN                 : int  28 37 32 23 36 18 40 22 45 21   
 $ PCT_DEROG_TRADES           : num  41.9 38 62.8 2.9 16.9 ...    
 $ PCT_HOUSEHOLDS_BLACK       : int  6 71 2 1 0 4 3 61 0 13    
 $ PCT_OWNER_OCCUPIED         : int  91 66 63 38 86 16 79 19 93 22    
 $ PCT_RENTER_OCCUPIED        : int  8 34 36 61 14 83 20 80 7 77    
 $ PCT_TRADES_NOT_DEROG       : num  53.7 55 22.2 92.3 75.9 ...    
 $ PCT_WHITE                  : int  69 28 94 84 96 79 91 29 97 79    
 $ POSTAL_CD                  : Factor w/ 104568 levels "010011203","010011630",..: 23789 45173 32818 6260 88326 29954 28846 28998 52062 47577    
 $ PRES_OF_CHLDRN_0_3         : Factor w/ 4 levels "","N","U","Y": 2 2 3 4 2 4 2 4 2 4    
 $ PRES_OF_CHLDRN_10_12       : Factor w/ 4 levels "","N","U","Y": 2 2 4 3 3 2 3 2 2 3    
 [list output truncated]

And this is an example of the code I am using.

require(biglm)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)

require(ff)
require(ffbase)  # ffseq_len() and expand.ffgrid() are exported by ffbase
x$id <- ffseq_len(nrow(x))
## Explode every row 100 times: 1.8M rows become 180M rows
xex <- expand.ffgrid(x$id, ff(1:100))
colnames(xex) <- c("id","explosion.nr")
xex <- merge(xex, x, by.x="id", by.y="id", all.x=TRUE, all.y=FALSE)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = xex)

The problem is that both times I get the same error: "Error: cannot allocate vector of size 81.5 Gb".

Please let me know if this is enough, or whether I should include any more details about the problem.

Answer

I have the impression you are not using ffbase::bigglm.ffdf, but you want to be. Namely, the following will put all your data in RAM and will use biglm::bigglm.function, which is not what you want.

require(biglm)
## x is a plain data.frame, so this dispatches to biglm::bigglm.function
## and materialises everything in RAM
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)

You need to use ffbase::bigglm.ffdf, which works chunkwise on an ffdf. So load the ffbase package, which exports bigglm.ffdf. If you use ffbase, you can do the following:

require(ffbase)
## Restrict to the columns actually used in the model
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
## Recode the factor response as logical for the binomial fit
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
## Dispatches to ffbase::bigglm.ffdf, which fits the glm chunkwise
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())

Explanation: because you do not limit yourself to the columns used in the model, all the columns of your xex ffdf are pulled into RAM, which is not needed. You were using a gaussian model on a factor response, which is bizarre; I believe you were trying to do a logistic regression, so use the appropriate family argument. And it will use ffbase::bigglm.ffdf and not biglm::bigglm.function.
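To see why restricting the columns matters, here is a rough back-of-the-envelope sketch (illustrative arithmetic only; the exact 81.5 Gb object R tried to allocate may correspond to a different intermediate):

## Illustrative arithmetic, assuming dense 8-byte columns in RAM:
rows <- 1.8e6 * 100      # xex after the 1:100 explosion: 180 million rows
rows * 8 / 2^30          # ~1.3 GiB for a single double column
rows * 270 * 8 / 2^30    # ~362 GiB if all 270 columns were materialised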

If that does not work - which I doubt - it is because you have other things in RAM which you are not aware of. In that case, do:

require(ffbase)
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
## Save the trimmed ffdf to disk so R can be restarted with a clean session
ffsave(mymodeldataset, file = "mymodeldataset")

## Open R again
require(ffbase)
require(biglm)
ffload("mymodeldataset")
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())
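Note that ffsave() writes a mymodeldataset.ffData archive of the backing ff files plus a mymodeldataset.RData image, which is what lets ffload() restore the ffdf in the fresh session without touching the original xex. Once the fit succeeds, the model can be inspected with biglm's usual accessors; a minimal sketch, assuming the bigglm call above returned without error:

summary(mymodel)     # coefficient table with approximate standard errors
coef(mymodel)        # named vector of fitted coefficients
exp(coef(mymodel))   # odds ratios, since this is a logit model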

And off you go.
