R language: problems computing "group by" or split with ff package


Problem description

I am nearly new to R, so sorry if I ask some basic questions, but I cannot find a solution to this "simple" problem: I have a database of patients (a big one: 25 million rows, 14 columns), with several rows for each "id", for example with this structure:

"id" "birth_date"  "treatment"  "date_treatment"
123   2002-01-01    2            2011-01-03
123   2002-01-01    3            2011-10-03
124   2002-01-01    6            2009-11-07
124   2002-01-01    NA           NA
...   .....         ......       ........ 
1022  2007-01-01    4            2011-01-06

I have to use the ff package to be able to work with a small amount of RAM, so ALL the processing should stay within ff functions. And I want to know, for each single "id", the minimum "age" at which he/she received a treatment of 2 or 4. So, for each single id, in generic code, that would be:

if(treatment in c(2,4)) then min(date_treatment - birth_date)

I only want to keep those minimum "ages" and perhaps the ids.
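For reference, on a small in-memory data.frame (ignoring ff for a moment), the desired result could be computed like this; a toy sketch with made-up rows following the structure shown above:

# Toy in-memory example of the desired "group by" (not using ff):
df <- data.frame(
  id             = c(123, 123, 124, 124, 1022),
  birth_date     = as.Date(c("2002-01-01", "2002-01-01", "2002-01-01", "2002-01-01", "2007-01-01")),
  treatment      = c(2, 3, 6, NA, 4),
  date_treatment = as.Date(c("2011-01-03", "2011-10-03", "2009-11-07", NA, "2011-01-06"))
)
df$age_c <- as.numeric(df$date_treatment - df$birth_date) / 365.25
sel <- df[df$treatment %in% c(2, 4), ]
aggregate(age_c ~ id, data = sel, FUN = min)   # one row per id: minimum age at treatment 2 or 4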

One solution is to do:

age_c <- (data$date_treatment - data$birth_date)/365.25;
data$age_c <- age_c;
idx <- ffwhich( data, treatment %in% c(2,4) );
result  <- data[idx,];

This keeps all the processing within ff, with no memory problems, but... I still need to find a way to take those minimum ages for each id... ffdfdply seems to be able to do that:

age_fun <- function(x){ 
  min_ <- min.ff(x$age_c); 
  data.frame( age = min_);  
}

 result2 <- ffdfdply(x = data,
               split = data$id,
               FUN = function(x) age_fun(x),
               BATCHBYTES = 5000,
               trace=TRUE
 ); 

Which takes a looooong time and also gives me a lot of different errors....

Any solution to that?
It is a general problem that is easy to do in SAS or SQL, but I cannot find the right combination in R. So the general question would be:

how do I compute a function over groups of rows that share the same value of a variable (a "group by") in very big data sets?

Thanks !!

Solution

ffdfdply is the function you need to solve your question, but you are using it wrongly and inefficiently. Think of ffdfdply as passing to each call of FUN the maximum amount of data that R allows you to put in RAM, while still making sure that all the data for each id ends up in RAM together (possibly several ids at once if they fit into RAM).

So a BATCHBYTES of 5000 is rather small (do you really have only 5 kilobytes of RAM? I guess not - did you install R on a Commodore from the 90's?). Next, your FUN age_fun is written wrongly. To see what you get inside FUN, you can print it out, as in FUN = function(x){ print(head(x)); x }. Inside FUN the data is in RAM, so you don't need min.ff; plain min will do.
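As a minimal sketch of that debugging idea (assuming the same ffdf data and column names as in the question; the object name debug_run is only illustrative), the chunk handed to FUN is an ordinary data.frame, so base functions such as head and unique work on it directly:

# Print the first rows of each chunk and how many ids it contains,
# then return the chunk unchanged so ffdfdply still gets a data.frame back.
debug_run <- ffdfdply(
  x = data[c("id", "age_c", "treatment")],
  split = data$id,
  FUN = function(x) { print(head(x)); print(length(unique(x$id))); x },
  trace = TRUE)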

Also note the remark of joran: you get multiple ids in each chunk if your RAM allows it. Make sure your FUN applies a split-apply-combine strategy, or use something like ddply inside FUN. And another remark to speed things up: do you really need to pass the whole ffdf? You only need the columns you use in the function and in the split. So ffdfdply(x = data[c("id","age_c","treatment")], split = ...) will do; otherwise you pull data into RAM which is not needed.

So, to be short, something like this will do the trick:

require(doBy)
result2 <- ffdfdply(
  x = data[c("id","age_c","treatment")], split = data$id,
  FUN = function(x) summaryBy(age_c ~ id, data=subset(x, treatment %in% c(2,4)), FUN=min))
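If you would rather avoid the doBy dependency, a comparable FUN can be written with base R aggregate swapped in for doBy::summaryBy; a sketch assuming every chunk contains at least one row with treatment 2 or 4 (otherwise aggregate errors on an empty data.frame), with result2_base as an illustrative name:

result2_base <- ffdfdply(
  x = data[c("id", "age_c", "treatment")], split = data$id,
  FUN = function(x) {
    # per chunk: keep only treatments 2 and 4, then take the minimum age per id
    sel <- subset(x, treatment %in% c(2, 4))
    aggregate(age_c ~ id, data = sel, FUN = min)
  })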

If you also want to keep the persons who did not have treatment 2 or 4 whatsoever, do it like this:

require(doBy)
result2 <- ffdfdply(
  x = data[c("id","age_c","treatment")], split = data$id,
  FUN = function(x) {
   persons <- unique(x[, "id", drop=FALSE])
   result <- merge(
     persons,
     summaryBy(age_c ~ id, data=subset(x, treatment %in% c(2,4)), FUN=min),
     by.x="id", by.y="id", all.x=TRUE, all.y=FALSE
     )
   result
})
