How to optimise filtering and counting for every row in a large R data frame


Problem description


I have a data frame, such as the following:

  name day wages
1  Ann   1   100
2  Ann   1   150
3  Ann   2   200
4  Ann   3   150
5  Bob   1   100
6  Bob   1   200
7  Bob   1   150
8  Bob   2   100

For every unique name/day pair, I would like to calculate a range of totals, such as 'number of times wages was greater than 175 on current or next day for this person'. There are many more columns than wages and there are four time-slices to be applied to each total for each row.

I can currently accomplish this by de-duplicating my data frame on name and day:

df.unique <- df[!duplicated(df[,c('name','day')]),]

And then for every row in df.unique, applying the following function (written longhand for clarity) to df:

# Count rows for one person where wages exceed `amount` on `day` or the day after
wages_gt_for_person_today_or_next <- function(df,amount,day,person) {
  temp <- df[df$name==person,]
  temp <- temp[temp$day==day|temp$day==day+1,]
  temp <- temp[temp$wages > amount,]
  return(nrow(temp))
}

# Apply the function to each unique name/day pair (the function must be defined first)
for(i in seq_len(nrow(df.unique))) {
    df.unique[i,"wages_gt_175_day_and_next"] <- wages_gt_for_person_today_or_next(df,175,df.unique[i,"day"],df.unique[i,"name"])
}

Giving me, in this trivial example:

name day wages_gt_175_day_and_next
Ann   1   1
Ann   2   1
Ann   3   0
Bob   1   1
Bob   2   0

However, this seems an extremely slow approach, given that I have hundreds of thousands of rows. Is there a cleverer way of doing this? Something with matrix operations, apply, sqldf, anything like that?

Code to recreate example df:

df <- structure(list(name = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L), .Label = c("Ann", "Bob"), class = "factor"), day = c(1, 
1, 2, 3, 1, 1, 1, 2), wages = c(100, 150, 200, 150, 100, 200, 
150, 100)), .Names = c("name", "day", "wages"), row.names = c(NA, 
-8L), class = "data.frame")

Solution

Going simply from your example output, here's something a bit fancier using data.table:

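library(data.table)
DT <- data.table(df)
setkey(DT, name, day)

# Step 1: count wages above 175 for each name/day pair.
# Step 2: within each name, flag days where that day or the next row has a count > 0.
DT[, list(gt175 = sum(wages > 175)), by = list(name, day)][
  , list(day = day, gt175 = as.integer(gt175 + c(tail(gt175, -1), 0) > 0)), by = list(name)]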

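This is a little convoluted, but should be fast. Note that it returns a 0/1 flag (whether any wage passed the threshold on that day or on the next row's day) rather than a count, which happens to match the example output. It also treats the next row within each name as the next day, so it assumes each person's days have no gaps.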
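
If an actual count is needed, or a person's days may have gaps, the same two-step idea (aggregate per name/day, then look up the following days) can be sketched in base R. This is only a sketch under assumptions: the function name `count_in_window` and its `column`, `amount` and `window` parameters are hypothetical, and it assumes `day` holds whole numbers and that names do not contain tab characters. The loop runs over the day offsets (two iterations for "today or next"), not over rows.

```r
# Example data from the question
df <- data.frame(
  name  = c("Ann", "Ann", "Ann", "Ann", "Bob", "Bob", "Bob", "Bob"),
  day   = c(1, 1, 2, 3, 1, 1, 1, 2),
  wages = c(100, 150, 200, 150, 100, 200, 150, 100)
)

# Hypothetical helper: for each name/day pair, count rows where `column`
# exceeds `amount` on that day or on any of the following `window` days.
count_in_window <- function(df, column, amount, window = 1) {
  # Step 1: number of rows above the threshold for each name/day pair
  hit <- as.integer(df[[column]] > amount)
  per_day <- aggregate(hit, by = list(name = df$name, day = df$day), FUN = sum)
  names(per_day)[3] <- "n"

  # Step 2: for each offset k, look up the same person's count on day + k.
  # A missing day (a gap, or past the last day) contributes 0.
  key <- paste(per_day$name, per_day$day, sep = "\t")
  total <- integer(nrow(per_day))
  for (k in 0:window) {
    idx <- match(paste(per_day$name, per_day$day + k, sep = "\t"), key)
    total <- total + ifelse(is.na(idx), 0L, per_day$n[idx])
  }

  per_day$total <- total
  per_day$n <- NULL
  per_day[order(per_day$name, per_day$day), ]
}

res <- count_in_window(df, "wages", 175, window = 1)
print(res)
```

Because the column name and window length are arguments, the same call can be repeated for other columns and time-slices, for example `count_in_window(df, "wages", 175, window = 3)`.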
