How to optimise filtering and counting for every row in a large R data frame

Question

I have a data frame, such as the following:

  name day wages
1  Ann   1   100
2  Ann   1   150
3  Ann   2   200
4  Ann   3   150
5  Bob   1   100
6  Bob   1   200
7  Bob   1   150
8  Bob   2   100

For every unique name/day pair, I would like to calculate a range of totals, such as 'number of times wages was greater than 175 on current or next day for this person'. There are many more columns than wages and there are four time-slices to be applied to each total for each row.

I can currently accomplish this by de-duplicating my data frame on name and day:

df.unique <- df[!duplicated(df[,c('name','day')]),]

And then for every row in df.unique, applying the following function (written longhand for clarity) to df:

# Count rows for `person` with wages above `amount` on `day` or the following day
wages_gt_for_person_today_or_next <- function(df, amount, day, person) {
  temp <- df[df$name == person, ]
  temp <- temp[temp$day == day | temp$day == day + 1, ]
  temp <- temp[temp$wages > amount, ]
  nrow(temp)
}

# Apply the function once per unique name/day pair
for (i in seq_len(nrow(df.unique))) {
  df.unique[i, "wages_gt_175_day_and_next"] <-
    wages_gt_for_person_today_or_next(df, 175, df.unique[i, "day"], df.unique[i, "name"])
}

Giving me, in this trivial example:

name day wages_gt_175_day_and_next
Ann   1   1
Ann   2   1
Ann   3   0
Bob   1   1
Bob   2   0

However, this seems an extremely slow approach, given that I have hundreds of thousands of rows. Is there a cleverer way of doing this? Something with matrix operations, apply, sqldf, anything like that?

Code to recreate example df:

df <- structure(list(name = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L), .Label = c("Ann", "Bob"), class = "factor"), day = c(1, 
1, 2, 3, 1, 1, 1, 2), wages = c(100, 150, 200, 150, 100, 200, 
150, 100)), .Names = c("name", "day", "wages"), row.names = c(NA, 
-8L), class = "data.frame")

Recommended answer

Going simply from your example output, here's something a bit fancier using data.table:

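library(data.table)
DT <- data.table(df)
setkey(DT, name, day)

# Step 1: count wages above 175 per name/day.
# Step 2: within each name, add the next row's count and flag whether the sum is positive.
# Note: this yields a 0/1 flag and treats the next row as the next day.
DT[, list(gt175 = sum(wages > 175)), by = list(name, day)][
  , list(day = day, gt175 = as.integer(gt175 + c(tail(gt175, -1), 0) > 0)), by = list(name)]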

This is a little convoluted, but should be fast.
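
The one-liner above returns a 0/1 flag and relies on each person's next row being the next calendar day. If you need the actual count, or your days can have gaps, the same two-step idea (aggregate once per name/day, then look up day + 1) can be sketched in base R as follows. The helper name count_gt_today_or_next and the output column n_gt_day_and_next are made up for this illustration and do not come from the original post; treat this as a sketch, not a benchmarked solution.

```r
# Example data from the question
df <- data.frame(
  name  = c("Ann", "Ann", "Ann", "Ann", "Bob", "Bob", "Bob", "Bob"),
  day   = c(1, 1, 2, 3, 1, 1, 1, 2),
  wages = c(100, 150, 200, 150, 100, 200, 150, 100)
)

# Hypothetical helper (not from the original post): for every name/day pair,
# count the rows with wages > amount on that day or on the following day.
count_gt_today_or_next <- function(df, amount) {
  # Step 1: one row per name/day holding the number of wages above the threshold
  per_day <- aggregate(list(n_gt = df$wages > amount),
                       by = list(name = df$name, day = df$day),
                       FUN = sum)

  # Step 2: relabel each day's count as belonging to the previous day,
  # so that a merge on (name, day) attaches the count for day + 1
  nxt <- data.frame(name = per_day$name,
                    day = per_day$day - 1,
                    n_next = per_day$n_gt)
  out <- merge(per_day, nxt, by = c("name", "day"), all.x = TRUE)

  # Pairs with no data for the following day contribute 0
  out$n_next[is.na(out$n_next)] <- 0

  out$n_gt_day_and_next <- out$n_gt + out$n_next
  out[, c("name", "day", "n_gt_day_and_next")]
}

result <- count_gt_today_or_next(df, 175)
print(result)
```

With a threshold of 175 this reproduces the counts from the question (1, 1, 0, 1, 0). Because the expensive filtering happens once in the aggregation step instead of once per unique row, the same pattern can be repeated for other columns or time slices by changing the logical condition or the day offset.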
