通过函数对每行data.table的值进行过滤 [英] filter rows by a function over values of each row, data.table

查看:63
本文介绍了通过函数对每行data.table的值进行过滤的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

从data.frame语法切换到data.table语法对我来说仍然不顺利。我认为以下事情应该是微不足道的,但不是。我在这里做错了什么:

Switch from data.frame syntax to data.table syntax is still not smooth for me. I thought the following thing should be trivial, but no. What I am doing wrong here:

> DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
> DT
   x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9

我想要这样的东西:

cols = c("y", "v") # a vector of column names or indexes
DT[rowSums(cols) > 5] # Take only rows where
# values at colums y and v satisfy a condition. 'rowSums' here is just an
# example it can be any function that return TRUE or FALSE when applied 
# to values of the row. 

这项工作,但是如果我想提供动态列名怎么办?并且我的表有很多列?

This work, but what if I want to provide dynamic column names? and my tables have many columns?

>DT[eval( quote(y + v > 5))] #and the following command gives the same result
> DT[y + v > 5]
   x y v
1: a 6 3
2: b 3 5
3: b 6 6
4: c 1 7
5: c 3 8
6: c 6 9
> DT[lapply(.SD, sum) > 5, .SDcols = 2:3] # Want the same result as above
Empty data.table (0 rows) of 3 cols: x,y,v
> DT[lapply(.SD, sum) > 5, ,.SDcols = 2:3]
Empty data.table (0 rows) of 3 cols: x,y,v
> DT[lapply(.SD, sum) > 5, , .SDcols = c("y", "v")]
Empty data.table (0 rows) of 3 cols: x,y,v

答案后更新
事实证明,有很多方法可以做到这一点,所以我想看看旺旺是表现最好的人。以下是模拟的时序代码:

Update after the answers Since it turns out there are many ways of doing this thing, I want to see which want is the best performer. Following is the simulated timing code:

nr = 1e7
DT = data.table(x=sample(c("a","b","c"),nr, replace= T),
                y=sample(2:5, nr, replace = T), v=sample(1:9, nr, T))
threshold = 5
cols = c("y", "v")
col.ids = 2:3
filter.methods = 'DT[DT[, rowSums(.SD[, cols, with = F]) > threshold]]
DT[DT[, rowSums(.SD[, col.ids, with = F]) > threshold]]
DT[DT[, rowSums(.SD) > threshold, .SDcols = cols]]
DT[DT[, rowSums(.SD) > threshold, .SDcols = c("y", "v")]]
DT[DT[, rowSums(.SD) > threshold, .SDcols = col.ids]]
DT[ ,.SD[rowSums(.SD[, col.ids, with = F]) > threshold]]
DT[ ,.SD[rowSums(.SD[, cols, with = F]) > threshold]]
DT[, .SD[rowSums(.SD) > threshold], .SDcols = cols, by = x]
DT[, .SD[rowSums(.SD) > threshold], .SDcols = col.ids, by = x]
DT[, .SD[rowSums(.SD) > threshold], .SDcols = c("y", "v"), by = x]
DT[Reduce(`+`,eval(cols))>threshold]
DT[Reduce(`+`, mget(cols)) > threshold]
'
fm <- strsplit(filter.methods, "\n")
fm <- unlist(fm)
timing = data.frame()
rn = NULL
for (e in sample(fm, length(fm))) { 
  # Seen some weird pattern with first item in 'fm', so scramble it
  rn <- c(rn, e)
  if (e == "DT[Reduce(`+`,eval(cols))>threshold]") {
    cols = quote(list(y, v))
  } else {
    cols = c("y", "v")
  }
  tm <- system.time(eval(parse(text = e)))
  timing <- rbind(timing, 
                  data.frame(
                    as.list(tm[c("user.self", "sys.self", "elapsed")])
                    )
                  )
}
rownames(timing) <- rn
timing[order(timing$elapsed),]

### OUTPUT ####
#                                                                     user.self sys.self elapsed
# DT[Reduce(`+`,eval(cols))>threshold]                                   0.416    0.168   0.581
# DT[Reduce(`+`, mget(cols)) > threshold]                                0.412    0.172   0.582
# DT[DT[, rowSums(.SD) > threshold, .SDcols = cols]]                     0.572    0.316   0.889
# DT[DT[, rowSums(.SD) > threshold, .SDcols = col.ids]]                  0.568    0.320   0.889
# DT[DT[, rowSums(.SD) > threshold, .SDcols = c("y", "v")]]              0.576    0.316   0.890
# DT[ ,.SD[rowSums(.SD[, col.ids, with = F]) > threshold]]               0.648    0.404   1.052
# DT[DT[, rowSums(.SD[, cols, with = F]) > threshold]]                   0.688    0.368   1.052
# DT[DT[, rowSums(.SD[, col.ids, with = F]) > threshold]]                0.612    0.440   1.053
# DT[ ,.SD[rowSums(.SD[, cols, with = F]) > threshold]]                  0.692    0.368   1.058
# DT[, .SD[rowSums(.SD) > threshold], .SDcols = c("y", "v"), by = x]     0.800    0.448   1.248
# DT[, .SD[rowSums(.SD) > threshold], .SDcols = col.ids, by = x]         0.836    0.412   1.248
# DT[, .SD[rowSums(.SD) > threshold], .SDcols = cols, by = x]            0.836    0.416   1.249

速度方面:

DT[Reduce(`+`,eval(cols))>threshold]
DT[Reduce(`+`, mget(cols)) > threshold]

我更喜欢我的 mget 。而且我认为其他人之所以变慢是因为它们调用 rowSums ,而 Reduce 仅有助于形成表达式。衷心感谢所有给出的答案。我很难决定选择接受答案。基于 Reduce 的操作非常特定于此 sum 操作,而 rowSums 的示例是使用任意函数的示例。

I prefer the one my mget. And I think the reason that other are slower because they call rowSums, whereas Reduce only helps to form the expression. Sincere thanks to all that have given the answers. It's hard to decide for me to pick the 'accept' answer. Reduce-based is very specific to this sum operation, while rowSums-based is an example of using arbitrary function.

推荐答案

cols = c("y", "v")

尝试

DT[DT[, rowSums(.SD[, ..cols]) > 5]]

DT[DT[, rowSums(.SD[, 2:3]) > 5]]

DT[DT[, rowSums(.SD) > 5, .SDcols = cols]]

Or

DT[DT[, rowSums(.SD) > 5, .SDcols = c("y", "v")]]

DT[DT[, rowSums(.SD) > 5, .SDcols = 2:3]]

Or

DT[ ,.SD[rowSums(.SD[, 2:3]) > 5]]

DT[ ,.SD[rowSums(.SD[, ..cols]) > 5]]

DT[, .SD[rowSums(.SD) > 5], .SDcols = cols, by = x]

Or

DT[, .SD[rowSums(.SD) > 5], .SDcols = 2:3, by = x]

Or

DT[, .SD[rowSums(.SD) > 5], .SDcols = c("y", "v"), by = x]

每个都会导致

#    x y v
# 1: a 6 3
# 2: b 3 5
# 3: b 6 6
# 4: c 1 7
# 5: c 3 8
# 6: c 6 9

一些解释:


  1. .SD 也是 data.table 对象,可以在 DT 内操作范围; .. cols [] 上方的调用帧中查找,用于 cols 对象以选择列。因此,
    这行 DT [,rowSums(.SD [,..cols])> 5] 将返回一个逻辑矢量,指示在什么情况下 DT 具有 y + v> 5 。因此,我们将添加另一个 DT 以便在 DT

  1. .SD is also a data.table object which can operate within DT scope; ..cols looks in the calling frames above the [] where it's used for the cols object to select columns. Thus, this line DT[ ,rowSums(.SD[, ..cols]) > 5] will return a logical vector indicating in which cases DT has y + v > 5. So we will add another DT in order to select this indices within DT

当您使用 .SDcols 时,只会将 .SD 缩小到这些列。因此,如果您只做 DT [,.SD [rowSums(.SD)> 5] 、. SDcols = 2:3] ,您将丢失 x 列,因此 by = x 已添加。

When you use .SDcols, it will narrow .SD only to these columns. Thus if you only do something like DT[, .SD[rowSums(.SD) > 5], .SDcols = 2:3], you will lose the x column, thus the by = x was added.

使用 .SDcols 时的另一种选择是返回逻辑矢量,然后将其嵌入到另一个 DT

Another option when using .SDcols is to return a logical vector and then imbed it in another DT

这篇关于通过函数对每行data.table的值进行过滤的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆