通过函数对每行data.table的值进行过滤 [英] filter rows by a function over values of each row, data.table
问题描述
从data.frame语法切换到data.table语法对我来说仍然不顺利。我认为以下事情应该是微不足道的,但不是。我在这里做错了什么:
Switch from data.frame syntax to data.table syntax is still not smooth for me. I thought the following thing should be trivial, but no. What I am doing wrong here:
> DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
> DT
x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
我想要这样的东西:
cols = c("y", "v") # a vector of column names or indexes
DT[rowSums(cols) > 5] # Take only rows where
# values at colums y and v satisfy a condition. 'rowSums' here is just an
# example it can be any function that return TRUE or FALSE when applied
# to values of the row.
这项工作,但是如果我想提供动态列名怎么办?并且我的表有很多列?
This work, but what if I want to provide dynamic column names? and my tables have many columns?
>DT[eval( quote(y + v > 5))] #and the following command gives the same result
> DT[y + v > 5]
x y v
1: a 6 3
2: b 3 5
3: b 6 6
4: c 1 7
5: c 3 8
6: c 6 9
> DT[lapply(.SD, sum) > 5, .SDcols = 2:3] # Want the same result as above
Empty data.table (0 rows) of 3 cols: x,y,v
> DT[lapply(.SD, sum) > 5, ,.SDcols = 2:3]
Empty data.table (0 rows) of 3 cols: x,y,v
> DT[lapply(.SD, sum) > 5, , .SDcols = c("y", "v")]
Empty data.table (0 rows) of 3 cols: x,y,v
答案后更新
事实证明,有很多方法可以做到这一点,所以我想看看旺旺是表现最好的人。以下是模拟的时序代码:
Update after the answers Since it turns out there are many ways of doing this thing, I want to see which want is the best performer. Following is the simulated timing code:
nr = 1e7
DT = data.table(x=sample(c("a","b","c"),nr, replace= T),
y=sample(2:5, nr, replace = T), v=sample(1:9, nr, T))
threshold = 5
cols = c("y", "v")
col.ids = 2:3
filter.methods = 'DT[DT[, rowSums(.SD[, cols, with = F]) > threshold]]
DT[DT[, rowSums(.SD[, col.ids, with = F]) > threshold]]
DT[DT[, rowSums(.SD) > threshold, .SDcols = cols]]
DT[DT[, rowSums(.SD) > threshold, .SDcols = c("y", "v")]]
DT[DT[, rowSums(.SD) > threshold, .SDcols = col.ids]]
DT[ ,.SD[rowSums(.SD[, col.ids, with = F]) > threshold]]
DT[ ,.SD[rowSums(.SD[, cols, with = F]) > threshold]]
DT[, .SD[rowSums(.SD) > threshold], .SDcols = cols, by = x]
DT[, .SD[rowSums(.SD) > threshold], .SDcols = col.ids, by = x]
DT[, .SD[rowSums(.SD) > threshold], .SDcols = c("y", "v"), by = x]
DT[Reduce(`+`,eval(cols))>threshold]
DT[Reduce(`+`, mget(cols)) > threshold]
'
fm <- strsplit(filter.methods, "\n")
fm <- unlist(fm)
timing = data.frame()
rn = NULL
for (e in sample(fm, length(fm))) {
# Seen some weird pattern with first item in 'fm', so scramble it
rn <- c(rn, e)
if (e == "DT[Reduce(`+`,eval(cols))>threshold]") {
cols = quote(list(y, v))
} else {
cols = c("y", "v")
}
tm <- system.time(eval(parse(text = e)))
timing <- rbind(timing,
data.frame(
as.list(tm[c("user.self", "sys.self", "elapsed")])
)
)
}
rownames(timing) <- rn
timing[order(timing$elapsed),]
### OUTPUT ####
# user.self sys.self elapsed
# DT[Reduce(`+`,eval(cols))>threshold] 0.416 0.168 0.581
# DT[Reduce(`+`, mget(cols)) > threshold] 0.412 0.172 0.582
# DT[DT[, rowSums(.SD) > threshold, .SDcols = cols]] 0.572 0.316 0.889
# DT[DT[, rowSums(.SD) > threshold, .SDcols = col.ids]] 0.568 0.320 0.889
# DT[DT[, rowSums(.SD) > threshold, .SDcols = c("y", "v")]] 0.576 0.316 0.890
# DT[ ,.SD[rowSums(.SD[, col.ids, with = F]) > threshold]] 0.648 0.404 1.052
# DT[DT[, rowSums(.SD[, cols, with = F]) > threshold]] 0.688 0.368 1.052
# DT[DT[, rowSums(.SD[, col.ids, with = F]) > threshold]] 0.612 0.440 1.053
# DT[ ,.SD[rowSums(.SD[, cols, with = F]) > threshold]] 0.692 0.368 1.058
# DT[, .SD[rowSums(.SD) > threshold], .SDcols = c("y", "v"), by = x] 0.800 0.448 1.248
# DT[, .SD[rowSums(.SD) > threshold], .SDcols = col.ids, by = x] 0.836 0.412 1.248
# DT[, .SD[rowSums(.SD) > threshold], .SDcols = cols, by = x] 0.836 0.416 1.249
速度方面:
DT[Reduce(`+`,eval(cols))>threshold]
DT[Reduce(`+`, mget(cols)) > threshold]
我更喜欢我的 mget
。而且我认为其他人之所以变慢是因为它们调用 rowSums
,而 Reduce
仅有助于形成表达式。衷心感谢所有给出的答案。我很难决定选择接受答案。基于 Reduce
的操作非常特定于此 sum
操作,而 rowSums $ c $基于c>的示例是使用任意函数的示例。
I prefer the one my mget
. And I think the reason that other are slower because they call rowSums
, whereas Reduce
only helps to form the expression. Sincere thanks to all that have given the answers. It's hard to decide for me to pick the 'accept' answer. Reduce
-based is very specific to this sum
operation, while rowSums
-based is an example of using arbitrary function.
推荐答案
cols = c("y", "v")
尝试
DT[DT[, rowSums(.SD[, ..cols]) > 5]]
或
DT[DT[, rowSums(.SD[, 2:3]) > 5]]
或
DT[DT[, rowSums(.SD) > 5, .SDcols = cols]]
Or
DT[DT[, rowSums(.SD) > 5, .SDcols = c("y", "v")]]
或
DT[DT[, rowSums(.SD) > 5, .SDcols = 2:3]]
Or
DT[ ,.SD[rowSums(.SD[, 2:3]) > 5]]
或
DT[ ,.SD[rowSums(.SD[, ..cols]) > 5]]
或
DT[, .SD[rowSums(.SD) > 5], .SDcols = cols, by = x]
Or
DT[, .SD[rowSums(.SD) > 5], .SDcols = 2:3, by = x]
Or
DT[, .SD[rowSums(.SD) > 5], .SDcols = c("y", "v"), by = x]
每个都会导致
# x y v
# 1: a 6 3
# 2: b 3 5
# 3: b 6 6
# 4: c 1 7
# 5: c 3 8
# 6: c 6 9
一些解释:
-
.SD
也是data.table
对象,可以在DT
内操作范围;.. cols
在[]
上方的调用帧中查找,用于cols
对象以选择列。因此,
这行DT [,rowSums(.SD [,..cols])> 5]
将返回一个逻辑矢量,指示在什么情况下DT
具有y + v> 5
。因此,我们将添加另一个DT
以便在DT
.SD
is also adata.table
object which can operate withinDT
scope;..cols
looks in the calling frames above the[]
where it's used for thecols
object to select columns. Thus, this lineDT[ ,rowSums(.SD[, ..cols]) > 5]
will return a logical vector indicating in which casesDT
hasy + v > 5
. So we will add anotherDT
in order to select this indices withinDT
当您使用 .SDcols
时,只会将 .SD
缩小到这些列。因此,如果您只做 DT [,.SD [rowSums(.SD)> 5] 、. SDcols = 2:3]
,您将丢失 x
列,因此 by = x
已添加。
When you use .SDcols
, it will narrow .SD
only to these columns. Thus if you only do something like DT[, .SD[rowSums(.SD) > 5], .SDcols = 2:3]
, you will lose the x
column, thus the by = x
was added.
使用 .SDcols
时的另一种选择是返回逻辑矢量,然后将其嵌入到另一个 DT
Another option when using .SDcols
is to return a logical vector and then imbed it in another DT
这篇关于通过函数对每行data.table的值进行过滤的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!