Filter each column of a data.frame based on a specific value

Question

Consider the following data frame:

library(dplyr)

# No seed is set, so your values will differ from the sample output
df <- data.frame(replicate(5, sample(1:10, 10, replace = TRUE)))

#   X1 X2 X3 X4 X5
#1   7  9  8  4 10
#2   2  4  9  4  9
#3   2  7  8  8  6
#4   8  9  6  6  4
#5   5  2  1  4  6
#6   8  2  2  1  7
#7   3  8  6  1  6
#8   3  8  5  9  8
#9   6  2  3 10  7
#10  2  7  4  2  9

Using dplyr, how can I filter on every column (without explicitly naming each one) so that only rows where all values are greater than or equal to 2 are kept?

Something that would mimic a hypothetical filter_each(funs(. >= 2)).

Right now I am doing:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2, X5 >= 2)

Which is equivalent to:

df %>% filter(!rowSums(. < 2))

Note: if I wanted to filter on only the first 4 columns, I would do one of the following:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2) 

df %>% filter(!rowSums(.[-5] < 2))
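
To see why the rowSums form agrees with the column-by-column form, here is a minimal base R sketch. The data frame toy and its values are hypothetical, chosen only so the result can be checked by hand.

```r
# Hypothetical toy data, small enough to verify by eye
toy <- data.frame(X1 = c(7, 1, 5), X2 = c(9, 4, 2), X5 = c(10, 9, 1))

# toy < 2 is a logical matrix; rowSums() counts the TRUE values per row
below <- rowSums(toy < 2)

# !0 is TRUE and !n is FALSE for any n > 0,
# so this keeps only rows with no value below 2
keep_all <- !below

# Dropping the third column (X5) first leaves it out of the test
keep_but_x5 <- !rowSums(toy[-3] < 2)

toy[keep_all, ]      # only the first row passes on every column
toy[keep_but_x5, ]   # the third row also passes once X5 is ignored
```

The negated row count plays the role of the chained X1 >= 2, X2 >= 2, ... conditions: a row survives only when none of its tested values falls below the threshold.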

Is there a more efficient alternative?

Sub-question

How can I specify a column name and mimic a hypothetical filter_each(funs(. >= 2), -X5)?

Benchmark sub-question

Since I have to run this on a large dataset, I benchmarked the suggestions.

library(dplyr)
library(microbenchmark)

# 10 million rows, 5 columns
df <- data.frame(replicate(5, sample(1:10, 10e6, replace = TRUE)))

mbm <- microbenchmark(
    Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
    Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
    Docendo = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
    times = 50
)

Here are the results:

#Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval
#   Marat 1209.1235 1320.3233 1358.7994 1362.0590 1390.342 1448.458    50
# Richard 1151.7691 1196.3060 1222.9900 1216.3936 1256.191 1266.669    50
# Docendo  874.0247  933.1399  983.5435  985.3697 1026.901 1053.407    50

Recommended answer

Here's another option with slice, which can be used similarly to filter in this case. The main difference is that you supply an integer vector of row positions to slice, whereas filter takes a logical vector.

df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L)))

What I like about this approach is that, because we use select inside rowSums, you can make use of all the special helper functions that select supplies, such as matches.
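
The integer-versus-logical difference, and the use of select helpers inside rowSums, can be sketched on a small example. The toy data frame and the starts_with variant are illustrative assumptions, not part of the benchmarked code.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(X1 = c(7, 1, 5), X2 = c(9, 4, 2), X5 = c(10, 9, 1))

# Logical vector, one value per row; unname() drops the row names
# that rowSums() carries over from the data frame
keep <- unname(!rowSums(select(toy, -matches("X5")) < 2L))

# which() turns the logical vector into integer row positions
pos <- which(keep)

by_filter <- toy %>% filter(keep)  # filter() takes the logical vector
by_slice  <- toy %>% slice(pos)    # slice() takes the integer positions

# Any other select helper can be used in the same place
keep2 <- unname(!rowSums(select(toy, starts_with("X"), -X5) < 2L))
```

Both calls return the same rows here; the only change is whether the row test is passed as TRUE/FALSE flags or as positions.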

Let's see how it compares to the other answers:

df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))

mbm <- microbenchmark(
    Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
    Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
    dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
    times = 50L,
    unit = "relative"
)

#Unit: relative
#     expr      min       lq   median       uq      max neval
#    Marat 1.304216 1.290695 1.290127 1.288473 1.290609    50
#  Richard 1.139796 1.146942 1.124295 1.159715 1.160689    50
# dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50

Edit note: updated with a more reliable benchmark using 50 repetitions (times = 50L).

Following a comment that base R would be as fast as the slice approach (without specifying which base R approach was meant), I decided to update my answer with a comparison to base R, using almost the same approach as in my answer. For base R I used:

base = df[!rowSums(df[-5L] < 2L), ],
base_which = df[which(!rowSums(df[-5L] < 2L)), ]

Benchmark:

df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))

mbm <- microbenchmark(
  Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
  Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
  dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
  base = df[!rowSums(df[-5L] < 2L), ],
  base_which = df[which(!rowSums(df[-5L] < 2L)), ],
  times = 50L,
  unit = "relative"
)

#Unit: relative
#       expr      min       lq   median       uq      max neval
#      Marat 1.265692 1.279057 1.298513 1.279167 1.203794    50
#    Richard 1.124045 1.160075 1.163240 1.169573 1.076267    50
#   dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50
#       base 2.784058 2.769062 2.710305 2.669699 2.576825    50
# base_which 1.458339 1.477679 1.451617 1.419686 1.412090    50

Neither of these two base R approaches performs better than, or even comparably to, the slice approach: base is roughly 2.7 times slower and base_which roughly 1.45 times slower at the median.

Edit note #2: added a benchmark with base R options.
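
A note for readers on newer dplyr versions: filter_(), used in the Richard entry above, has been deprecated since dplyr 0.7.0. Assuming dplyr 1.0.4 or later (an assumption about your setup, not something the original answer covers), the hypothetical filter_each(funs(. >= 2), -X5) can be approximated with if_all(). The sketch below uses hypothetical toy data and was not part of the benchmarks above, so no claim is made about its speed.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(X1 = c(7, 1, 5), X2 = c(9, 4, 2), X5 = c(10, 9, 1))

# Keep rows where every column except X5 is >= 2
res <- toy %>% filter(if_all(-X5, ~ .x >= 2))
```

The first argument of if_all() accepts the same selection syntax as select, so helpers such as matches can be used there as well.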
