Filter each column of a data.frame based on a specific value


Problem description


Consider the following data frame:

df <- data.frame(replicate(5, sample(1:10, 10, replace = TRUE)))

#   X1 X2 X3 X4 X5
#1   7  9  8  4 10
#2   2  4  9  4  9
#3   2  7  8  8  6
#4   8  9  6  6  4
#5   5  2  1  4  6
#6   8  2  2  1  7
#7   3  8  6  1  6
#8   3  8  5  9  8
#9   6  2  3 10  7
#10  2  7  4  2  9

Using dplyr, how can I filter on every column (without explicitly naming each one) so that only rows where all values are greater than or equal to 2 are kept?

Something that would mimic a hypothetical filter_each(funs(. >= 2))

Right now I'm doing:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2, X5 >= 2)

Which is equivalent to:

df %>% filter(!rowSums(. < 2))
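
To see why the rowSums form is equivalent, it may help to take it apart step by step. The sketch below uses only base R and a small made-up data frame (`toy` and its values are assumptions for illustration, not data from the question):

```r
# Small fixed data frame (hypothetical values, chosen for illustration)
toy <- data.frame(X1 = c(7, 2, 1, 8),
                  X2 = c(9, 4, 7, 0),
                  X3 = c(8, 9, 8, 6))

below   <- toy < 2          # logical matrix: TRUE where a value is below 2
n_below <- rowSums(below)   # number of values below 2 in each row
keep    <- !n_below         # !0 is TRUE, !(anything > 0) is FALSE

toy[keep, ]                 # only rows with no value below 2 remain
```

Here rows 3 and 4 each contain one value below 2, so only rows 1 and 2 are kept, which is what chaining `X1 >= 2, X2 >= 2, X3 >= 2` in filter would give.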

Note: if I wanted to filter on only the first 4 columns, I would do:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2) 

or

df %>% filter(!rowSums(.[-5] < 2))

Would there be a more efficient alternative?

Edit: sub question

How can I specify a column name and mimic a hypothetical filter_each(funs(. >= 2), -X5)?

Benchmark sub question

Since I have to run this on a large dataset, I benchmarked the suggestions.

library(dplyr)
library(microbenchmark)

# 10e6 rows by 5 columns of random integers between 1 and 10
df <- data.frame(replicate(5, sample(1:10, 10e6, replace = TRUE)))

mbm <- microbenchmark(
    Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
    Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
    Docendo = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
    times = 50
)

Here are the results:

#Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval
#   Marat 1209.1235 1320.3233 1358.7994 1362.0590 1390.342 1448.458    50
# Richard 1151.7691 1196.3060 1222.9900 1216.3936 1256.191 1266.669    50
# Docendo  874.0247  933.1399  983.5435  985.3697 1026.901 1053.407    50
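
The absolute timings are easier to compare once they are expressed relative to the fastest entry. A quick sketch, using the median values copied from the table above (the variable names `med` and `rel` are just for illustration):

```r
# Median timings in milliseconds, copied from the benchmark output above
med <- c(Marat = 1362.0590, Richard = 1216.3936, Docendo = 985.3697)

# Each median divided by the fastest one
rel <- med / min(med)
round(rel, 2)
```

On these numbers the slice-based suggestion is the fastest, with the other two taking roughly 1.38 and 1.23 times as long at the median.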

Solution

Here's another option using slice, which can be used much like filter in this case. The main difference is that you supply an integer vector of row positions to slice, whereas filter takes a logical vector.

df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L)))
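
To make the logical-versus-integer distinction concrete, here is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration). It builds the logical vector once and passes it to filter directly and to slice via which:

```r
library(dplyr)

# Hypothetical data: X5 holds values below 2 in rows 1 and 2, but it is excluded from the check
toy <- data.frame(X1 = c(7, 2, 1, 8),
                  X2 = c(9, 4, 7, 0),
                  X5 = c(1, 1, 9, 9))

# Logical vector: TRUE for rows where every column except X5 is >= 2
# (unname() only strips the row names that rowSums() carries along)
keep <- unname(!rowSums(select(toy, -matches("X5")) < 2))

by_filter <- filter(toy, keep)        # filter() takes the logical vector as is
by_slice  <- slice(toy, which(keep))  # slice() needs integer row positions

which(keep)                           # the positions handed to slice()
all.equal(by_filter, by_slice)        # both routes keep the same rows
```

Rows 1 and 2 survive even though their X5 values are below 2, because X5 is dropped before the comparison.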

What I like about this approach is that, because select is used inside rowSums, you can make use of all the special helper functions that select supplies, such as matches.
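
As a rough illustration of that flexibility, the sketch below expresses "every column except X5" in three different ways with select. The data frame `toy` and the result names are made up for the example; `ends_with` and the `X1:X2` range are standard select features, not something taken from the benchmark:

```r
library(dplyr)

# Hypothetical data, chosen so that X5 would change the result if it were included
toy <- data.frame(X1 = c(7, 2, 1, 8),
                  X2 = c(9, 4, 7, 0),
                  X5 = c(1, 1, 9, 9))

# Exclude X5 by regular-expression match on the name
res_matches <- toy %>% slice(which(!rowSums(select(., -matches("X5")) < 2)))

# Exclude X5 by name suffix
res_ends <- toy %>% slice(which(!rowSums(select(., -ends_with("5")) < 2)))

# Include only the range of columns to be checked
res_range <- toy %>% slice(which(!rowSums(select(., X1:X2) < 2)))

all.equal(res_matches, res_ends)
all.equal(res_matches, res_range)
```

All three keep the same two rows here; which helper reads best will depend on how the real column names are structured.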


Let's see how it compares to the other answers:

df <- data.frame(replicate(5, sample(1:10, 10e6, replace = TRUE)))

mbm <- microbenchmark(
    Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
    Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
    dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
    times = 50L,
    unit = "relative"
)

#Unit: relative
#     expr      min       lq   median       uq      max neval
#    Marat 1.304216 1.290695 1.290127 1.288473 1.290609    50
#  Richard 1.139796 1.146942 1.124295 1.159715 1.160689    50
# dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50

Edit note: updated with a more reliable benchmark using 50 repetitions (times = 50L).


Following a comment that base R would be as fast as the slice approach (without specifying exactly which base R approach was meant), I decided to update my answer with a comparison to base R, using almost the same approach as in my answer. For base R I used:

base = df[!rowSums(df[-5L] < 2L), ],
base_which = df[which(!rowSums(df[-5L] < 2L)), ]
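
Apart from speed, the two base R forms differ in one behaviour that may matter on real data: how they treat missing values. The benchmark data contains no NA, so this is a side note rather than part of the comparison. A small sketch with a made-up data frame (`toy` has three columns, so X5 is dropped with `-3L` here rather than `-5L`):

```r
# Hypothetical data with one missing value in row 3
toy <- data.frame(X1 = c(7, 2, NA, 8),
                  X2 = c(9, 4, 7, 0),
                  X5 = c(1, 1, 9, 9))

keep <- !rowSums(toy[-3L] < 2)   # drop the third column (X5) by position

keep                 # row 3 is NA because its row sum is NA
toy[keep, ]          # logical indexing turns the NA into a row filled with NA
toy[which(keep), ]   # which() drops the NA, so only rows 1 and 2 are returned
```

So the which variant not only hands integer positions to the subsetting step, it also discards rows whose condition cannot be evaluated.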

Benchmark:

df <- data.frame(replicate(5, sample(1:10, 10e6, replace = TRUE)))

mbm <- microbenchmark(
  Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
  Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
  dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
  base = df[!rowSums(df[-5L] < 2L), ],
  base_which = df[which(!rowSums(df[-5L] < 2L)), ],
  times = 50L,
  unit = "relative"
)

#Unit: relative
#       expr      min       lq   median       uq      max neval
#      Marat 1.265692 1.279057 1.298513 1.279167 1.203794    50
#    Richard 1.124045 1.160075 1.163240 1.169573 1.076267    50
#   dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50
#       base 2.784058 2.769062 2.710305 2.669699 2.576825    50
# base_which 1.458339 1.477679 1.451617 1.419686 1.412090    50

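Neither of these two base R approaches performs better than, or even comparably to, the slice approach: at the median, base_which takes about 1.45 times as long and base about 2.7 times as long.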

Edit note #2: added benchmark with base R options.
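
For readers on a more recent dplyr, something close to the hypothetical filter_each from the question can be written with if_all. This is a sketch under the assumption that dplyr 1.0.4 or later is installed; it was not part of the benchmarks above, so nothing is claimed here about its speed. The data frame `toy` is made up for illustration:

```r
library(dplyr)

# Hypothetical data: X5 has values below 2, but it is excluded from the check
toy <- data.frame(X1 = c(7, 2, 1, 8),
                  X2 = c(9, 4, 7, 0),
                  X5 = c(1, 1, 9, 9))

# Keep rows where every column except X5 is >= 2 (assumes dplyr >= 1.0.4)
res_if_all <- toy %>% filter(if_all(-X5, ~ .x >= 2))

# The slice() approach from this answer, for comparison
res_slice <- toy %>% slice(which(!rowSums(select(., -matches("X5")) < 2)))

all.equal(res_if_all, res_slice)
```

Both forms keep the same rows on this example; if performance matters, it would be worth adding the if_all version to the microbenchmark call and measuring it on data of the intended size.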
