Filter each column of a data.frame based on a specific value
Problem description
Consider the following data frame:
df <- data.frame(replicate(5,sample(1:10,10,rep=TRUE)))

#   X1 X2 X3 X4 X5
#1   7  9  8  4 10
#2   2  4  9  4  9
#3   2  7  8  8  6
#4   8  9  6  6  4
#5   5  2  1  4  6
#6   8  2  2  1  7
#7   3  8  6  1  6
#8   3  8  5  9  8
#9   6  2  3 10  7
#10  2  7  4  2  9
Using dplyr, how can I filter each column (without explicitly naming them) for all values greater than or equal to 2? Something that would mimic a hypothetical

filter_each(funs(. >= 2))
Right now I'm doing:
df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2, X5 >= 2)
Which is equivalent to:
df %>% filter(!rowSums(. < 2))
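An aside beyond the original question: later dplyr releases added verbs that do what the hypothetical filter_each describes. A minimal sketch, assuming a dplyr version that provides filter_all() (the now-superseded scoped verb) and if_all():

```r
library(dplyr)

df <- data.frame(replicate(5, sample(1:10, 10, rep = TRUE)))

# Superseded scoped verb: all_vars() applies the predicate to every column
res_all <- filter_all(df, all_vars(. >= 2))

# Current idiom: if_all() keeps rows where the predicate holds
# for every selected column
res_ifall <- filter(df, if_all(everything(), ~ .x >= 2))
```

Both keep exactly the rows that df %>% filter(!rowSums(. < 2)) keeps.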
Note: Say I wanted to filter only on the first 4 columns; I would do:
df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2)
or
df %>% filter(!rowSums(.[-5] < 2))
Would there be a more efficient alternative?
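An aside beyond the original question: in later dplyr releases, tidyselect negation lets you leave one named column out of the test. A sketch, assuming a dplyr version that provides if_all() (filter_at() is the older scoped spelling):

```r
library(dplyr)

df <- data.frame(replicate(5, sample(1:10, 10, rep = TRUE)))

# Test every column except X5
res <- filter(df, if_all(-X5, ~ .x >= 2))

# Older scoped spelling of the same filter
res_at <- filter_at(df, vars(-X5), all_vars(. >= 2))
```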
Edit: sub question
How can I specify a column name and mimic a hypothetical filter_each(funs(. >= 2), -X5)?

Benchmark sub question
Since I have to run this on a large dataset, I benchmarked the suggestions.
df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))

mbm <- microbenchmark(
  Marat = df %>% filter(!rowSums(.[, !colnames(.) %in% "X5", drop = FALSE] < 2)),
  Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"],
                                       function(x, y) { call(">=", as.name(x), y) }, 2)),
  Docendo = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
  times = 50
)
Here are the results:
#Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval
#   Marat 1209.1235 1320.3233 1358.7994 1362.0590 1390.342 1448.458    50
# Richard 1151.7691 1196.3060 1222.9900 1216.3936 1256.191 1266.669    50
# Docendo  874.0247  933.1399  983.5435  985.3697 1026.901 1053.407    50
Solution

Here's another option with slice, which can be used similarly to filter in this case. The main difference is that you supply an integer vector to slice, whereas filter takes a logical vector.

df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L)))
What I like about this approach is that, because we use select inside rowSums, you can make use of all the special functions that select supplies, like matches, for example.
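To illustrate (variant selections added here, not from the original answer): any tidyselect helper can pick the columns fed to rowSums, so for this data frame these three spellings select the same columns and keep the same rows:

```r
library(dplyr)

df <- data.frame(replicate(5, sample(1:10, 10, rep = TRUE)))

# Same row filter, three different ways of selecting the columns to test
r1 <- df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L)))
r2 <- df %>% slice(which(!rowSums(select(., num_range("X", 1:4)) < 2L)))
r3 <- df %>% slice(which(!rowSums(select(., starts_with("X"), -X5) < 2L)))
```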
Let's see how it compares to the other answers:
df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))

mbm <- microbenchmark(
  Marat = df %>% filter(!rowSums(.[, !colnames(.) %in% "X5", drop = FALSE] < 2)),
  Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"],
                                       function(x, y) { call(">=", as.name(x), y) }, 2)),
  dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
  times = 50L,
  unit = "relative"
)

#Unit: relative
#     expr      min       lq   median       uq      max neval
#    Marat 1.304216 1.290695 1.290127 1.288473 1.290609    50
#  Richard 1.139796 1.146942 1.124295 1.159715 1.160689    50
# dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50
Edit note: updated with a more reliable benchmark using 50 repetitions (times = 50L).
Following a comment that base R would have the same speed as the slice approach (without specifying which base R approach is meant, exactly), I decided to update my answer with a comparison to base R using almost the same approach as in my answer. For base R I used:

base = df[!rowSums(df[-5L] < 2L), ],
base_which = df[which(!rowSums(df[-5L] < 2L)), ]
Benchmark:
df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))

mbm <- microbenchmark(
  Marat = df %>% filter(!rowSums(.[, !colnames(.) %in% "X5", drop = FALSE] < 2)),
  Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"],
                                       function(x, y) { call(">=", as.name(x), y) }, 2)),
  dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
  base = df[!rowSums(df[-5L] < 2L), ],
  base_which = df[which(!rowSums(df[-5L] < 2L)), ],
  times = 50L,
  unit = "relative"
)

#Unit: relative
#       expr      min       lq   median       uq      max neval
#      Marat 1.265692 1.279057 1.298513 1.279167 1.203794    50
#    Richard 1.124045 1.160075 1.163240 1.169573 1.076267    50
#   dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50
#       base 2.784058 2.769062 2.710305 2.669699 2.576825    50
# base_which 1.458339 1.477679 1.451617 1.419686 1.412090    50
Neither of these two base R approaches achieves better, or even comparable, performance.
Edit note #2: added benchmark with base R options.