Difference between subset and filter from dplyr

Views: 217  Published: 2017/11/8 20:08:15  Tags: r filter subset


Question


It seems to me that subset and filter (from dplyr) give the same result. But my question is: is there at some point a potential difference, for example in speed or in the data sizes each can handle? Are there occasions when it is better to use one or the other?

Example:
    library(dplyr)

    df1 <- subset(airquality, Temp > 80 & Month > 5)
    df2 <- filter(airquality, Temp > 80 & Month > 5)

    summary(df1$Ozone)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    #    9.00   39.00   64.00   64.51   84.00  168.00      14

    summary(df2$Ozone)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    #    9.00   39.00   64.00   64.51   84.00  168.00      14

Solution

They are, indeed, producing the same result, and they are very similar in concept.
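
One way to check that claim is to compare the two results directly. The following is a minimal sketch, assuming dplyr is installed; the comparison deliberately ignores attributes, because the two functions may label the rows differently (subset keeps the original row names, while filter may renumber them, depending on the dplyr version).

```r
library(dplyr)

# Same condition as in the question, applied with both functions
df1 <- subset(airquality, Temp > 80 & Month > 5)
df2 <- filter(airquality, Temp > 80 & Month > 5)

# Compare cell values only; row names and other attributes are ignored
same_values <- isTRUE(all.equal(df1, df2, check.attributes = FALSE))
print(same_values)

# Inspect how each result labels its rows (this can differ between the two)
print(head(rownames(df1)))
print(head(rownames(df2)))
```

If same_values is TRUE, any remaining difference between df1 and df2 lies in the attributes, not in the data.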

The advantage of subset is that it is part of base R and doesn't require any additional packages. With small sample sizes, it seems to be a bit faster than filter (6 times faster in your example, but that's measured in microseconds).

As the data sets grow, filter seems to gain the upper hand in efficiency. At 15,300 rows, filter outpaces subset by roughly 200 microseconds at the median. And at 153,000 rows, filter is about 2.5 times faster at the median (measured in milliseconds).

So in terms of human time, I don't think there's much difference between the two.

The other advantage (and this is a bit of a niche advantage) is that filter can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.
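
As a rough illustration of the database case, here is a sketch that runs the same filter against an in-memory SQLite table. It assumes the DBI, RSQLite and dbplyr packages are installed (none of them is mentioned in the answer), and the table name "airquality" is chosen only for this example.

```r
library(dplyr)

# Assumption: DBI, RSQLite and dbplyr are installed.
# An in-memory SQLite database stands in for a real SQL server.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, airquality, "airquality")

# air_db is a reference to the remote table, not a local data frame
air_db <- tbl(con, "airquality")

# The filter is translated to SQL and is not evaluated yet
hot_db <- filter(air_db, Temp > 80 & Month > 5)
show_query(hot_db)

# collect() runs the query and pulls only the matching rows into memory
hot <- collect(hot_db)

DBI::dbDisconnect(con)
```

subset has no counterpart for this: it works on objects that are already in memory, so the full table would have to be read into R first.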

Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.

    library(dplyr)
    library(microbenchmark)

    # Original example
    microbenchmark(
      df1 <- subset(airquality, Temp > 80 & Month > 5),
      df2 <- filter(airquality, Temp > 80 & Month > 5)
    )
    # Unit: microseconds
    #    expr     min       lq     mean   median      uq      max neval cld
    #  subset  95.598 107.7670 118.5236 119.9370 125.949  167.443   100  a
    #  filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997   100   b

    # 15,300 rows
    air <- lapply(1:100, function(x) airquality) %>% bind_rows
    microbenchmark(
      df1 <- subset(air, Temp > 80 & Month > 5),
      df2 <- filter(air, Temp > 80 & Month > 5)
    )
    # Unit: microseconds
    #    expr      min        lq     mean   median       uq      max neval cld
    #  subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392   100   b
    #  filter  968.586  985.4475 1056.686 1023.862 1036.765 2489.644   100  a

    # 153,000 rows
    air <- lapply(1:1000, function(x) airquality) %>% bind_rows
    microbenchmark(
      df1 <- subset(air, Temp > 80 & Month > 5),
      df2 <- filter(air, Temp > 80 & Month > 5)
    )
    # Unit: milliseconds
    #    expr       min        lq     mean    median        uq      max neval cld
    #  subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659   100   b
    #  filter  5.046148  5.169164 10.27829  5.387484  6.738167 65.38937   100  a
