Difference between subset and filter from dplyr

Question

It seems to me that subset and filter (from dplyr) produce the same result. But my question is: is there a potential difference at some point, for example in speed or in the data sizes each can handle? Are there situations where it is better to use one or the other?

Example:

library(dplyr)

df1<-subset(airquality, Temp>80 & Month > 5)
df2<-filter(airquality, Temp>80 & Month > 5)

summary(df1$Ozone)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
# 9.00   39.00   64.00   64.51   84.00  168.00      14 

summary(df2$Ozone)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
# 9.00   39.00   64.00   64.51   84.00  168.00      14 

Recommended answer

They do indeed produce the same result, and they are very similar in concept.
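One way to check the "same result" claim directly, rather than by comparing summaries, is sketched below. It assumes only base R and dplyr; the variable name same_values is made up for illustration. The comparison deliberately ignores attributes, because row names are one place where the two results can differ.

```r
library(dplyr)

df1 <- subset(airquality, Temp > 80 & Month > 5)
df2 <- filter(airquality, Temp > 80 & Month > 5)

# Compare the column values only; attributes such as row names are ignored.
same_values <- isTRUE(all.equal(df1, df2, check.attributes = FALSE))
print(same_values)

# subset() keeps the original row labels, while filter() renumbers from 1,
# so the row names of the two results are worth inspecting side by side.
print(head(rownames(df1)))
print(head(rownames(df2)))
```

If later code relies on the original row labels, that difference matters more than the timing differences discussed here.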

The advantage of subset is that it is part of base R and doesn't require any additional packages. With small sample sizes, it also seems to be a bit faster than filter (roughly five to six times faster in your example, but that's measured in microseconds).

As the data sets grow, filter seems to gain the upper hand in efficiency. At 15,300 records, filter outpaces subset by roughly 200 microseconds. And at 153,000 records, filter is between two and three times faster (measured in milliseconds).

So in terms of human time, I don't think there's much difference between the two.

The other advantage of filter (and this is a bit of a niche advantage) is that it can operate on SQL databases without pulling the data into memory. subset simply doesn't do that.
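A minimal sketch of the database case follows. It assumes the DBI, RSQLite and dbplyr packages are installed, and it uses an in-memory SQLite database as a stand-in for a real server. The table name "airquality" and the variables air_db, hot and hot_local are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(DBI)

# In-memory SQLite database standing in for a remote SQL server (assumption).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "airquality", airquality)

# A lazy reference to the table; no rows are pulled into R at this point.
air_db <- tbl(con, "airquality")

# filter() is translated to a SQL WHERE clause instead of running in R.
hot <- filter(air_db, Temp > 80 & Month > 5)
show_query(hot)

# collect() runs the query and brings only the matching rows into memory.
hot_local <- collect(hot)

dbDisconnect(con)
```

Only the rows that satisfy the condition are transferred by collect(), which is the point of doing the filtering on the database side.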

Personally, I tend to use filter, but only because I'm already using the dplyr framework. If you aren't working with out-of-memory data, it won't make much of a difference.

library(dplyr)
library(microbenchmark)

# Original example
microbenchmark(
  df1<-subset(airquality, Temp>80 & Month > 5),
  df2<-filter(airquality, Temp>80 & Month > 5)
)

Unit: microseconds
   expr     min       lq     mean   median      uq      max neval cld
 subset  95.598 107.7670 118.5236 119.9370 125.949  167.443   100  a 
 filter 551.886 564.7885 599.4972 571.5335 594.993 2074.997   100   b


# 15,300 rows
air <- lapply(1:100, function(x) airquality) %>% bind_rows

microbenchmark(
  df1<-subset(air, Temp>80 & Month > 5),
  df2<-filter(air, Temp>80 & Month > 5)
)

Unit: microseconds
   expr      min        lq     mean   median       uq      max neval cld
 subset 1187.054 1207.5800 1293.718 1216.671 1257.725 2574.392   100   b
 filter  968.586  985.4475 1056.686 1023.862 1036.765 2489.644   100  a 

# 153,000 rows
air <- lapply(1:1000, function(x) airquality) %>% bind_rows

microbenchmark(
  df1<-subset(air, Temp>80 & Month > 5),
  df2<-filter(air, Temp>80 & Month > 5)
)

Unit: milliseconds
   expr       min        lq     mean    median        uq      max neval cld
 subset 11.841792 13.292618 16.21771 13.521935 13.867083 68.59659   100   b
 filter  5.046148  5.169164 10.27829  5.387484  6.738167 65.38937   100  a 
