Why does reassigning a new name to a dataframe in dplyr make it faster?


Question

I was unhappy with the time dplyr and data.table were taking to create a new variable on my data.frame and decided to compare methods.

To my surprise, assigning the result of dplyr::mutate() to a new data.frame seems to be faster than not doing so.

Why is this the case?

library(data.table)
library(tidyverse)


dt <- fread(".... data.csv") #load 200MB datafile

dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)

a <- Sys.time()
dt1[, MONTH := month(as.Date(DATE))]
b <- Sys.time(); datatabletook <- b-a

c <- Sys.time()
dt_dplyr <- dt2 %>%
  mutate(MONTH = month(as.Date(DATE)))
d <- Sys.time(); dplyr_reassign_took <- d - c 

e <- Sys.time()
dt3 %>%
  mutate(MONTH = month(as.Date(DATE)))
f <- Sys.time(); dplyrtook <- f - e

# Measured timings:
# datatabletook        = 17 sec
# dplyrtook            = 47 sec
# dplyr_reassign_took  = 17 sec


Recommended answer

There are a couple of ways to benchmark in base R:

.t0 <- Sys.time()
    ...
.t1 <- Sys.time()
.t1 - .t0

 # or

 system.time({
     ...
 })

With the Sys.time way, you're sending each line to the console and may see a return value printed for each line, as @Axeman suggested. With {...}, there is only one return value (the last result inside the braces), and system.time will suppress it from printing.

If the printing is costly enough but is not part of what you want to measure, it can make a difference.
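The effect of printing on a timing can be sketched as follows. This is only an illustration under stated assumptions: `df`, `n`, and `add_month` are hypothetical stand-ins (a small synthetic data frame and a base R month extraction) rather than the asker's 200MB file and `mutate()` call, and `capture.output(print(...))` is used to pay the formatting cost of printing without flooding the console.

```r
# Synthetic stand-in for the data file; `df` and `n` are hypothetical.
n <- 20000
df <- data.frame(DATE = rep("2020-03-15", n), stringsAsFactors = FALSE)

# Base R stand-in for the mutate() call in the question.
add_month <- function(d) {
  d$MONTH <- as.integer(format(as.Date(d$DATE), "%m"))
  d
}

# Assigned: the result is stored and nothing is printed.
t_assign <- system.time({
  res <- add_month(df)
})

# Printed: an explicit print() mimics what the console does with an
# unassigned result; capture.output() keeps the text off the screen.
t_print <- system.time({
  invisible(capture.output(print(add_month(df))))
})

t_assign[["elapsed"]]
t_print[["elapsed"]]
```

On most machines the second timing should come out larger, since formatting every row for display is extra work on top of the computation itself; the exact numbers will vary.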

There are good reasons to prefer system.time over Sys.time for benchmarking; from @MattDowle's comment:


i) it does a gc first, excluded from the timing, to isolate the measurement from random gc's, and

ii) it includes user and sys time as well as elapsed wall clock time.

The Sys.time() way will be affected by reading your email in Chrome or using Excel while the test runs; the system.time() way won't, so long as you use the user and sys parts of the result.
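As a rough sketch of how the user and sys parts can be read off, the snippet below times a hypothetical workload (sorting random numbers, chosen only for illustration) and pulls the components out of the `proc_time` object that system.time returns.

```r
# Hypothetical workload: any expression can go inside the braces.
timing <- system.time({
  x <- sort(runif(1e6))
})

# system.time() returns a named numeric vector of class "proc_time".
cpu_time  <- timing[["user.self"]] + timing[["sys.self"]]  # CPU time of this R process
wall_time <- timing[["elapsed"]]                            # wall-clock time

cpu_time
wall_time
```

Comparing `cpu_time` across methods, rather than `wall_time`, is what keeps other programs running on the machine from distorting the comparison.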
