Why is it slower to prespecify type in a data.frame?

Question

I was preallocating a big data.frame to fill in later, which I normally do with NA's like this:

n <- 1e6
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)

and I wondered if it would make things any faster later if I specified data types up front, so I tested

f1 <- function() {
    a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
    a$c2 <- 1:n
    a$c3 <- sample(LETTERS, size= n, replace = TRUE)
}

f2 <- function() {
    b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
    b$c2 <- 1:n
    b$c3 <- sample(LETTERS, size= n, replace = TRUE)
}

> system.time(f1())
   user  system elapsed 
  0.219   0.042   0.260 
> system.time(f2())
   user  system elapsed 
  1.018   0.052   1.072 

So it was actually much slower! I tried again with a factor column too, and the difference was closer to 2x than 4x, but I'm curious about why this is slower, and wonder if it is ever appropriate to initialize with data types rather than NA's.

--

Edit: Flodel pointed out that 1:n is integer, not numeric. With that correction the runtimes are nearly identical; of course it hurts to incorrectly specify a data type and change it later!
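
To make the type mismatch from the edit concrete, here is a minimal sketch. It uses a much smaller size than the question so it runs instantly; the names n_small, b and b2 are illustrative and not from the original post.

```r
n_small <- 10L  # hypothetical small size, stands in for n <- 1e6

# 1:n is an integer vector, while numeric(n) allocates doubles
type_seq <- typeof(1:n_small)          # "integer"
type_num <- typeof(numeric(n_small))   # "double"

# Preallocating with numeric() and then assigning 1:n swaps the column type
b <- data.frame(c1 = 1:n_small, c2 = numeric(n_small))
class_before <- class(b$c2)            # "numeric"
b$c2 <- 1:n_small
class_after <- class(b$c2)             # "integer"

# Preallocating with the matching type keeps the column type stable
b2 <- data.frame(c1 = 1:n_small, c2 = integer(n_small))
b2$c2 <- 1:n_small
class_matched <- class(b2$c2)          # "integer"
```

If preallocation is wanted at all, matching the placeholder type to the data (integer(n) for 1:n) avoids specifying one type and then changing it.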

Solution

Assigning any data to a large data frame takes time. If you're going to assign your data all at once in a vector (as you should), it's much faster not to assign the c2 and c3 columns in the original definition at all. For example:

f3 <- function() {
    c <- data.frame(c1 = 1:n)
    c$c2 <- 1:n
    c$c3 <- sample(LETTERS, size= n, replace = TRUE)
}

print(system.time(f1()))
#   user  system elapsed 
#  0.194   0.023   0.216 
print(system.time(f2()))
#   user  system elapsed 
#  0.336   0.037   0.374 
print(system.time(f3()))
#   user  system elapsed 
#  0.057   0.007   0.063 

The reason for this is that when you preassign, a column of length n is created up front, only to be overwritten when you assign the real data. For example:

str(data.frame(x=1:2, y = character(2)))
## 'data.frame':    2 obs. of  2 variables:
## $ x: int  1 2
## $ y: Factor w/ 1 level "": 1 1

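Note that the character column has been converted to a factor, which is slower than keeping it as character by setting stringsAsFactors = FALSE. (The output above reflects the default in R versions before 4.0.0; from R 4.0.0 onward the default is stringsAsFactors = FALSE.)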
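
As a small sketch of the factor conversion mentioned above, the following passes stringsAsFactors explicitly so the result does not depend on the R version's default. The names d_fac and d_chr are illustrative only.

```r
# Explicitly request factor conversion (the default before R 4.0.0)
d_fac <- data.frame(x = 1:2, y = character(2), stringsAsFactors = TRUE)

# Keep the preallocated column as plain character
d_chr <- data.frame(x = 1:2, y = character(2), stringsAsFactors = FALSE)

class_fac <- class(d_fac$y)  # "factor"
class_chr <- class(d_chr$y)  # "character"
```

Passing the argument explicitly is a reasonable habit when the same script may run under different R versions.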
