Why is it slower to prespecify type in a data.frame?
Problem description
I was preallocating a big data.frame to fill in later, which I normally do with NA's like this:

n <- 1e6
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)

and I wondered if it would make things any faster later if I specified data types up front, so I tested

f1 <- function() {
  a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
  a$c2 <- 1:n
  a$c3 <- sample(LETTERS, size = n, replace = TRUE)
}

f2 <- function() {
  b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
  b$c2 <- 1:n
  b$c3 <- sample(LETTERS, size = n, replace = TRUE)
}

> system.time(f1())
   user  system elapsed
  0.219   0.042   0.260
> system.time(f2())
   user  system elapsed
  1.018   0.052   1.072

So it was actually much slower! I tried again with a factor column too, and the difference wasn't closer to 2x than 4x, but I'm curious about why this is slower, and wonder if it is ever appropriate to initialize with data types rather than NA's.

--
Edit: Flodel pointed out that 1:n is integer, not numeric. With that correction the runtimes are nearly identical; of course it hurts to incorrectly specify a data type and change it later!
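To make the point of the edit concrete, here is a minimal sketch (using a small n purely for illustration) of how a column preallocated with numeric(n) changes type once 1:n is assigned to it, whereas integer(n) already matches:

```r
n <- 10  # small illustrative size, not the 1e6 used in the question

# 1:n is an integer vector, while numeric(n) is a double vector
class(1:n)         # "integer"
class(numeric(n))  # "numeric"

# Preallocating c2 as numeric and then assigning 1:n replaces the column
# with one of a different type
b <- data.frame(c1 = 1:n, c2 = numeric(n))
type_before <- class(b$c2)  # "numeric"
b$c2 <- 1:n
type_after <- class(b$c2)   # "integer"

# Preallocating with integer(n) matches the type of the data assigned later
d <- data.frame(c1 = 1:n, c2 = integer(n))
d$c2 <- 1:n
type_matched <- class(d$c2)  # "integer"
```

So if you do want to prespecify types, the placeholder has to be of the same type as the data you will eventually assign.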
Solution

Assigning any data to a large data frame takes time. If you're going to assign your data all at once in a vector (as you should), it's much faster not to assign the c2 and c3 columns in the original definition at all. For example:

f3 <- function() {
  c <- data.frame(c1 = 1:n)
  c$c2 <- 1:n
  c$c3 <- sample(LETTERS, size = n, replace = TRUE)
}

print(system.time(f1()))
#   user  system elapsed
#  0.194   0.023   0.216
print(system.time(f2()))
#   user  system elapsed
#  0.336   0.037   0.374
print(system.time(f3()))
#   user  system elapsed
#  0.057   0.007   0.063
The reason for this is that when you preassign, a column of length n is created. For example:

str(data.frame(x = 1:2, y = character(2)))
## 'data.frame': 2 obs. of 2 variables:
## $ x: int 1 2
## $ y: Factor w/ 1 level "": 1 1

Note that the character column has been converted to a factor, which will be slower than setting stringsAsFactors = FALSE.
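As a rough illustration of the factor conversion, the sketch below passes stringsAsFactors explicitly in both directions. The factor output in the answer reflects R versions where the argument defaulted to TRUE; from R 4.0.0 onward the default is FALSE, so spelling it out keeps the behaviour the same on either version. The variable names are only for this example.

```r
# Explicit TRUE reproduces the conversion shown in the answer
with_factor <- data.frame(x = 1:2, y = character(2), stringsAsFactors = TRUE)

# Explicit FALSE keeps the preallocated column as plain character
without_factor <- data.frame(x = 1:2, y = character(2), stringsAsFactors = FALSE)

class(with_factor$y)     # "factor"
class(without_factor$y)  # "character"

# The factor column carries a single level, the empty string
levels(with_factor$y)
```

Either way, the preallocated column is thrown away when the real data is assigned, which is why leaving c2 and c3 out of the initial data.frame() call avoids the extra work.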