Challenge: recoding a data.frame() - make it faster
Question
Recoding is a common practice for survey data, but the most obvious routes take more time than they should.
The fastest code that accomplishes the same task on the sample data below, as timed by system.time()
on my machine, wins.
## Sample data
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)
dat <- as.data.frame(dat)
re.codes <- c("This","That","And","The","Other")
Code to optimize.
for(x in 1:ncol(dat)) {
dat[,x] <- factor(dat[,x], labels=re.codes)
}
Current system.time():
   user  system elapsed
   4.40    0.10    4.49
Hint: dat <- lapply(1:ncol(dat), function(x) dat[,x] <- factor(dat[,x], labels=re.codes))
is not any faster.
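For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch. It assumes a much smaller stand-in for the sample data (the row count `n` and the names `dat_small`, `dat_loop`, `dat_lapply`, `loop_time` and `lapply_time` are illustrative, not from the question), so the absolute numbers will not match the timing quoted above.

```r
# Smaller stand-in for the sample data so the sketch runs quickly.
# The size is an assumption; the question uses 250,000 rows and 36 columns.
n <- 1000
dat_small <- as.data.frame(cbind(rep(1:5, n), rep(5:1, n), rep(c(1, 2, 4, 5, 3), n)))
re.codes <- c("This", "That", "And", "The", "Other")

# Time the original loop. system.time() evaluates the braced expression
# once and returns a "proc_time" object (user, system, elapsed seconds).
dat_loop <- dat_small
loop_time <- system.time({
  for (x in 1:ncol(dat_loop)) {
    dat_loop[, x] <- factor(dat_loop[, x], labels = re.codes)
  }
})

# Time the lapply() variant from the hint on the same input.
lapply_time <- system.time({
  dat_lapply <- lapply(1:ncol(dat_small),
                       function(x) factor(dat_small[, x], labels = re.codes))
})

print(loop_time)
print(lapply_time)
```

Both variants call factor() once per column, so the sketch only measures the difference in how the columns are iterated and stored.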
Combining @DWin's answer with my answer to "Most efficient list to data.frame method?":
system.time({
dat3 <- list()
# define attributes once outside of loop
attrib <- list(class="factor", levels=re.codes)
for (i in names(dat)) { # loop over each column in 'dat'
dat3[[i]] <- as.integer(dat[[i]]) # convert column to integer
attributes(dat3[[i]]) <- attrib # assign factor attributes
}
# convert 'dat3' into a data.frame. We can do it like this because:
# 1) we know 'dat' and 'dat3' have the same number of rows and columns
# 2) we want 'dat3' to have the same colnames as 'dat'
# 3) we don't care if 'dat3' has different rownames than 'dat'
attributes(dat3) <- list(row.names=c(NA_integer_,nrow(dat)),
class="data.frame", names=names(dat))
})
identical(dat2, dat3) # 'dat2' is from @DWin's answer
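Since 'dat2' is not defined here, the shortcut can instead be checked against the original factor() loop on a tiny data set. The sketch below is illustrative only: `small`, `slow`, `fast` and the helper `recode_fast()` are hypothetical names, not part of either answer. It relies on the same assumption as the code above, namely that every column already holds exactly the integer codes 1..5, so as.integer() yields the codes that factor() would compute.

```r
# Tiny stand-in for the sample data: every column contains each of 1..5.
small <- as.data.frame(cbind(c(1, 2, 3, 4, 5, 1),
                             c(5, 4, 3, 2, 1, 5),
                             c(1, 2, 4, 5, 3, 1)))
re.codes <- c("This", "That", "And", "The", "Other")

# Reference result: the loop from the question.
slow <- small
for (x in 1:ncol(slow)) {
  slow[, x] <- factor(slow[, x], labels = re.codes)
}

# Shortcut: integer codes plus factor attributes, set directly.
# recode_fast() is a hypothetical wrapper around the code shown above.
recode_fast <- function(df, labels) {
  attrib <- list(class = "factor", levels = labels)
  out <- list()
  for (i in names(df)) {
    out[[i]] <- as.integer(df[[i]])   # values must already be codes 1..k
    attributes(out[[i]]) <- attrib    # turn the integer vector into a factor
  }
  # Build the data.frame by setting its attributes, skipping data.frame()
  attributes(out) <- list(row.names = c(NA_integer_, nrow(df)),
                          class = "data.frame", names = names(df))
  out
}

fast <- recode_fast(small, re.codes)
identical(slow, fast)
```

If a column contained values outside 1..5, or skipped one of them, as.integer() would no longer match the codes that factor() assigns, and the shortcut would silently produce wrong labels.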