How can I efficiently save and load a big list
Problem description
Disclaimer:
Several of you pointed to a duplicate post. I was aware of it, but I don't believe it is a fair duplicate, because some ways of saving and loading may differ between data frames and lists. For instance, the packages fst and feather work on data frames but not on lists.
My question is specific to lists.
I have a ~50M element list and I'd like to save it to a file to share it among different R sessions.
I know the native ways of saving in R (save, save.image, saveRDS). My question is: would you still use these functions on large-scale data?
What is the fastest way to save the list and read it back? (Any R-readable format would be fine.)
After some research, it appears that there is no real alternative to the base saveRDS function, and that not many packages deal with large lists.
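For reference, here is a minimal sketch of the base round trip discussed above, run on a small stand-in list rather than the real 50M-element one. The names l_small, path, and path_raw are made up for illustration. The compress = FALSE variant uses a documented argument of saveRDS that skips compression; whether it is actually faster on a list of this size has not been measured here.

```r
# Small stand-in for the large list (the real one has ~50M elements).
set.seed(1)
l_small <- lapply(1:1000, function(i) rnorm(sample(1:5, size = 1)))

# Default behaviour: the serialized object is gzip-compressed.
path <- tempfile(fileext = ".rds")
saveRDS(l_small, path)
l_back <- readRDS(path)

# Same round trip without compression (larger file, less CPU work).
path_raw <- tempfile(fileext = ".rds")
saveRDS(l_small, path_raw, compress = FALSE)
l_back_raw <- readRDS(path_raw)

# Both round trips return the list unchanged.
stopifnot(identical(l_small, l_back), identical(l_small, l_back_raw))
```

Either variant could be added as an extra expression in the benchmark to see how it compares on the full-size list.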
Saving a list as a column of a data.table/data.frame doesn't work with the packages fst and feather; it does work with the package data.table. However, when the file is read back, the column becomes a character vector, which forces the use of strsplit or its faster alternative str_split (from stringr) to rebuild the list.
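The round trip just described can be sketched on a tiny example. This is only an illustration, assuming the data.table package is installed; the id column and the names l_small, dt, and dt_back are invented for the sketch. It relies on fwrite joining the items of a list column with its sep2 separator, which is "|" by default.

```r
library(data.table)

# A tiny list column standing in for the real data.
l_small <- list(c(1.5, 2.25), 3, c(4, 5.5, 6))
dt <- data.table(id = 1:3, l = l_small)

# fwrite joins the items of each list cell with "|" (its default sep2).
path <- tempfile(fileext = ".csv")
fwrite(dt, path)

# Read back: column l is now a plain character column such as "1.5|2.25".
dt_back <- fread(path, sep = ",")
stopifnot(is.character(dt_back$l))

# Rebuild the list by splitting on the literal "|" and converting to numeric.
l_back <- lapply(strsplit(dt_back$l, "|", fixed = TRUE), as.numeric)
stopifnot(isTRUE(all.equal(l_back, l_small)))
```

The split-and-convert step is the extra cost that the "data.table" entry in the benchmark pays on top of the file I/O itself.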
The only package directly focused on lists that I could find was rlist; however, it does not speed up reading or writing a list from/to a file compared to the base functions saveRDS and readRDS.
Benchmarks:
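library(data.table)  # fwrite(), fread()
library(stringr)     # str_split()
library(rlist)       # list.save(), list.load()

# 10 million numeric vectors, each of random length 1 to 5
l <- lapply(1:10000000, function(x) rnorm(sample(1:5, size = 1, replace = TRUE)))
dt_l <- data.table(l = as.list(l))

microbenchmark::microbenchmark(times = 5L,
  "data.table" = {
    fwrite(dt_l, "dt_l.csv")
    dt_l <- fread("dt_l.csv", sep = ",", sep2 = "\\|")
    # the list column comes back as character, so split it again
    l_load <- str_split(dt_l$l, "\\|")
  },
  "rlist" = {
    list.save(l, "l.rds")
    l_load <- list.load("l.rds")
  },
  "RDS_base" = {
    saveRDS(l, "l.rds")
    l_load <- readRDS("l.rds")
  }
)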
Unit: seconds
expr           min      lq       mean     median   uq       max      neval
data.table     18.30548 18.67964 18.98801 19.17744 19.19791 19.57956 5
RDS_list.save  16.80936 16.81615 16.86114 16.84012 16.91770 16.92236 5
RDS_base       16.90403 17.23784 18.62475 19.48391 19.60365 19.89431 5