What is the fastest way and fastest format for loading large data sets into R
Question
I have a large dataset (about 13GB uncompressed) and I need to load it repeatedly. The first load (and save to a different format) can be very slow but every load after this should be as fast as possible. What is the fastest way and fastest format from which to load a data set?
I suspect the best option is:
saveRDS(obj, file = 'bigdata.Rda', compress = FALSE)
obj <- readRDS('bigdata.Rda')
But this seems slower than using the fread function in the data.table package. This should not be the case, because fread converts a file from CSV (although it is admittedly highly optimized).
Some timings for a ~800MB dataset are:
> system.time(tmp <- fread("data.csv"))
Read 6135344 rows and 22 (of 22) columns from 0.795 GB file in 00:00:43
user system elapsed
36.94 0.44 42.71
> saveRDS(tmp, file = 'tmp.Rda')
> system.time(tmp <- readRDS('tmp.Rda'))
user system elapsed
69.96 2.02 84.04
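The readRDS timing above was taken with the default (compressed) save. The uncompressed variant proposed at the top of the question can be timed the same way; a minimal sketch (the filename 'tmp_uncompressed.Rda' is illustrative):

```r
# Save the freshly read table without compression, then time the reload.
# compress = FALSE produces a larger file on disk but skips decompression
# work on every subsequent readRDS call.
saveRDS(tmp, file = 'tmp_uncompressed.Rda', compress = FALSE)
system.time(tmp2 <- readRDS('tmp_uncompressed.Rda'))
```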
Previous Questions
This question is related but does not reflect the current state of R, for example an answer suggests reading from a binary format will always be faster than a text format. The suggestion to use *SQL is also not helpful in my case as the entire data set is required, not just a subset of it.
There are also related questions about the fastest way of loading data once (eg: 1).
Accepted Answer
It depends on what you plan on doing with the data. If you want the entire data set in memory for some operation, then I guess your best bet is fread or readRDS (the file size of data saved as RDS is much smaller, if that matters to you).
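If pure reload speed matters most, it may also be worth benchmarking a column-oriented serialization package against readRDS. A hedged sketch, assuming the third-party fst package is installed (this package is not mentioned in the original answer, and the filename is illustrative):

```r
library(fst)

# One-time conversion; compress = 0 trades disk space for read speed.
write_fst(tmp, 'data.fst', compress = 0)

# Subsequent loads; as.data.table = TRUE returns a data.table directly.
tmp <- read_fst('data.fst', as.data.table = TRUE)
```

As always, the only reliable answer is to time each candidate format on your own data and hardware.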
If you will be doing summary operations on the data, I have found a one-time conversion to a database (using sqldf) a much better option, as subsequent operations are much faster when run as SQL queries against the database. That is also because I don't have enough RAM to load a 13 GB file into memory.
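A minimal sketch of that one-time conversion, using the DBI/RSQLite interface that sqldf builds on (the table name big and the file names are illustrative, not from the original answer):

```r
library(DBI)
library(RSQLite)

# One-time conversion: import the CSV into a persistent SQLite file.
# RSQLite's dbWriteTable accepts a file path as the value argument.
con <- dbConnect(SQLite(), 'data.sqlite')
dbWriteTable(con, 'big', 'data.csv')
dbDisconnect(con)

# Later sessions: run summary queries without loading 13 GB into RAM.
con <- dbConnect(SQLite(), 'data.sqlite')
res <- dbGetQuery(con, 'SELECT COUNT(*) AS n FROM big')
dbDisconnect(con)
```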