What is the fastest way and fastest format for loading large data sets into R


Question


I have a large dataset (about 13GB uncompressed) and I need to load it repeatedly. The first load (and save to a different format) can be very slow but every load after this should be as fast as possible. What is the fastest way and fastest format from which to load a data set?

I suspect that the optimal choice would be:

 saveRDS(obj, file = 'bigdata.Rda', compress = FALSE)
 obj <- readRDS('bigdata.Rda')


But this seems slower than using the fread function in the data.table package. That should not be the case, because fread has to convert the file from CSV (although it is admittedly highly optimized).


Some timings for a ~800MB dataset are:

> system.time(tmp <- fread("data.csv"))
Read 6135344 rows and 22 (of 22) columns from 0.795 GB file in 00:00:43
     user  system elapsed 
     36.94    0.44   42.71 
> system.time(saveRDS(tmp, file = 'tmp.Rda'))
> system.time(tmp <- readRDS('tmp.Rda'))
     user  system elapsed 
     69.96    2.02   84.04





Previous Questions

This question is related but does not reflect the current state of R; for example, an answer there suggests that reading from a binary format will always be faster than from a text format. The suggestion to use *SQL is also not helpful in my case, as the entire data set is required, not just a subset of it.


There are also related questions about the fastest way of loading data once (e.g.: 1).

Answer


It depends on what you plan to do with the data. If you want the entire data set in memory for some operation, then I guess your best bet is fread or readRDS (the file size for data saved in RDS is much smaller, if that matters to you).
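A minimal sketch of that comparison (assuming a file data.csv exists; file and object names are illustrative, and timings will vary by machine and disk):

```r
library(data.table)

# One-time conversion: read the CSV once, then save it as an
# uncompressed RDS file (compress = FALSE trades disk space for speed).
tmp <- fread("data.csv")
saveRDS(tmp, file = "tmp.rds", compress = FALSE)

# On subsequent loads, time the two formats against each other.
system.time(from_csv <- fread("data.csv"))
system.time(from_rds <- readRDS("tmp.rds"))
```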


If you will be doing summary operations on the data, I have found a one-time conversion to a database (using sqldf) a much better option, as subsequent operations are much faster when run as SQL queries on the data, but that is also because I don't have enough RAM to load a 13 GB file in memory.
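A sketch of that one-time conversion, using DBI/RSQLite (the SQLite backend that sqldf builds on); the database path, table name, and the grouping column "grp" are illustrative assumptions:

```r
library(DBI)
library(RSQLite)
library(data.table)

# One-time conversion: load the CSV once and write it to a
# persistent SQLite database on disk.
con <- dbConnect(SQLite(), "bigdata.sqlite")
dt <- fread("data.csv")
dbWriteTable(con, "bigdata", dt)

# Later sessions can run SQL summaries against the database
# without reloading the full data set into memory.
res <- dbGetQuery(con, "SELECT grp, COUNT(*) AS n FROM bigdata GROUP BY grp")
dbDisconnect(con)
```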

