Speed up RData load


Problem description

I've checked several related questions, such as this one:

How to quickly load data into R?

I'm quoting the specific part of the top-rated answer:


It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting speed here is the speed of your hard drive, not R. The binary form is the internal representation of the data frame in the workspace, so no transformation is needed anymore.
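To make the quoted claim concrete, here is a minimal, hypothetical sketch (the data frame and its size are made up for illustration) that times a text parse against a binary load of the same data:

n  <- 1e6
df <- data.frame(x = runif(n), y = sample(letters, n, replace = TRUE))
write.csv(df, "df.csv", row.names = FALSE)   # text representation
save(df, file = "df.RData")                  # binary internal representation
system.time(read.csv("df.csv"))   # parses text and rebuilds the object
system.time(load("df.RData"))     # only unserializes, no conversion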

I really thought that. However, life is about experimenting. I have a 1.22 GB file containing an igraph object. That said, I don't think what I found here is related to the object class, mainly because you can load('file.RData') even before you call library(igraph).
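A quick sketch of that last point: load() restores the object even when the package that defines its class is not attached; only the class's functions need igraph later (file name as above):

load('mygraph.RData')   # restores 'g' even though igraph is not attached yet
class(g)                # "igraph"
library(igraph)         # needed only once you call igraph functions
vcount(g)               # now functions such as vcount() work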

Disks in this server are pretty fast, as you can check in the time it takes to read the file into memory:

user@machine data$ pv mygraph.RData > /dev/null
1.22GB 0:00:03 [ 384MB/s] [==================================>] 100%
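If pv is not at hand, a rough R-side equivalent of this raw-read check is the following sketch, which times copying the bytes into memory without any unserialization:

sz <- file.size('mygraph.RData')
system.time(raw_bytes <- readBin('mygraph.RData', what = "raw", n = sz))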

However, when I load this data from R:

> system.time(load('mygraph.RData'))
   user  system   elapsed 
178.533  16.490   202.662

So it seems loading *.RData files is 60 times slower than the disk limit, which should mean R actually does something during load().

I've had the same feeling with different R versions on different hardware; it's just that this time I had the patience to benchmark it (mainly because with such fast disk storage, it was terrible how long the load actually takes).

Any ideas on how to overcome this?

Thoughts after the answer

save(g, file = "test.RData", compress = FALSE)

Now the file is 3.1GB, against 1.22GB before. In my case, loading uncompressed is a bit faster (the disk is not my bottleneck, by far):

> system.time(load('test.RData'))
user  system elapsed 
126.254   2.701 128.974 
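For context, save() also accepts "gzip", "bzip2" and "xz" besides compress = FALSE, so the size/speed trade-off can be benchmarked directly. A sketch, reusing g from above (note that "xz" will be very slow on an object this size):

for (cmp in list(FALSE, "gzip", "bzip2", "xz")) {
  f <- paste0("test_", as.character(cmp), ".RData")
  save(g, file = f, compress = cmp)
  cat(as.character(cmp), file.size(f), system.time(load(f))["elapsed"], "\n")
}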

Reading the uncompressed file into memory takes about 12 seconds, so I confirm most of the time is spent setting up the environment.

I'll be back with the RDS results; that sounds interesting.

Here we are, as promised:

system.time(saveRDS(g, file = "test2.RData", compress = FALSE))
user  system elapsed 
7.714   2.820  18.112 

And I get a 3.1GB file, just like save uncompressed, although the md5sum is different, probably because save also stores the object name.

Now reading it...

> system.time(a <- readRDS('test2.RData'))
user  system elapsed 
41.902   2.166  44.077 

So combining both ideas (uncompressed and RDS) runs 5 times faster. Thanks for your contributions!

Recommended answer

save compresses by default, so it takes extra time to uncompress the file. Then it takes a bit longer to load the larger file into memory. Your pv example is just copying the compressed data to memory, which isn't very useful to you. ;-)

Update:

I tested my theory and it was incorrect (at least on my Windows XP machine with a 3.3GHz CPU and 7200RPM HDD). Loading compressed files is faster (probably because it reduces disk I/O).

The extra time is spent in RestoreToEnv (in saveload.c) and/or R_Unserialize (in serialize.c). So you could make loading faster by changing those files, or maybe by using saveRDS to individually save the objects in myGraph.RData and then somehow using readRDS across multiple R processes to load the data into shared memory...

