R could not allocate memory on ff procedure. How come?

Question

I'm working on a 64-bit Windows Server 2008 machine with an Intel Xeon processor and 24 GB of RAM. I'm having trouble trying to read a particular TSV (tab-delimited) file of 11 GB (>24 million rows, 20 columns). My usual companion, read.table, has failed me. I'm currently trying the package ff, through this procedure:

> df <- read.delim.ffdf(file       = "data.tsv",
+                       header     = TRUE,
+                       VERBOSE    = TRUE,
+                       first.rows = 1e3,
+                       next.rows  = 1e6,
+                       na.strings = c("", NA),
+                       colClasses = c("NUMERO_PROCESSO" = "factor"))

Which works fine for about 6 million records, but then I get an error, as you can see:

read.table.ffdf 1..1000 (1000) csv-read=0.14sec ffdf-write=0.2sec
read.table.ffdf 1001..1001000 (1000000) csv-read=240.92sec ffdf-write=67.32sec
read.table.ffdf 1001001..2001000 (1000000) csv-read=179.15sec ffdf-write=94.13sec
read.table.ffdf 2001001..3001000 (1000000) csv-read=792.36sec ffdf-write=68.89sec
read.table.ffdf 3001001..4001000 (1000000) csv-read=192.57sec ffdf-write=83.26sec
read.table.ffdf 4001001..5001000 (1000000) csv-read=187.23sec ffdf-write=78.45sec
read.table.ffdf 5001001..6001000 (1000000) csv-read=193.91sec ffdf-write=94.01sec
read.table.ffdf 6001001..
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'

If I'm not mistaken, R is complaining of lack of memory to read the data, but wasn't the read...ffdf procedure supposed to circumvent heavy memory usage when reading data? What could I be doing wrong here?

Answer

(I realize this is an old question, but I had the same problem and spent two days looking for the solution. This seems as good a place as any to document what I eventually figured out for posterity.)

The problem isn't that you are running out of available memory. The problem is that you've hit the memory limit for a single string. From help('Memory-limits'):

There are also limits on individual objects. The storage space cannot exceed the address limit, and if you try to exceed that limit, the error message begins "cannot allocate vector of length". The number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9, which is also the limit on each dimension of an array.

In my case (and, it appears, in yours as well), I didn't bother to set the quote character, since I was dealing with tab-separated data and assumed it didn't matter. However, somewhere in the middle of the data set there was a string with an unmatched quote, and read.table happily ran right past the end of the line and on to the next, and the next, and the next... until it hit the limit on the size of a single string and blew up.
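If you want to confirm that this is what happened, one low-memory way is to scan the file for lines containing an odd number of double-quote characters. This is a sketch of my own, not part of the original answer; the function name and chunk size are invented:

```r
# Sketch: return the line numbers whose double-quote count is odd
# (a likely cause of runaway field parsing). Reads the file in
# chunks so an 11 GB file never has to fit in memory at once.
find_odd_quotes <- function(path, chunk = 1e6L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  offset <- 0L
  bad <- integer(0)
  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break
    # quotes per line = original length minus length with quotes stripped
    nq <- nchar(lines) - nchar(gsub('"', "", lines, fixed = TRUE))
    bad <- c(bad, offset + which(nq %% 2L == 1L))
    offset <- offset + length(lines)
  }
  bad
}
```

Running this on "data.tsv" should point you straight at the offending record(s).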

The solution was to explicitly set quote = "" in the argument list.
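Here is a minimal base-R demonstration of the failure mode and the fix. The file contents are invented for illustration; read.delim shows the same quote handling because read.delim.ffdf passes extra arguments through to the underlying read function:

```r
# A tiny tab-separated file with an unmatched double quote on one line.
tmp <- tempfile(fileext = ".tsv")
writeLines(c("x\ty",
             "1\ta",
             '2\tb" oops',   # unmatched quote: swallows following lines
             "3\tc"), tmp)

# Default quote = "\"" lets the stray quote merge rows (with a warning).
bad  <- suppressWarnings(read.delim(tmp))
# quote = "" treats quote characters as ordinary text: all rows survive.
good <- read.delim(tmp, quote = "")

nrow(bad)   # fewer than 3 (rows merged past the stray quote)
nrow(good)  # 3
```

The same quote = "" argument added to the read.delim.ffdf call in the question stops the runaway string before it can approach the 2^31 - 1 byte limit.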
