Append multiple large data.tables; custom data coercion using colClasses and fread; named pipes


Question


[This is kind of multiple bug-reports/feature requests in one post, but they don't necessarily make sense in isolation. Apologies for the monster post in advance. Posting here as suggested by help(data.table). Also, I'm new to R; so apologies if I'm not following best practices in my code below. I'm trying.]

1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)

First I want to report that using rbindlist to append large data.tables causes R to segfault (Ubuntu 13.10, the packaged R version 3.0.1-3ubuntu1, data.table installed from within R from CRAN). The machine has 128 GiB of RAM, so I shouldn't be running out of memory given the size of the data.

My code:

append.tables <- function(files) {
    moves.by.year <- lapply(files, fread)
    move <- rbindlist(moves.by.year)
    rm(moves.by.year)
    move[,week_end := as.Date(as.character(week_end), format="%Y%m%d")]
    return(move)
}

Crash message:

 append.tables crashes with this:
> system.time(move <- append.tables(files))
 *** caught segfault ***
address 0x7f8e88dc1d10, cause 'memory not mapped'

Traceback:
 1: rbindlist(moves.by.year)
 2: append.tables(files)
 3: system.time(move <- append.tables(files))

There are 6 files, each about 8 GiB or 100 million lines long with 8 variables, tab separated.

2. Could fread accept multiple file names?

In any case, I think a better approach here would be to allow fread to take files as a vector of file names:

files <- c("my", "files", "to be", "appended")
dt <- fread(files)

Presumably fread could be much more memory efficient under the hood, since it would not have to keep all of these objects around at the same time, as appears to be necessary for a user of R.

3. colClasses gives an error message

My second problem is that I need to specify a custom coercion handler for one of my data types, but that fails:

dt <- fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) : 
  Column name 'myDate' in colClasses not found in data

Yes, in the case of dates, a simple:

    dt[,date := as.Date(as.character(date), format="%Y%m%d")]

works.

However, I have a different use case, which is to strip the decimal point from one of the data columns before it is converted from a character. Precision here is extremely important (thus our need for using the integer type), and coercing to an integer from the double type results in lost precision.

Now, I can get around this with some system() calls to append the files and pipe them through some sed magic (simplified here) (where tfile is another temporary file):

if (has_header) {
    tfile2 <- tempfile()
    system(paste("echo fakeline >>", tfile2))
    system(paste("head -q -n1", files[[1]], ">>", tfile2))
    system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
                 " | sed 's/\\.//' >>", tfile), wait=wait)
    unlink(tfile2)
} else {
    system(paste("cat", paste(files, collapse=" "), ">>", tfile), wait=wait)
}

but this involves an extra read/write cycle. I have 4 TiB of data to process, which is a LOT of extra reading and writing (no, not all into one data.table. About 1000 of them.)

4. fread thinks named pipes are empty files

I typically leave wait=TRUE. But I was trying to see if I could avoid the extra read/write cycle by making tfile a named pipe with system(paste("mkfifo", tfile)), setting wait=FALSE, and then running fread(tfile). However, fread complains about the pipe being an empty file:

system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
             " | sed 's/\\.//' >>", tfile), wait=FALSE)
move <- fread(tfile)
Error in fread(tfile) : File is empty: /tmp/RtmpbxNI1L/file78a678dc1999

In any case, this is a bit of a hack.

Simplified Code if I had my wish list

Ideally, I would be able to do something like this:

setClass("Int_Price")
setAs("character", "Int_Price",
    function (from) {
        return(as.integer(gsub("\\.", "", from)))
    }
)

dt <- fread(files, colClasses=list(price="Int_Price"))

And then I'd have a nice long data.table with properly coerced data.

Solution

Update: The rbindlist bug has been fixed in commit 1100 of v1.8.11. From NEWS:

o Fixed a rare segfault that occurred on >250m rows (integer overflow during memory allocation); closes #5305. Thanks to Guenter J. Hitsch for reporting.


As mentioned in comments, you're supposed to ask separate questions separately. But since they're such good points and linked together into the wish at the end, ok, will answer in one go.

1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)

Please run again, changing the line:

moves.by.year <- lapply(files, fread)

to

moves.by.year <- lapply(files, fread, verbose=TRUE)

and send me the output. I don't think it is the size of the files, but something about the type and contents. You're right that fread and rbindlist should have no issue loading the 48GB of data on your 128GB box. As you say, the lapply should return 48GB and then the rbindlist should create a new 48GB single table. This should work on your 128GB machine since 96GB < 128GB. 100 million rows * 6 is 600 million rows, which is well under the 2 billion row limit so should be fine (data.table hasn't caught up with long vector support in R3 yet, otherwise > 2^31 rows would be fine, too).
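The arithmetic in this paragraph can be checked directly in R. This is only a sketch: the variable names are illustrative, and the figure of 8 bytes per value is an assumption (it fits a double column, but the post does not say which types the 8 variables have). It also shows that a byte count can exceed a 32-bit integer well before the row count does, which is at least consistent with the integer overflow named in the NEWS entry, although the exact failing computation is not stated there.

```r
# Row count: 6 files of roughly 100 million rows each.
rows_per_file <- 1e8
n_files <- 6
total_rows <- rows_per_file * n_files

# R's integer type is 32-bit signed, so the largest value is 2^31 - 1.
max_int <- .Machine$integer.max

# The row count itself is within the limit.
rows_fit <- total_rows < max_int

# Assumption: 8 bytes per value (for example a double column).
bytes_per_column <- total_rows * 8

# The byte count for a single column no longer fits in a 32-bit integer.
bytes_overflow <- bytes_per_column > max_int

print(c(rows_fit = rows_fit, bytes_overflow = bytes_overflow))
```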

2. Could fread accept multiple file names?

Excellent idea. As you say, fread could then sweep through all 6 files detecting their types and counting the total number of rows, first. Then allocate once for the 600 million rows directly. This would save churning through 48GB of RAM needlessly. It might also detect any anomalies in the 5th or 6th file (say) before starting to read the first files, so would return quicker in the event of problems.

I'll file this as a feature request and post the link here.
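In the meantime, the read-then-bind pattern from the question is the straightforward way to combine several files. A minimal sketch on two small temporary files follows; the file contents and the units column are made up for illustration, and only week_end comes from the question.

```r
library(data.table)

# Hypothetical example data: two small tab-separated files with the same columns.
files <- c(tempfile(fileext = ".tsv"), tempfile(fileext = ".tsv"))
writeLines(c("week_end\tunits", "20130105\t3", "20130112\t4"), files[1])
writeLines(c("week_end\tunits", "20140104\t5", "20140111\t6"), files[2])

# Read each file into its own data.table, then bind them into one.
# Both the list and the combined table exist in memory until rm() runs.
tables <- lapply(files, fread)
combined <- rbindlist(tables)
rm(tables)

# Convert the yyyymmdd integer column to Date, as in the question.
combined[, week_end := as.Date(as.character(week_end), format = "%Y%m%d")]
print(combined)

unlink(files)
```

The temporary doubling of memory in this pattern is exactly what a multi-file fread could avoid.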

3. colClasses gives an error message

When colClasses is of type list, the type appears to the left of the = and a vector of column names or positions appears to the right. The idea is to be easier than colClasses in read.csv, which only accepts a vector; the list form saves repeating "character" over and over. I could have sworn this was better documented in ?fread but it seems not. I'll take a look at that.

So, instead of

fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) : 
    Column name 'myDate' in colClasses not found in data

the correct syntax is

fread(tfile, colClasses=list(myDate="date"))

Given what you go on to say in the question, if I understand correctly, you actually want:

fread(tfile, colClasses=list(character="date"))  # just fread accepts list

or

fread(tfile, colClasses=c("date"="character"))   # both read.csv and fread

Either of those should load the column called "date" as character so you can manipulate it before coercion. If it really is just dates, then I have yet to implement that coercion automatically. You mentioned the precision of numeric, so as a reminder, integer64 can be read directly by fread too.
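Applied to the price use case from the question, the two forms might look like the sketch below. The file contents are invented for illustration, and the price column name is taken from the question's wish-list code.

```r
library(data.table)

# Hypothetical example file with a date column and a price column.
tfile <- tempfile(fileext = ".tsv")
writeLines(c("date\tprice", "20130105\t12.34", "20130112\t5.00"), tfile)

# List form (fread only): type on the left, column names on the right.
dt <- fread(tfile, colClasses = list(character = "price"))

# Named-vector form (read.csv style): column name on the left, type on the right.
dt2 <- fread(tfile, colClasses = c(price = "character"))

# price arrives as text, so the decimal point can be stripped exactly
# before converting to integer, with no detour through double.
dt[, price := as.integer(gsub(".", "", price, fixed = TRUE))]
print(dt)

unlink(tfile)
```

Note that this only gives comparable integers when every price has the same number of decimal places; "5.0" and "5.00" would otherwise become 50 and 500.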

4. fread thinks named pipes are empty files

Hopefully this goes away now assuming the previous point is resolved? fread works by memory mapping its input. It can accept non-files such as http addresses and connections (tbc) and what it does first for convenience is to write the complete input to ramdisk so it can map the input from there. The reason fread is fast is hand in hand with seeing the entire input first.
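If a shell preprocessing step is still wanted, one option that avoids managing a named pipe by hand is to let fread run the command itself. The sketch below assumes a recent data.table release in which fread has a cmd argument, and a Unix-like system with sed on the PATH; neither is stated in the original post, and the example file is made up.

```r
library(data.table)

# Hypothetical example file with a date column and a price column.
tfile <- tempfile(fileext = ".tsv")
writeLines(c("date\tprice", "20130105\t12.34", "20130112\t5.00"), tfile)

# Assumption: fread's cmd argument (recent data.table versions) runs the
# shell command and reads its output. sed removes the first "." on each line.
dt <- fread(cmd = paste("sed 's/\\.//'", shQuote(tfile)))
print(dt)

unlink(tfile)
```

This still materialises the command output before parsing, in line with the point above that fread needs to see the entire input first.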
