Convert R read.csv to a readLines batch?


Question

I have a fitted model that I'd like to apply to score a new dataset stored as a CSV. Unfortunately, the new data set is kind of large, and the predict procedure runs out of memory on it if I do it all at once. So, I'd like to convert the procedure that worked fine for small sets below, into a batch mode that processes 500 lines at a time, then outputs a file for each scored 500.

I understand from this answer (What is a good way to read line-by-line in R?) that I can use readLines for this. So, I'd be converting from:

trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)

newdata <- as.data.frame(read.csv('newstuff.csv'), stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=filename)

to something like:

trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)

con  <- file("newstuff.csv", open = "r")
i = 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
    i = i+1
        newdata <- as.data.frame(mylines, stringsAsFactors=F)
        preds <- predict(fit,newdata)
        write.csv(preds, file=paste(filename,i,'.csv',sep=''))
}
close(con)

However, when I print the mylines object inside the loop, it isn't split into columns the way read.csv output is: the headers are still a mess, and whatever wrapping happens under the hood to turn the vector into an ncol-wide object isn't happening.

Whenever I find myself writing barbaric things like cutting the first row and wrapping the columns, I generally suspect R has a better way to do things. Any suggestions for how I can get read.csv-like output from a readLines CSV connection?

Recommended answer

You can read your data into memory in chunks with read.csv by using its skip and nrows arguments. In pseudo-code:

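# `fit` and `filename` come from the question; `no_chunks` is a placeholder
# for ceiling(number_of_data_rows / chunk_size).
file <- "newstuff.csv"
chunk_size <- 500
header <- names(read.csv(file, nrows = 1))  # read the column names once

read_chunk <- function(start, n) {
  # Skip the header line plus `start` data rows, then reuse the saved names.
  read.csv(file, skip = start + 1, nrows = n, header = FALSE,
           col.names = header, stringsAsFactors = FALSE)
}

start_indices <- (0:(no_chunks - 1)) * chunk_size
for (i in seq_along(start_indices)) {
  dat  <- read_chunk(start_indices[i], chunk_size)
  pred <- predict(fit, dat)
  write.csv(pred, file = paste0(filename, i, ".csv"))
}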
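
If you would rather keep the readLines loop from the question, one option is to hand each batch of raw lines to read.csv through its text argument, which parses them the same way it would parse a file. The following is only a sketch under assumptions: the function name score_in_chunks and its arguments out_prefix and chunk_size are made up for illustration, the CSV is assumed to have a single header line, and no quoted field is assumed to span a line break.

```r
# Sketch: score a CSV in fixed-size batches using readLines() + read.csv(text = ...).
# score_in_chunks, out_prefix and chunk_size are hypothetical names.
score_in_chunks <- function(fit, path, out_prefix, chunk_size = 500) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # The first line holds the column names; split it once and reuse it.
  header <- scan(text = readLines(con, n = 1, warn = FALSE),
                 what = character(), sep = ",", quiet = TRUE)

  out_files <- character(0)
  i <- 0
  while (length(chunk <- readLines(con, n = chunk_size, warn = FALSE)) > 0) {
    i <- i + 1
    # Parse the raw lines into a data frame with the saved column names.
    newdata <- read.csv(text = chunk, header = FALSE, col.names = header,
                        stringsAsFactors = FALSE)
    preds <- predict(fit, newdata)
    out <- paste0(out_prefix, i, ".csv")
    write.csv(preds, file = out)
    out_files <- c(out_files, out)
  }
  out_files  # paths of the files that were written
}
```

Because the connection stays open between iterations, each readLines call continues where the previous one stopped, so the file is scanned only once instead of being re-read from the top for every skip offset.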

Alternatively, you could put the data into an SQLite database and use an SQLite package (RSQLite, for example) to query the data in chunks. See also this answer, or do some digging with [r] large csv on SO.
