How can I read selected rows from a large file using the R "readLines" command and write them to a data frame?

Problem description

I am engaged in data cleaning. I have a function that identifies bad rows in a large input file (too big to read in one go, given my RAM size) and returns the row numbers of the bad rows as a vector badRows. This function seems to work.

I am now trying to read just the bad rows into a data frame, so far unsuccessfully.

My current approach is to use read.table on an open connection to my file, using a vector of the number of rows to skip between each row that is read. This number is zero for consecutive bad rows.

I compute skipVec as:

(badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers) - 1)])) - 1
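
For a concrete check of that formula, here is a minimal sketch with a made-up badRowNumbers vector (the values are assumptions for illustration only). It uses diff(), which should give the same gaps as the subtraction above:

```r
# Hypothetical bad-row numbers, for illustration only
badRowNumbers <- c(2, 3, 7, 10)

# Number of good rows to skip before each bad row
skipVec <- diff(c(0, badRowNumbers)) - 1

# The explicit subtraction from the text, for comparison
skipVecManual <- (badRowNumbers -
  c(0, badRowNumbers[1:(length(badRowNumbers) - 1)])) - 1

skipVec  # 1 0 3 2: consecutive bad rows (2 and 3) give a gap of zero
```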

But for the moment I am just handing my function a skipVec vector of all zeros.

If my logic is correct, this should return all the rows. It does not. Instead I get an error:


"Error in read.table(con, skip = pass, nrow = 1, header = TRUE, sep = "") : no lines available in input"

My current function is loosely based on a function by Miron Kursa ("mbq"), which I found here.

My question is somewhat duplicative of that one, but I assume his function works, so I have broken it somehow. I am still trying to understand the difference between opening a file and opening a connection to a file, and I suspect that the problem is there somewhere, or in my use of lapply.

I am running R 3.0.1 under RStudio 0.97.551 on a cranky old Windows XP SP3 machine with 3gig of ram. Stone Age, I know.

Here is the code that produces the error message above:

# Make a small test data frame, write it to a file, and read it back in 
# a row at a time.
testThis.DF <- data.frame(nnn=c(2,3,5), fff=c("aa", "bb", "cc"))  
testThis.DF 

# This function will work only if the number of bad rows is not too big for memory
write.table(testThis.DF, "testThis.DF")
con<-file("testThis.DF")
open(con)
skipVec <- c(0,0,0)
badRows.DF  <- lapply(skipVec, FUN=function(pass){
  read.table(con, skip=pass, nrow=1, header=TRUE, sep="") })
close(con)

The error occurs before the close command. If I pull the read.table call out of the lapply and the function and just run it by itself, I still get the same error.

Recommended answer

If instead of running read.table through lapply you just run the first few iterations manually, you will see what is going on:

> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  nnn fff
1   2  aa
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
  X2 X3 bb
1  3  5 cc

Because header = TRUE, each iteration reads not one line but two (one treated as the header, one as data), so you run out of lines sooner than you expect, here on the third iteration:

> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
Error in read.table(con, skip = 0, nrow = 1, header = TRUE, sep = "") : 
  no lines available in input
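
The line count behind this can be checked with a small sketch. It assumes the same three-row test data as in the question, written to a temporary file; the names tmp, nLines and linesPerCall are only illustrative:

```r
# Recreate the question's test data in a temporary file (assumed setup)
tmp <- tempfile()
write.table(data.frame(nnn = c(2, 3, 5), fff = c("aa", "bb", "cc")), tmp)

# The file holds 1 header line plus 3 data lines
nLines <- length(readLines(tmp))

# With header = TRUE and nrow = 1, each read.table call uses up 2 lines
linesPerCall <- 2

# Number of calls that can succeed before the input is exhausted
callsThatSucceed <- nLines %/% linesPerCall
callsThatSucceed
```

With four lines in the file and two lines used per call, only two calls can succeed, which matches the error appearing on the third.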

Now this might still not be a very efficient way of solving your problem, but this is how you can fix your current code:

write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
skipVec <- c(0,1,0)
badRows <- lapply(skipVec, function(pass){
  line <- read.table(con, nrow = 1, header = FALSE, sep = "",
                     row.names = 1)
  if (pass) NULL else line
  })
badRows.DF <- setNames(do.call(rbind, badRows), header)
close(con)
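
Note that in this version skipVec holds one flag per data line (1 means read the line and discard it, 0 means keep it), rather than a count of lines to skip. A minimal sketch of building such a vector from bad row numbers, assuming a hypothetical badRowNumbers of c(1, 3):

```r
# Hypothetical bad-row numbers (positions among the data lines)
badRowNumbers <- c(1, 3)

# One flag per data line up to the last bad row:
# 0 = keep this line, 1 = read it and throw it away
skipVec <- as.integer(!(seq_len(max(badRowNumbers)) %in% badRowNumbers))
skipVec  # 0 1 0
```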

Some pointers for improving speed:


  1. Use scan instead of read.table. Read the data as character and only at the end, after you have put your data into a character matrix or data.frame, apply type.convert to each column.
  2. Instead of looping over skipVec, loop over its run-length encoding (rle) if that is much shorter. That way you can read or skip whole chunks of lines at a time.
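
Both suggestions can be combined roughly as follows. This is only a sketch under stated assumptions: the file is the small test file from the question, isBad is a hypothetical logical vector with one entry per data line (TRUE for lines to keep), every line has the same number of whitespace-separated fields, and readLines is used here to pull in or skip each run of lines.

```r
# Small test file like the one in the question (header + 3 data rows)
path <- tempfile()
write.table(data.frame(nnn = c(2, 3, 5), fff = c("aa", "bb", "cc")), path)

# Hypothetical flags: TRUE marks a data line to keep (lines 1 and 3 here)
isBad <- c(TRUE, FALSE, TRUE)
runs <- rle(isBad)

con <- file(path, open = "r")
headerLine <- readLines(con, n = 1)
kept <- character(0)
for (i in seq_along(runs$lengths)) {
  # Read a whole run at once; keep it only if the run is flagged TRUE
  chunk <- readLines(con, n = runs$lengths[i])
  if (runs$values[i]) kept <- c(kept, chunk)
}
close(con)

# Split everything into character fields with scan
header <- scan(text = headerLine, what = character(), quiet = TRUE)
fields <- scan(text = kept, what = character(), quiet = TRUE)
mat <- matrix(fields, nrow = length(kept), byrow = TRUE)

# First field of each line is the row name written by write.table
badRows.DF <- as.data.frame(mat[, -1, drop = FALSE], stringsAsFactors = FALSE)
names(badRows.DF) <- header
rownames(badRows.DF) <- mat[, 1]

# Convert column types only at the very end
badRows.DF[] <- lapply(badRows.DF, type.convert, as.is = TRUE)
badRows.DF
```

Looping over the runs means the number of read calls depends on how many blocks of consecutive kept or skipped lines there are, not on the total number of lines.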
