How can I read selected rows from a large file using the R "readLines" command and write them to a data frame?
Problem description
I am engaged in data cleaning. I have a function that identifies bad rows in a large input file (too big to read in one go, given my RAM size) and returns the row numbers of the bad rows as a vector, badRows. This function seems to work.
I am now trying to read just the bad rows into a data frame, so far unsuccessfully.
My current approach is to use read.table on an open connection to my file, passing it a vector of the number of rows to skip before each row that is read. This number is zero for consecutive bad rows.
I calculate skipVec as:
(badRowNumbers - c(0, badRowNumbers[1:(length(badRowNumbers) - 1)])) - 1
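As a small illustration of that calculation, the same gaps can be obtained with diff. This is only a sketch: the badRowNumbers values below are made up for the example, and it assumes the row numbers are sorted and count data rows only (not the header line).

```r
# Hypothetical row numbers of the bad rows (illustrative values only)
badRowNumbers <- c(2, 3, 7, 10)

# Lines to skip before each bad row: distance to the previous bad row, minus one
skipVec <- diff(c(0, badRowNumbers)) - 1

# The longer form from the question gives the same vector
skipVecLong <- badRowNumbers - c(0, head(badRowNumbers, -1)) - 1
```

Consecutive bad rows (here rows 2 and 3) give a skip of zero, as described above.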
But for the moment I am just handing my function a skipVec vector of all zeros.
If my logic is correct, this should return all the rows. It does not. Instead I get an error:
Error in read.table(con, skip = pass, nrow = 1, header = TRUE, sep = "") : no lines available in input
My current function is loosely based on a function by Miron Kursa ("mbq"), which I found here.
My question is somewhat duplicative of that one, but I assume his function works, so I have broken it somehow. I am still trying to understand the difference between opening a file and opening a connection to a file, and I suspect that the problem is there somewhere, or in my use of lapply.
I am running R 3.0.1 under RStudio 0.97.551 on a cranky old Windows XP SP3 machine with 3 GB of RAM. Stone Age, I know.
Here is the code that produces the error message above:
# Make a small test data frame, write it to a file, and read it back in
# a row at a time.
testThis.DF <- data.frame(nnn=c(2,3,5), fff=c("aa", "bb", "cc"))
testThis.DF
# This function will work only if the number of bad rows is not too big for memory
write.table(testThis.DF, "testThis.DF")
con<-file("testThis.DF")
open(con)
skipVec <- c(0,0,0)
badRows.DF <- lapply(skipVec, FUN=function(pass){
  read.table(con, skip=pass, nrow=1, header=TRUE, sep="")
})
close(con)
The error occurs before the close command. If I yank the readLines command out of the lapply and the function and just stick it in by itself, I still get the same error.
Recommended answer
If, instead of running read.table through lapply, you just run the first few iterations manually, you will see what is going on:
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
nnn fff
1 2 aa
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
X2 X3 bb
1 3 5 cc
Because header = TRUE, each iteration reads not one line but two (one as the header, one as data), so you run out of lines sooner than you expect, here on the third iteration:
> read.table(con, skip=0, nrow=1, header=TRUE, sep="")
Error in read.table(con, skip = 0, nrow = 1, header = TRUE, sep = "") :
no lines available in input
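The underlying reason is that an open connection remembers its position, so every read continues where the previous one stopped. A minimal sketch of that behaviour with readLines, using a temporary file whose contents are made up for the example:

```r
# Write three lines to a temporary file (illustrative content only)
tmp <- tempfile()
writeLines(c("first", "second", "third"), tmp)

con <- file(tmp, open = "r")        # open once; the position is kept between reads
firstRead  <- readLines(con, n = 1) # consumes line 1
secondRead <- readLines(con, n = 2) # consumes lines 2 and 3
thirdRead  <- readLines(con, n = 1) # nothing left to read
close(con)
```

After the second call the connection is exhausted, so the third call returns an empty character vector; read.table in the same situation stops with the "no lines available in input" error instead.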
Now this might still not be a very efficient way of solving your problem, but this is how you can fix your current code:
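write.table(testThis.DF, "testThis.DF")
con <- file("testThis.DF")
open(con)
# Read the header line once, so every later read.table call sees data only
header <- scan(con, what = character(), nlines = 1, quiet = TRUE)
# Here skipVec flags each line: 0 = keep it, nonzero = read and discard it
skipVec <- c(0,1,0)
badRows <- lapply(skipVec, function(pass){
  line <- read.table(con, nrow = 1, header = FALSE, sep = "",
                     row.names = 1)
  if (pass) NULL else line
})
# rbind drops the NULL entries; setNames then restores the column names
badRows.DF <- setNames(do.call(rbind, badRows), header)
close(con)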
A few pointers for improving speed:
- Use scan instead of read.table. Read the data as character and only at the end, after you have put your data into a character matrix or data.frame, apply type.convert to each column.
- Instead of looping over skipVec, loop over its rle if it is much shorter. That way you can read or skip chunks of lines at a time.
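The second pointer can be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's code: readSelectedRows and its arguments are hypothetical names, the file is assumed to have been written by write.table with default settings (a header line, then a row name in the first field of each line), and readLines plus a single read.table(text = ...) call stands in for the scan and type.convert combination.

```r
# Hypothetical helper: read only the data rows listed in badRows, run by run
readSelectedRows <- function(path, badRows, totalRows = max(badRows)) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # First line holds the column names
  header <- scan(text = readLines(con, n = 1), what = character(), quiet = TRUE)

  # TRUE for data rows to keep, FALSE for rows to skip, compressed into runs
  keep <- seq_len(totalRows) %in% badRows
  runs <- rle(keep)

  kept <- character(0)
  for (i in seq_along(runs$lengths)) {
    lines <- readLines(con, n = runs$lengths[i])  # consume one whole run
    if (runs$values[i]) kept <- c(kept, lines)    # store it only if wanted
  }
  if (length(kept) == 0) return(NULL)

  # Parse all kept lines at once; the first field is the row name
  df <- read.table(text = kept, header = FALSE, sep = "", row.names = 1)
  setNames(df, header)
}

# Usage with the small test data frame from the question
testThis.DF <- data.frame(nnn = c(2, 3, 5), fff = c("aa", "bb", "cc"))
tmp <- tempfile()
write.table(testThis.DF, tmp)
badRows.DF <- readSelectedRows(tmp, badRows = c(1, 3))
```

One caveat: a very long run of skipped lines is still pulled into memory by a single readLines call before being discarded, so on a machine with little RAM it may be worth capping how many lines are consumed per call.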