Reading a CSV file, looping through the rows, using connections
Question
I have a large CSV file that my computer cannot open without RStudio terminating.
To solve this, I am trying to iterate through the rows of the file so that I can do my calculations on one row at a time, store the value, and then move on to the next row.
On a smaller file I would normally just read and store the whole CSV file in RStudio and run a simple for loop.
However, it is precisely this storage of the full data set that I am trying to avoid, so I am trying to read the CSV file one row at a time instead.
(I think this makes sense.)
This approach was suggested here.
I have managed to get my calculations to read and process the first row of my data file quickly.
It is the looping that I am struggling with. I am trying to use a for loop (perhaps I should be using a while/if statement instead), but I have nowhere to use the "i" value inside the loop. Part of my code is below:
con = file(FileName, "r")
for (row in 1:nrow(con)) {
data <- read.csv(con, nrow=1) #reading of file
"insert calculations here"
}
So "row" is never used inside the loop, and the loop only goes through once. I also have an issue with "1:nrow(con)", since nrow(con) simply returns NULL.
Any help with this would be great, thanks.
Recommended answer
read.csv() will generate an error if it tries to read past the end of the file. So you could do something like this:
con <- file(FileName, "rt")
repeat {
data <- try(read.csv(con, nrow = 1, header = FALSE), silent = TRUE) #reading of file
if (inherits(data, "try-error")) break
"insert calculations here"
}
close(con)
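To make the pattern above easier to try out, here is a minimal sketch that runs it against a small temporary file and stores one result per row, as the question asks. The file contents, the row sum used as the "calculation", and the names tmp and row_sums are all assumptions made up for illustration; it uses tryCatch() in place of try(), which is a stylistic swap and not part of the original answer.

```r
# Illustrative data only: a small headerless CSV in a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,2,3", "4,5,6", "7,8,9"), tmp)

con <- file(tmp, "rt")      # open once so each read continues where the last stopped
row_sums <- numeric(0)      # hypothetical per-row results

repeat {
  row <- tryCatch(
    read.csv(con, nrows = 1, header = FALSE),
    error = function(e) NULL   # read.csv() errors once no lines are left
  )
  if (is.null(row)) break
  # Placeholder calculation: sum of the values in this row
  row_sums <- c(row_sums, sum(row))
}

close(con)
unlink(tmp)
```

Because the connection is opened before the loop and kept open, each read.csv() call picks up at the next unread line, so no row counter is needed.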
It will be really slow going one line at a time, but you can do it in larger batches if your calculation code supports that. I'd also recommend specifying the column types using colClasses in the read.csv() call, so that R doesn't guess different types for different batches.
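As a rough sketch of the batching idea, the loop below reads several rows per call and fixes the column types with colClasses. The two-column test file, the batch size of 4, and the names batch_size and batch_totals are assumptions for illustration only; a real file would need its own column types and calculation.

```r
# Illustrative data only: 10 rows, 2 integer columns, no header
tmp <- tempfile(fileext = ".csv")
write.table(matrix(1:20, ncol = 2), tmp, sep = ",",
            row.names = FALSE, col.names = FALSE)

batch_size <- 4             # hypothetical batch size
batch_totals <- numeric(0)  # hypothetical per-batch results
con <- file(tmp, "rt")

repeat {
  batch <- tryCatch(
    read.csv(con, nrows = batch_size, header = FALSE,
             colClasses = c("integer", "integer")),  # fix types for every batch
    error = function(e) NULL                         # no lines left
  )
  # Also stop on an empty batch, in case no error is raised
  if (is.null(batch) || nrow(batch) == 0) break
  # Placeholder calculation over the whole batch (default names V1, V2)
  batch_totals <- c(batch_totals, sum(batch$V1 + batch$V2))
}

close(con)
unlink(tmp)
```

The last batch can hold fewer rows than batch_size, so the calculation should not assume a fixed number of rows.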
Edited to add:
We've been told that there are 3000 columns of integers in the dataset. The first row only has partial header information. This code can deal with that:
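n <- 1 # desired batch size
col.names <- paste0("C", 1:3000) # desired column names
con <- file(FileName, "rt")
invisible(readLines(con, 1)) # Skip over bad header row
repeat {
  data <- try(read.csv(con, nrows = n, header = FALSE,
                       col.names = col.names,
                       colClasses = "integer"),
              silent = TRUE) # read the next batch of n rows
  # Stop on an error or on an empty batch: with col.names supplied,
  # read.csv() may return a zero-row data frame at end of file instead of failing
  if (inherits(data, "try-error") || nrow(data) == 0) break
  "insert calculations here"
}
close(con)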