Fast reading (by chunk?) and processing of a file with dummy lines at regular interval in R
Question
I have a file with regular numeric output (same format) of many arrays, each separated by a single line (containing some info). For example:
library(gdata)

nx = 150   # ncol of my arrays
ny = 130   # nrow of my arrays
myfile = 'bigFileWithRowsToSkip.txt'
niter = 10

for (i in 1:niter) {
  write(paste(i, 'is the current iteration'), myfile, append=T)
  z = matrix(runif(nx*ny), nrow = ny)  # random numbers with dim(ny, nx)
  write.fwf(z, myfile, append=T, rownames=F, colnames=F)  # write in fixed-width format
}
With nx = 5 and ny = 2, I would have a file like this:
# 1 is the current iteration
# 0.08051668 0.19546772 0.908230985 0.9920930408 0.386990316
# 0.57449532 0.21774728 0.273851698 0.8199024885 0.441359571
# 2 is the current iteration
# 0.655215475 0.41899060 0.84615044 0.03001664 0.47584591
# 0.131544592 0.93211342 0.68300161 0.70991368 0.18837031
# 3 is the current iteration
# ...
I want to read the successive arrays as fast as possible to put them in a single data.frame (in reality, I have thousands of them). What is the most efficient way to proceed?
Given the output is regular, I thought readr would be a good idea (?).
The only way I can think of is to do it manually by chunks in order to eliminate the useless info lines:
library(readr)

ztot = numeric(niter*nx*ny)  # allocate a vector with the final size
# (the arrays will be vectorized and successively appended to each other)

for (i in 1:niter) {
  nskip = (i-1)*(ny+1) + 1  # number of lines to skip, including the info lines
  z = read_table(myfile, skip = nskip, n_max = ny, col_names = F)
  z = as.vector(t(z))
  ifirst = (i-1)*ny*nx + 1  # appropriate starting index
  ztot[ifirst:(ifirst+nx*ny-1)] = z
}
# The arrays are actually spatial rasters. Compute the coordinates
# and put everything in a DF for future analysis:
x = rep(rep(seq_len(nx), ny), niter)
y = rep(rep(seq_len(ny), each = nx), niter)
myDF = data.frame(x = x, y = y, z = ztot)  # use the full vector ztot, not the last chunk z
But this is not fast enough. How can I achieve this faster?
Is there a way to read everything at once and delete the useless rows afterwards?
Alternatively, is there no reading function accepting a vector of precise locations as its skip argument, rather than a single number of initial rows?
PS: note that the reading operation is repeated on many files (with the same structure) in different directories, in case that affects the solution...
EDIT
The following solution (reading all lines with readLines, removing the undesirable ones, and then processing the rest) is a faster alternative when niter is very high:
bylines <- readLines(myfile)
dummylines = seq(1, by = (ny+1), length.out = niter)
bylines = bylines[-dummylines]  # remove the dummy, undesirable lines
asOneChar <- paste(bylines, collapse = '\n')  # then process the output from readLines

library(data.table)
ztot <- fread(asOneChar)  # fread parses the single string as literal data
ztot <- c(t(ztot))
Discussion on how to process the results from readLines can be found here.
Answer
Pre-processing the file with a command-line tool (i.e., not in R) is actually way faster. For example, with awk:
library(data.table)

tmpfile <- 'cleanFile.txt'
mycommand <- paste("awk '!/is the current iteration/'", myfile, '>', tmpfile)
# "awk '!/is the current iteration/' bigFileWithRowsToSkip.txt > cleanFile.txt"
system(mycommand)  # call the command from R
ztot <- fread(tmpfile)
ztot <- c(t(ztot))
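As a sanity check outside R, this shell sketch builds a tiny two-iteration file in the same layout (values and filenames here are hypothetical, with ny = 2) and applies the same awk pattern filter:

```shell
# Build a tiny two-iteration sample in the question's layout (ny = 2; values hypothetical)
printf '%s\n' '1 is the current iteration' '0.1 0.2' '0.3 0.4' \
              '2 is the current iteration' '0.5 0.6' '0.7 0.8' > sample.txt

# Same pattern filter as in the R system() call: keep only non-info lines
awk '!/is the current iteration/' sample.txt > clean.txt

cat clean.txt   # the 4 numeric rows, with the info lines removed
```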
Lines can be removed based on a pattern or on indices, for example. This was suggested by @Roland here.
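Filtering by indices rather than by a text pattern works too: the dummy lines recur with a fixed period of ny+1 lines, so awk's NR line counter can drop them without matching any text. A minimal sketch with ny = 2 (sample file and values are hypothetical):

```shell
# Same sample layout: with ny = 2, a dummy info line recurs every ny+1 = 3 lines
printf '%s\n' '1 is the current iteration' '0.1 0.2' '0.3 0.4' \
              '2 is the current iteration' '0.5 0.6' '0.7 0.8' > sample.txt

# Index-based filter: drop the first line of every 3-line block, no text matching needed
awk 'NR % 3 != 1' sample.txt > clean.txt
```

This variant is handy when the info lines have no stable text to match; the trade-off is that the block period (ny+1) must be known in advance.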