Fast reading (by chunk?) and processing of a file with dummy lines at regular interval in R
Question
I have a file with regular numeric output (same format) of many arrays, each separated by a single line (containing some info). For example:
library(gdata)

nx = 150   # ncol of my arrays
ny = 130   # nrow of my arrays
myfile = 'bigFileWithRowsToSkip.txt'
niter = 10

for (i in 1:niter) {
  write(paste(i, 'is the current iteration'), myfile, append=T)
  z = matrix(runif(nx*ny), nrow = ny)  # random numbers with dim(ny, nx)
  write.fwf(z, myfile, append=T, rownames=F, colnames=F)  # write in fixed-width format
}
With nx = 5 and ny = 2, I would have a file like this:
# 1 is the current iteration
# 0.08051668 0.19546772 0.908230985 0.9920930408 0.386990316
# 0.57449532 0.21774728 0.273851698 0.8199024885 0.441359571
# 2 is the current iteration
# 0.655215475 0.41899060 0.84615044 0.03001664 0.47584591
# 0.131544592 0.93211342 0.68300161 0.70991368 0.18837031
# 3 is the current iteration
# ...
I want to read the successive arrays as fast as possible to put them in a single data.frame (in reality, I have thousands of them). What is the most efficient way to proceed?
Given the output is regular, I thought readr would be a good idea (?).
The only way I can think of is to do it manually by chunks in order to eliminate the useless info lines:
library(readr)

ztot = numeric(niter*nx*ny)  # allocate a vector with the final size
# (the arrays will be vectorized and successively appended to each other)

for (i in 1:niter) {
  nskip = (i-1)*(ny+1) + 1  # number of lines to skip, including the info lines
  z = read_table(myfile, skip = nskip, n_max = ny, col_names = F)
  z = as.vector(t(z))
  ifirst = (i-1)*ny*nx + 1  # appropriate starting index
  ztot[ifirst:(ifirst+nx*ny-1)] = z
}
# The arrays are actually spatial rasters. Compute the coordinates
# and put everything in a DF for future analysis:
x = rep(rep(seq_len(nx), ny), niter)
y = rep(rep(seq_len(ny), each = nx), niter)
myDF = data.frame(x = x, y = y, z = ztot)  # use the full vector ztot, not the last chunk z
But this is not fast enough. How can I achieve this faster?
Is there a way to read everything at once and delete the useless rows afterwards?
Alternatively, is there no reading function accepting a vector of precise locations as its skip argument, rather than a single number of initial rows?
PS: note that the reading operation is repeated on many files (with the same structure) in different directories, in case that affects the solution...
EDIT
The following solution (reading all lines with readLines, removing the undesirable ones, and then processing the rest) is a faster alternative when niter is very high:
bylines <- readLines(myfile)
dummylines = seq(1, by = (ny+1), length.out = niter)
bylines = bylines[-dummylines]  # remove the dummy, undesirable lines
asOneChar <- paste(bylines, collapse = '\n')  # then process the output from readLines

library(data.table)
ztot <- fread(asOneChar)  # fread parses the single string as literal data
ztot <- c(t(ztot))
Discussion on how to process the results from readLines can be found here.
Answer
Pre-processing the file with a command-line tool (i.e., not in R) is actually way faster. For example, with awk:
library(data.table)

tmpfile <- 'cleanFile.txt'
mycommand <- paste("awk '!/is the current iteration/'", myfile, '>', tmpfile)
# "awk '!/is the current iteration/' bigFileWithRowsToSkip.txt > cleanFile.txt"
system(mycommand)  # call the command from R
ztot <- fread(tmpfile)
ztot <- c(t(ztot))
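As a sanity check outside R, this shell sketch builds a tiny two-iteration file in the same layout (values and filenames here are hypothetical, with ny = 2) and applies the same awk pattern filter:

```shell
# Build a tiny two-iteration sample in the question's layout (ny = 2; values hypothetical)
printf '%s\n' '1 is the current iteration' '0.1 0.2' '0.3 0.4' \
              '2 is the current iteration' '0.5 0.6' '0.7 0.8' > sample.txt

# Same pattern filter as in the R system() call: keep only non-info lines
awk '!/is the current iteration/' sample.txt > clean.txt

cat clean.txt   # the 4 numeric rows, with the info lines removed
```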
Lines can be removed based on a pattern or on indices, for example. This was suggested by @Roland here.
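Filtering by indices rather than by a text pattern works too: the dummy lines recur with a fixed period of ny+1 lines, so awk's NR line counter can drop them without matching any text. A minimal sketch with ny = 2 (sample file and values are hypothetical):

```shell
# Same sample layout: with ny = 2, a dummy info line recurs every ny+1 = 3 lines
printf '%s\n' '1 is the current iteration' '0.1 0.2' '0.3 0.4' \
              '2 is the current iteration' '0.5 0.6' '0.7 0.8' > sample.txt

# Index-based filter: drop the first line of every 3-line block, no text matching needed
awk 'NR % 3 != 1' sample.txt > clean.txt
```

This variant is handy when the info lines have no stable text to match; the trade-off is that the block period (ny+1) must be known in advance.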