How to efficiently read the first character from each line of a text file?


Problem description





I'd like to read only the first character from each line of a text file, ignoring the rest.

Here's an example file:

x <- c(
  "Afklgjsdf;bosfu09[45y94hn9igf",
  "Basfgsdbsfgn",
  "Cajvw58723895yubjsdw409t809t80",
  "Djakfl09w50968509",
  "E3434t"
)
writeLines(x, "test.txt")

I can solve the problem by reading everything with readLines and using substring to get the first character:

lines <- readLines("test.txt")
substring(lines, 1, 1)
## [1] "A" "B" "C" "D" "E"

This seems inefficient, though. Is there a way to persuade R to read only the first character of each line, rather than reading whole lines and discarding the rest?

I suspect that there ought to be some incantation using scan, but I can't find it. An alternative might be low level file manipulation (maybe with seek).


Since performance is only relevant for larger files, here's a bigger test file for benchmarking with:

set.seed(2015)
nch <- sample(1:100, 1e4, replace = TRUE)    
x2 <- vapply(
  nch, 
  function(nch)
  {
    paste0(
      sample(letters, nch, replace = TRUE), 
      collapse = ""
    )    
  },
  character(1)
)
writeLines(x2, "bigtest.txt")


Update: It seems that you can't avoid scanning the whole file. The best speed gains seem to be using a faster alternative to readLines (Richard Scriven's stringi::stri_read_lines solution and Josh O'Brien's data.table::fread solution), or to treat the file as binary (Martin Morgan's readBin solution).
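The binary approach is only named above, not shown. As a rough sketch of what treating the file as binary could look like (this is an illustration, not necessarily Martin Morgan's actual code), one can read the raw bytes, locate the newlines, and keep the byte that follows each one. The function name first_chars_bin is hypothetical, and the sketch assumes "\n" line endings, a single-byte first character on every line, and no empty lines.

```r
# Sketch only: assumes "\n"-terminated lines, single-byte first characters,
# and no empty lines. first_chars_bin is a hypothetical helper name.
first_chars_bin <- function(path) {
  n <- file.size(path)
  if (is.na(n) || n == 0) return(character(0))
  bytes <- readBin(path, what = "raw", n = n)   # whole file as raw bytes
  nl <- which(bytes == as.raw(10L))             # positions of "\n"
  starts <- c(1L, nl + 1L)                      # first byte of every line
  starts <- starts[starts <= length(bytes)]     # drop slot after a trailing newline
  rawToChar(bytes[starts], multiple = TRUE)     # one string per byte
}

# Small demonstration on a temporary copy of the example file
path <- tempfile(fileext = ".txt")
writeLines(c("Afklgjsdf;bosfu09[45y94hn9igf", "Basfgsdbsfgn",
             "Cajvw58723895yubjsdw409t809t80", "Djakfl09w50968509",
             "E3434t"), path)
first_chars_bin(path)
## [1] "A" "B" "C" "D" "E"
```

Note that this still reads every byte of the file; it only avoids building one R string per line.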

Solution

01/04/2015 Edited to bring the better solution to the top.


Update 2: Running scan() on an open connection, instead of opening and closing the file on every iteration, lets it read line by line and eliminates the loop. The timing improved quite a bit.

## scan() on open connection 
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)
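If memory rather than speed is the concern, the same open-connection idea can be combined with chunked reads so that only the first characters are kept between chunks. This is a sketch under assumptions: first_chars_chunked and chunk_size are hypothetical names, and no claim is made that this is faster than the scan() call above.

```r
# Sketch only: read the file in chunks from one open connection and keep
# just the first character of each line. Names here are hypothetical.
first_chars_chunked <- function(path, chunk_size = 1000L) {
  con <- file(path, "rt")
  on.exit(close(con))
  out <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size)   # continues where the last read stopped
    if (length(chunk) == 0L) break            # end of file
    out[[length(out) + 1L]] <- substr(chunk, 1L, 1L)
  }
  as.character(unlist(out, use.names = FALSE))
}

# Small demonstration, with a tiny chunk size to force several reads
path <- tempfile(fileext = ".txt")
writeLines(c("Afklgjsdf", "Basfgsdbsfgn", "Cajvw58723895",
             "Djakfl09w50968509", "E3434t"), path)
first_chars_chunked(path, chunk_size = 2L)
## [1] "A" "B" "C" "D" "E"
```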

I also discovered the stri_read_lines() function in the stringi package. Its help file says it's experimental at the moment, but it is very fast.

## stringi::stri_read_lines()
library(stringi)
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)

Here are the timings for these two methods.

## timings
library(microbenchmark)

microbenchmark(
    scan = {
        conn <- file("bigtest.txt", "rt")
        substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
        close(conn)
    },
    stringi = {
        stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
    }
)
# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval
#    scan 50.00170 50.10403 50.55055 50.18245 50.56112 54.64646   100
# stringi 13.67069 13.74270 14.20861 13.77733 13.86348 18.31421   100
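Before trusting a timing comparison, it can be worth confirming that the methods being compared return the same result. A minimal base-R sketch of such a check follows (the temporary file and variable names are made up for illustration). One caveat: scan() has blank.lines.skip = TRUE by default, so on a file containing empty lines it would return fewer elements than readLines() unless that argument is set to FALSE.

```r
# Sketch only: check that the readLines() and scan() approaches agree
# on a small file without empty lines.
path <- tempfile(fileext = ".txt")
writeLines(c("Afkl", "Basf", "Cajv"), path)

via_readLines <- substring(readLines(path), 1, 1)

conn <- file(path, "rt")
via_scan <- substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)

identical(via_readLines, via_scan)
## [1] TRUE
```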


Original [slower] answer:

You could try read.fwf() (fixed width file), setting the width to a single 1 to capture the first character on each line.

read.fwf("test.txt", 1, stringsAsFactors = FALSE)[[1L]]
# [1] "A" "B" "C" "D" "E"

Not fully tested of course, but works for the test file and is a nice function for getting substrings without having to read the entire file.


Update 1: read.fwf() is not very efficient, since it calls scan() and read.table() internally. We can skip the middlemen and try scan() directly.

lines <- count.fields("test.txt")   ## length is num of lines in file
skip <- seq_along(lines) - 1        ## set up the 'skip' arg for scan()
read <- function(n) {
    ch <- scan("test.txt", what = "", nlines = 1L, skip = n, quiet=TRUE)
    substr(ch, 1, 1)
}
vapply(skip, read, character(1L))
# [1] "A" "B" "C" "D" "E"


version$platform
# [1] "x86_64-pc-linux-gnu"
