How to efficiently read the first character from each line of a text file?

Views: 1203 Published: 2017/11/4 21:07:10 Tags: r file-io


Problem description


I'd like to read only the first character from each line of a text file, ignoring the rest.

Here's an example file:

    x <- c(
      "Afklgjsdf;bosfu09[45y94hn9igf",
      "Basfgsdbsfgn",
      "Cajvw58723895yubjsdw409t809t80",
      "Djakfl09w50968509",
      "E3434t"
    )
    writeLines(x, "test.txt")

I can solve the problem by reading everything with readLines and using substring to get the first character:

    lines <- readLines("test.txt")
    substring(lines, 1, 1)
    ## [1] "A" "B" "C" "D" "E"

This seems inefficient though. Is there a way to persuade R to only read the first characters, rather than having to discard them?

I suspect that there ought to be some incantation using scan, but I can't find it. An alternative might be low level file manipulation (maybe with seek).
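
For what it's worth, one low-level variant that stays in base R is to keep a connection open and read it in chunks, keeping only the first character of each line. This is only a sketch: `first_chars_chunked` and `chunk_size` are made-up names, and it still scans every byte of the file. It merely avoids holding all of the full lines in memory at once.

```r
# Hypothetical helper (not from the original thread): read a file in chunks
# of lines from one open connection and keep only each line's first character.
first_chars_chunked <- function(path, chunk_size = 1000L) {
  con <- file(path, open = "rt")
  on.exit(close(con))            # close the connection even on error
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)  # continues where the last read stopped
    if (length(lines) == 0L) break           # end of file
    pieces[[length(pieces) + 1L]] <- substring(lines, 1, 1)
  }
  as.character(unlist(pieces))   # character(0) for an empty file
}

# Usage on a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Afklgjsdf", "Basfgsdbsfgn", "E3434t"), tmp)
first_chars_chunked(tmp, chunk_size = 2L)
```

A small `chunk_size` is used above only to exercise the loop; a larger value means fewer calls to readLines().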

Since performance is only relevant for larger files, here's a bigger test file for benchmarking with:

    set.seed(2015)
    nch <- sample(1:100, 1e4, replace = TRUE)
    x2 <- vapply(
      nch,
      function(nch) {
        paste0(
          sample(letters, nch, replace = TRUE),
          collapse = ""
        )
      },
      character(1)
    )
    writeLines(x2, "bigtest.txt")

Update: It seems that you can't avoid scanning the whole file. The best speed gains seem to be using a faster alternative to readLines (Richard Scriven's stringi::stri_read_lines solution and Josh O'Brien's data.table::fread solution), or to treat the file as binary (Martin Morgan's readBin solution).
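
To illustrate the "treat the file as binary" idea, here is a minimal sketch. It is not Martin Morgan's actual code: `first_chars_bin` is a hypothetical helper, and it assumes a single-byte encoding with "\n" or "\r\n" line endings.

```r
# Hypothetical helper: read the whole file as raw bytes and pick the byte
# that follows each newline. Assumes one byte per character.
first_chars_bin <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  if (length(bytes) == 0L) return(character(0))
  newline <- as.raw(10L)
  # A line starts at byte 1 or at the byte right after a newline.
  starts <- c(1L, which(bytes == newline) + 1L)
  # A trailing newline yields a start past the end of the file; drop it.
  starts <- starts[starts <= length(bytes)]
  first <- bytes[starts]
  out <- rawToChar(first, multiple = TRUE)
  # An empty line "starts" with its own line terminator; report it as "".
  out[first == newline | first == as.raw(13L)] <- ""
  out
}

# Usage on a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Afklgjsdf", "Basfgsdbsfgn", "E3434t"), tmp)
first_chars_bin(tmp)
## [1] "A" "B" "E"
```

This still reads every byte, but it skips the work of building one R string per line.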
Solution
01/04/2015: Edited to bring the better solution to the top.

Update 2: Changing the scan() method to run on an open connection, instead of opening and closing one on every iteration, allows reading line by line and eliminates the loop. The timing improved quite a bit.

    ## scan() on open connection
    conn <- file("bigtest.txt", "rt")
    substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
    close(conn)

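
One caveat worth checking (my own note, not part of the original answer): scan() skips blank lines by default (`blank.lines.skip = TRUE`), so on a file containing empty lines it returns fewer elements than readLines(). The benchmark file has no empty lines, so the timings are unaffected. A small sketch using a temporary file:

```r
# Temporary file with an empty second line (illustrative data only).
path <- tempfile(fileext = ".txt")
writeLines(c("Apple", "", "Cherry"), path)

# Baseline: readLines() keeps the empty line.
baseline <- substring(readLines(path), 1, 1)

# scan() on an open connection, as in the answer; the blank line is dropped.
conn <- file(path, "rt")
via_scan <- substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)

length(baseline)  # 3
length(via_scan)  # 2
```

If the positions of empty lines matter, readLines() or the stringi approach may be the safer choice.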
I also discovered the stri_read_lines() function in the stringi package. Its help file says it's experimental at the moment, but it is very fast.

    ## stringi::stri_read_lines()
    library(stringi)
    stri_sub(stri_read_lines("bigtest.txt"), 1, 1)

Here are the timings for these two methods.

    ## timings
    library(microbenchmark)

    microbenchmark(
      scan = {
        conn <- file("bigtest.txt", "rt")
        substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
        close(conn)
      },
      stringi = {
        stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
      }
    )
    # Unit: milliseconds
    #     expr      min       lq     mean   median       uq      max neval
    #     scan 50.00170 50.10403 50.55055 50.18245 50.56112 54.64646   100
    #  stringi 13.67069 13.74270 14.20861 13.77733 13.86348 18.31421   100

Original [slower] answer:

You could try read.fwf() (fixed width file), setting the width to 1 to capture the first character on each line.

    read.fwf("test.txt", 1, stringsAsFactors = FALSE)[[1L]]
    # [1] "A" "B" "C" "D" "E"

Not fully tested of course, but works for the test file and is a nice function for getting substrings without having to read the entire file.

Update 1: read.fwf() is not very efficient, since it calls scan() and read.table() internally. We can skip the middlemen and try scan() directly.

    lines <- count.fields("test.txt")   ## length is num of lines in file
    skip <- seq_along(lines) - 1        ## set up the 'skip' arg for scan()
    read <- function(n) {
      ch <- scan("test.txt", what = "", nlines = 1L, skip = n, quiet = TRUE)
      substr(ch, 1, 1)
    }
    vapply(skip, read, character(1L))
    # [1] "A" "B" "C" "D" "E"

    version$platform
    # [1] "x86_64-pc-linux-gnu"
