How to efficiently read the first character from each line of a text file?
Problem description
I'd like to read only the first character from each line of a text file, ignoring the rest.
Here's an example file:
x <- c(
"Afklgjsdf;bosfu09[45y94hn9igf",
"Basfgsdbsfgn",
"Cajvw58723895yubjsdw409t809t80",
"Djakfl09w50968509",
"E3434t"
)
writeLines(x, "test.txt")
I can solve the problem by reading everything with readLines and using substring to get the first character:
lines <- readLines("test.txt")
substring(lines, 1, 1)
## [1] "A" "B" "C" "D" "E"
This seems inefficient, though. Is there a way to persuade R to read only the first character of each line, rather than reading whole lines and discarding the rest?
I suspect that there ought to be some incantation using scan, but I can't find it. An alternative might be low-level file manipulation (maybe with seek).
Since performance is only relevant for larger files, here's a bigger test file for benchmarking:
set.seed(2015)
nch <- sample(1:100, 1e4, replace = TRUE)
x2 <- vapply(
nch,
function(nch)
{
paste0(
sample(letters, nch, replace = TRUE),
collapse = ""
)
},
character(1)
)
writeLines(x2, "bigtest.txt")
Update: It seems that you can't avoid scanning the whole file. The best speed gains seem to come from using a faster alternative to readLines (Richard Scriven's stringi::stri_read_lines solution and Josh O'Brien's data.table::fread solution), or from treating the file as binary (Martin Morgan's readBin solution).
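The readBin code itself is not reproduced on this page. As a rough illustration only, here is a minimal sketch of what a binary approach could look like. It is not Martin Morgan's actual code: the helper name first_chars_bin is hypothetical, and it assumes a single-byte encoding, "\n" line endings (a "\r\n" file also works, since each line still starts after the "\n"), and no empty lines (an empty line would yield the newline byte itself).

```r
# Hypothetical helper: read the file as raw bytes and pick the byte
# that starts each line. Assumes single-byte characters and no empty lines.
first_chars_bin <- function(path) {
  size  <- file.info(path)$size
  bytes <- readBin(path, what = "raw", n = size)

  # Positions of newline bytes (0x0a)
  nl <- which(bytes == as.raw(0x0a))

  # A line starts at byte 1 and at the byte after each newline;
  # drop any start position that falls past the end of the file
  starts <- c(1L, nl + 1L)
  starts <- starts[starts <= length(bytes)]

  # Convert each selected byte to a one-character string
  rawToChar(bytes[starts], multiple = TRUE)
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("Afklgjsdf", "Basfgsdbsfgn", "Cajvw58723895"), tmp)
first_chars_bin(tmp)
```

Note that this still reads every byte of the file, which is consistent with the observation above: the gain, if any, comes from skipping line-by-line string construction, not from reading less data.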
01/04/2015 Edited to bring the better solution to the top.
Update 2: Changing the scan() method to run on an open connection, instead of opening and closing the file on every iteration, makes it possible to read line by line and eliminates the loop. The timing improved quite a bit.
## scan() on open connection
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)
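As a quick sanity check (not part of the original answer), the scan() result can be compared against the readLines() baseline from the question on a small temporary file. The file contents and variable names below are made up for illustration.

```r
# Small temporary file with no empty lines
tmp <- tempfile(fileext = ".txt")
writeLines(c("Afklgjsdf", "Basfgsdbsfgn", "Djakfl09w5", "E3434t"), tmp)

# Baseline from the question: read whole lines, keep the first character
baseline <- substring(readLines(tmp), 1, 1)

# scan() on an open connection, one element per line
conn <- file(tmp, "rt")
via_scan <- substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)

# Both approaches should agree on this file
identical(baseline, via_scan)

unlink(tmp)
```

One caveat to keep in mind: scan() ignores blank lines by default (blank.lines.skip = TRUE), whereas readLines() returns an empty string for them, so the two results can differ in length on files that contain empty lines.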
I also discovered the stri_read_lines() function in the stringi package. Its help file says it's experimental at the moment, but it is very fast.
## stringi::stri_read_lines()
library(stringi)
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
Here are the timings for these two methods.
## timings
library(microbenchmark)
microbenchmark(
scan = {
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)
},
stringi = {
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# scan 50.00170 50.10403 50.55055 50.18245 50.56112 54.64646 100
# stringi 13.67069 13.74270 14.20861 13.77733 13.86348 18.31421 100
Original [slower] answer:
You could try read.fwf() (fixed width file), setting the width to 1 to capture the first character on each line.
read.fwf("test.txt", 1, stringsAsFactors = FALSE)[[1L]]
# [1] "A" "B" "C" "D" "E"
Not fully tested of course, but works for the test file and is a nice function for getting substrings without having to read the entire file.
Update 1: read.fwf() is not very efficient, calling scan() and read.table() internally. We can skip the middle-men and try scan() directly.
lines <- count.fields("test.txt") ## length is num of lines in file
skip <- seq_along(lines) - 1 ## set up the 'skip' arg for scan()
read <- function(n) {
ch <- scan("test.txt", what = "", nlines = 1L, skip = n, quiet=TRUE)
substr(ch, 1, 1)
}
vapply(skip, read, character(1L))
# [1] "A" "B" "C" "D" "E"
version$platform
# [1] "x86_64-pc-linux-gnu"