Parsing txt files and extracting information in R
Problem description
I need to extract information from text files whose structure varies between files. Whilst this can be done using a macro, the files are variable, so selecting by line number and by spacing within a line does not work for all files.
I was wondering if anyone could tell me whether there is a way of parsing txt files, searching by keyword, and extracting the information after the keyword? For example, given something like Flow Rate: 99.99, I would want to extract the 99.99. Another issue is that, using the Flow Rate example, Flow Rate would appear numerous times in each file. Is there a way to alias/index Flow Rate: so that I can select, say, the third occurrence?
Any hints or tips would be welcome. I know how to print the entire line when a keyword is identified, but not how to deal with multiple occurrences or how to select only the number after the keyword:
all_data = readLines("Unit 5 2013.txt")
hours_of_operation <- grep("Annual Hours of Operation: ",all_data)
all_data[hours_of_operation]
[1] " Annual Hours of Operation: 8760.0 hours/yr"
Thanks
J
Recommended answer
The following may help. I assume that you have read your text into one or more character vectors.
Data example
Note: If "Flow Rate" is capitalized in your files, you may want to apply tolower(ex) first.
ex<-c("The annual observed flow rate: 99.99")
Regular expression & matches
Here regexpr searches for a number with one or two digits before and after the decimal point. The period is escaped so that it matches a literal "." rather than any character.
res<-regmatches(ex, regexpr("[0-9]{1,2}\\.[0-9]{1,2}",ex))
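The pattern above picks up the first number of that shape anywhere in the string, whatever precedes it. A minimal sketch of tying the match to a keyword instead, using base R's grep and sub with a capture group, is shown below. The helper name extract_after, the sample lines, and the "m3/h" unit are hypothetical and only there for illustration; the sketch also assumes the keyword contains no regex metacharacters and appears at most once per line.

```r
# Hypothetical sample lines standing in for the output of readLines()
all_data <- c(" Annual Hours of Operation: 8760.0 hours/yr",
              " Flow Rate: 99.99 m3/h",
              " Some other line")

# Hypothetical helper: for every line containing "<keyword>: <number>",
# return the number as a numeric value
extract_after <- function(lines, keyword) {
  pattern <- paste0(".*", keyword, ":\\s*(-?[0-9]+\\.?[0-9]*).*")
  hits <- grep(pattern, lines, value = TRUE)   # keep only matching lines
  as.numeric(sub(pattern, "\\1", hits))        # replace each line by the captured number
}

flow  <- extract_after(all_data, "Flow Rate")
hours <- extract_after(all_data, "Annual Hours of Operation")
```

Because the number is captured rather than matched by a fixed digit count, this also handles values such as 8760.0 that have more than two digits before the decimal point.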
Using positional arguments
Another way to do it is to use the cwhmisc package. This solution finds the start position of the word "rate". In this example the number begins 6 positions after that (past "rate", the colon, and the space), so you can then extract it with substr.
library(cwhmisc)
A<-cpos(ex,"rate", start=1) # start position of "rate" in the string
res<-substr(ex, start=A+6, stop=A+10) # the five characters of "99.99"
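If installing an extra package is not an option, the same positional idea can be sketched with base R's regexpr, which returns the start of the match and, as an attribute, its length. The sketch below is an assumption-laden variant rather than the answer's own method: it searches for the literal text "rate: " (keyword, colon, and one space) and assumes the number is exactly five characters wide, as in the example string.

```r
ex <- "The annual observed flow rate: 99.99"

# Locate the literal text "rate: "; fixed = TRUE disables regex interpretation
m <- regexpr("rate: ", ex, fixed = TRUE)

# The number starts right after the matched text
start <- as.integer(m) + attr(m, "match.length")

# Assumes a five-character number such as "99.99"
res <- substr(ex, start, start + 4)
```

Deriving the start from the match length avoids hard-coding the offset, although the fixed width of the number remains a limitation of any purely positional approach.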
If the flow rate appears multiple times
Split the elements of the vector into substrings at the keyword, then capture the number in each piece as before.
ex<-c("The annual observed flow rate: 99.99; the monthly flow rate: 90.03; the weekly observed flow rate: 92.22")
ndat<-unlist(strsplit(ex, "flow"))
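To pick out a particular occurrence, such as the third, one option is to skip the split and collect every match in order with gregexpr, then index the result. The following is a minimal sketch under the assumption that each value is written as "flow rate: " followed by a plain decimal number; the variable names hits and rates are illustrative.

```r
ex <- c("The annual observed flow rate: 99.99; the monthly flow rate: 90.03; the weekly observed flow rate: 92.22")

# Find every "flow rate: <number>" occurrence, in order of appearance;
# unlist() flattens the result so this also works on a multi-line vector
hits <- unlist(regmatches(ex, gregexpr("flow rate:\\s*[0-9]+\\.?[0-9]*", ex)))

# Strip the keyword so only the numbers remain
rates <- as.numeric(sub("flow rate:\\s*", "", hits))

third <- rates[3]   # the third occurrence
```

The same two steps apply to the character vector returned by readLines, in which case the occurrences are numbered in the order they appear in the file.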