Parsing txt files and extracting information in R

Question

I need to extract information from text files whose structure varies from file to file. This can be done with a macro, but because the files are variable, selecting by line number and by spacing within a line does not work for all of them.

Is there a way to parse txt files, search by keyword, and extract the information that follows the keyword? For example, given something like Flow Rate: 99.99, I would want to extract the 99.99. A further issue is that Flow Rate appears numerous times in each file. Is there a way to alias or index Flow Rate: so that I can select, say, the third occurrence?

Any hints or tips would be welcome. I know how to print the entire line when a keyword is identified, but not how to deal with multiple occurrences or how to select only the number after the keyword:

all_data = readLines("Unit 5 2013.txt")
hours_of_operation <- grep("Annual Hours of Operation:    ",all_data)
all_data[hours_of_operation]
[1] "    Annual Hours of Operation:    8760.0 hours/yr"

Thanks

J

Recommended answer

The following may help. I assume that you have read your text into one or more character vectors.

Example data

Note: If "Flow Rate" is capitalized in your files, you may want to apply tolower(ex) first.

ex<-c("The annual observed flow rate: 99.99")

Regular expression and matches

Here regexpr searches for a number with one or two digits before and after the period. The period is escaped so that it matches a literal "." rather than any character.

res<-regmatches(ex, regexpr("[0-9]{1,2}\\.[0-9]{1,2}",ex))
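The pattern above only covers numbers with at most two digits on each side of the decimal point, so it would miss a value such as 8760.0 from the question. A minimal sketch of a more general approach follows, assuming each value sits on the same line as its keyword and directly after a colon. The sample `lines` vector and the helper name `extract_after` are made up for illustration and are not part of the original answer.

```r
# Hypothetical sample lines standing in for readLines("Unit 5 2013.txt")
lines <- c("    Annual Hours of Operation:    8760.0 hours/yr",
           "    Flow Rate: 99.99",
           "    Some unrelated line")

# Illustrative helper (an assumption, not from the original answer):
# return the number that follows "<keyword>:" on every line containing it.
extract_after <- function(lines, keyword) {
  # keyword is used as a regular expression; group 1 captures the number
  pattern <- paste0(".*", keyword, ":[[:space:]]*(-?[0-9]+\\.?[0-9]*).*")
  # keep only the lines that contain the keyword followed by a number
  hits <- grep(pattern, lines, ignore.case = TRUE, value = TRUE)
  # replace each whole line with the captured number and convert it
  as.numeric(sub(pattern, "\\1", hits, ignore.case = TRUE))
}

hours <- extract_after(lines, "Annual Hours of Operation")
flow  <- extract_after(lines, "flow rate")
```

Passing ignore.case = TRUE avoids the separate tolower(ex) step. The helper returns one value per matching line, so it assumes the keyword appears at most once on any given line.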

Using position

Another way to do it is to use the cwhmisc package. This solution searches for the start position of the word "rate". Since the number is expected to begin 6 characters after that position, you can then extract it with substr.

library(cwhmisc)
A<-cpos(ex,"rate", start=1) # start position of "rate" in the string
res<-substr(ex, start=A+6, stop=A+10) # skip "rate: " (6 characters), then take 5 characters
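The same position-based idea can be sketched in base R without an extra package, using regexpr to locate the keyword. This is an illustrative variant rather than the answer's own method, and it assumes the number directly follows "rate: ".

```r
ex <- c("The annual observed flow rate: 99.99")

# regexpr() returns the start position of the first match and stores the
# length of the match in the "match.length" attribute
m <- regexpr("rate: ", ex, fixed = TRUE)
start <- as.integer(m) + attr(m, "match.length")  # first character after "rate: "

# Take everything after the keyword, then keep only the leading number
rest <- substring(ex, start)
res <- as.numeric(sub("^([0-9.]+).*", "\\1", rest))
```

Deriving the offset from the match length avoids hard-coding character counts, and trimming with sub means the number does not have to be exactly five characters long.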

If the flow rate appears more than once

Split the elements of the vector into substrings and capture the numbers as before.

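ex<-c("The annual observed flow rate: 99.99; the monthly flow rate: 90.03; the weekly observed flow rate: 92.22")
ndat<-unlist(strsplit(ex, "flow")) # one piece per occurrence of "flow"
res<-regmatches(ndat, regexpr("[0-9]{1,2}\\.[0-9]{1,2}",ndat)) # the number in each piece, in order
res[3] # third occurrence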
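For picking a specific occurrence, such as the third one asked about in the question, a possible alternative to splitting is to collect every match with gregexpr and index the result. The sketch below assumes each value is written as "flow rate: <number>"; the variable names are illustrative.

```r
ex <- c("The annual observed flow rate: 99.99; the monthly flow rate: 90.03; the weekly observed flow rate: 92.22")

# Collapse the text into one string so occurrences are counted across lines
txt <- paste(ex, collapse = "\n")

# gregexpr() finds every "flow rate: <number>" match, not just the first
hits <- regmatches(
  txt,
  gregexpr("flow rate:[[:space:]]*[0-9]+\\.?[0-9]*", txt, ignore.case = TRUE)
)[[1]]

# Strip the keyword to leave the numbers, in order of appearance
rates <- as.numeric(sub(".*:[[:space:]]*", "", hits))

# Select the occurrence of interest by index
third <- rates[3]
```

If the text comes from readLines, passing that character vector as ex works the same way, since the lines are collapsed before matching.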
