Most efficient way to read key value pairs where values span multiple lines?


Question

What is the fastest way to parse a text file such as the example below into a two-column data.frame, which can then be transformed into a wide format?

FN Thomson Reuters Web of Science™
VR 1.0
PT J
AU Panseri, Sara
   Chiesa, Luca Maria
   Brizzolari, Andrea
   Santaniello, Enzo
   Passero, Elena
   Biondi, Pier Antonio
TI Improved determination of malonaldehyde by high-performance liquid
   chromatography with UV detection as 2,3-diaminonaphthalene derivative
SO JOURNAL OF CHROMATOGRAPHY B-ANALYTICAL TECHNOLOGIES IN THE BIOMEDICAL
   AND LIFE SCIENCES
VL 976
BP 91
EP 95
DI 10.1016/j.jchromb.2014.11.017
PD JAN 22 2015
PY 2015

Using readLines is problematic because the continuation lines of multi-line fields don't carry their keys. Reading the file as a fixed-width table doesn't work either. Suggestions? If not for the multi-line issue, this would be easily accomplished with a function that operates on each row/record like so:

x <- "FN Thomson Reuters Web of Science"
re <- "^([^\\s]+)\\s*(.*)$"
key <- sub(re, "\\1", x, perl=TRUE)
value <- sub(re, "\\2", x, perl=TRUE)
data.frame(key, value)
##   key                          value
## 1  FN Thomson Reuters Web of Science

Notes: The field keys will always be uppercase and two characters long. The entire title and the list of authors can each be concatenated into a single cell.

Recommended answer

Here's another idea that might be useful if you want to stay in base R:

parseEntry <- function(entry) {
    ## Split at each newline that is followed by a non-space character,
    ## i.e. at the start of every new field
    ll <- strsplit(entry, "\\n(?=\\S)", perl=TRUE)[[1]]
    ## Join continuation lines to their field with a single space,
    ## so that wrapped words do not run together
    ll <- gsub("\\n\\s+", " ", ll)
    ## Split each field into key (2 chars) and value, skipping the separating space
    read.fwf(textConnection(ll), c(2, -1, max(nchar(ll))))
}

## Read in and collapse entry into one long character string.
## (If file contained more than one entry, you could preprocess it accordingly.)
ee <- paste(readLines("egFile.txt"), collapse="\n")
## Parse the entry
parseEntry(ee)
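
The question also asks for a wide format, which the function above does not produce by itself. What follows is only a minimal base R sketch of one way to get there, not part of the original answer: it works directly on the output of readLines, treats every line that starts with a non-space character as the start of a new field, and joins continuation lines with a single space. The names parseLines, lines, long and wide are hypothetical, and the sample is a shortened copy of the record from the question.

```r
## Hypothetical sample: a shortened copy of the record from the question.
## In practice this vector would come from readLines("egFile.txt").
lines <- c(
  "PT J",
  "AU Panseri, Sara",
  "   Chiesa, Luca Maria",
  "TI Improved determination of malonaldehyde by high-performance liquid",
  "   chromatography with UV detection as 2,3-diaminonaphthalene derivative",
  "PY 2015"
)

## parseLines is a hypothetical helper, not a function from any package.
parseLines <- function(lines) {
  ## Drop blank lines
  lines <- lines[nzchar(trimws(lines))]
  ## Assumption: a field starts on any line whose first character is not a space
  isKey <- grepl("^\\S", lines)
  stopifnot(length(lines) > 0, isKey[1])
  ## Continuation lines inherit the group number of the preceding key line
  group <- cumsum(isKey)
  ## Keys are the first two characters of each key line
  key <- substr(lines[isKey], 1, 2)
  ## Strip the key from key lines, then trim the padding on every line
  text <- trimws(ifelse(isKey, substring(lines, 3), lines))
  ## Concatenate the pieces of each field with a single space
  value <- vapply(split(text, group), paste, character(1), collapse = " ")
  data.frame(key = key, value = unname(value), stringsAsFactors = FALSE)
}

## Long format: one row per field
long <- parseLines(lines)

## Wide format: one row per record, one column per key
wide <- as.data.frame(as.list(setNames(long$value, long$key)),
                      stringsAsFactors = FALSE)
```

With a file that holds several records, the same helper could be applied to each record separately and the resulting one-row data frames combined afterwards. Note that a key that repeats within one record would end up in a renamed column, so this sketch assumes each key appears at most once per record.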
