Possible to change the record delimiter in R?

Question

Is it possible to manipulate the record/observation/row delimiter when reading in data (e.g. with read.table) from a text file? It's straightforward to adjust the field delimiter using sep="", but I haven't found a way to change the record delimiter from an end-of-line character.

I am trying to read in pipe delimited text files in which many of the entries are long strings that include carriage returns. R treats these CRs as end-of-line, which begins a new row incorrectly and screws up the number of records and field order.

I would like to use a different delimiter instead of a CR. As it turns out, each row begins with the same string, so if I could use something like \nString to identify the true end-of-line, the table would import correctly. Here's a simplified example of what one of the text files might look like.

V1,V2,V3,V4
String,A,5,some text
String,B,2,more text and
more text
String,B,7,some different text
String,A,,

This should be read into R as:

V1      V2       V3      V4
String  A        5       some text
String  B        2       more text and more text
String  B        7       some different text
String  A        N/A     N/A

I can open the files in a text editor and clean them with a find/replace before reading in, but a systematic solution within R would be great. Thanks for your help.

Recommended answer

We can read the lines in and collapse them afterwards. g will have the value 0 for the header, 1 for the first record (and for any continuation lines that belong to it), 2 for the next record, and so on. tapply collapses the lines according to g, giving L2, and finally we re-read the collapsed lines with read.csv:

Lines <- "V1,V2,V3,V4
String,A,5,some text
String,B,2,more text and
more text
String,B,7,some different text
String,A,,"

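# Read the raw physical lines (for a real file, pass its path to readLines)
L <- readLines(textConnection(Lines))

# Each line starting with "String" opens a new record, so the running count
# is 0 for the header and 1, 2, ... for each record and its continuation lines
g <- cumsum(grepl("^String", L))
# Join the lines that share a group into one logical record
L2 <- tapply(L, g, paste, collapse = " ")

# Re-parse the collapsed records as CSV, keeping strings as character
DF <- read.csv(text = L2, as.is = TRUE)
# Empty text fields are read as "", so convert them to NA
DF$V4[ DF$V4 == "" ] <- NA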

This gives:

> DF
      V1 V2 V3                      V4
1 String  A  5               some text
2 String  B  2 more text and more text
3 String  B  7     some different text
4 String  A NA                    <NA>
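
Since the question mentions pipe-delimited files, the same idea can be wrapped in a small function. The following is only a sketch under stated assumptions: read_prefixed_records is a hypothetical helper name (not from the original answer), it assumes every true record starts with a known literal prefix, that continuation lines never start with that prefix, and that joining continuation lines with a single space is acceptable. It uses startsWith rather than a regular expression so that a prefix containing "|" needs no escaping.

```r
# Hypothetical helper (not part of the original answer): collapse physical
# lines into logical records and parse them as delimited text.
read_prefixed_records <- function(lines, prefix = "String", sep = ",") {
  # TRUE for each line that begins a new record (literal match, no regex)
  starts <- startsWith(lines, prefix)
  # 0 for the header, then 1, 2, ... for each record and its continuation lines
  g <- cumsum(starts)
  # Join the lines of each group with a space; as.vector drops the array names
  collapsed <- as.vector(tapply(lines, g, paste, collapse = " "))
  # na.strings = "" turns empty fields into NA in every column
  read.csv(text = collapsed, sep = sep, na.strings = "",
           stringsAsFactors = FALSE)
}

# Usage with a pipe-delimited version of the example data
lines <- c("V1|V2|V3|V4",
           "String|A|5|some text",
           "String|B|2|more text and",
           "more text",
           "String|B|7|some different text",
           "String|A||")
DF <- read_prefixed_records(lines, prefix = "String|", sep = "|")
```

For a real file, the lines argument would come from readLines on the file's path. Setting na.strings = "" replaces the separate step of assigning NA to empty V4 values, at the cost of treating every empty field in every column as missing.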
