Reading a space delimited text file where first column also has spaces
Problem description
I'm trying to read a text file into R that looks like this:
Ant farm 45 67 89
Cookie 5 43 21
Mouse hole 5 87 32
Ferret 3 56 87
etc.
My problem is that the file is space delimited, and some entries in the first variable contain a space, so reading it into R produces an error because some rows appear to have more columns than others. Does anyone know a way to read this in?
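The mismatch described above can be made visible before any parsing is attempted. The following is a minimal sketch in base R, assuming the sample rows are held in a character vector (the name txt is chosen here for illustration); it only counts whitespace-separated fields per line and is not part of the original question.

```r
# Sample rows from the question, held in a character vector
# (the variable name txt is an illustrative assumption).
txt <- c("Ant farm 45 67 89",
         "Cookie 5 43 21",
         "Mouse hole 5 87 32",
         "Ferret 3 56 87")

# count.fields() reports how many whitespace-separated fields
# each line would be split into by read.table().
con <- textConnection(txt)
n_fields <- count.fields(con)
close(con)

# Rows whose first entry contains a space show one extra field.
print(n_fields)
```

Lines whose count differs from the others are the ones whose first column contains a space.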
Recommended answer
Ben's approach works great, but here is another approach using strapplyc, gsubfn or strapply from the gsubfn package.
First read in the data and set col.names, the separator and the pattern to use:
r <- readLines(textConnection(
"Ant farm 45 67 89
Cookie 5 43 21
Mouse hole 5 87 32
Ferret 3 56 87"))
library(gsubfn)
col.names <- c("group", "x1", "x2", "x3")
sep <- "," # if comma can appear in fields use something else
pat <- "^(.*) +(\\d+) +(\\d+) +(\\d+) *$"
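Before applying the gsubfn functions, it can help to check what the pattern actually captures. The sketch below uses only base R (regexec and regmatches with perl = TRUE) as a stand-in for inspection; it is not one of the answer's approaches, and the name fields is an illustrative assumption.

```r
# Same sample data and pattern as above, repeated so this runs on its own.
r <- c("Ant farm 45 67 89",
       "Cookie 5 43 21",
       "Mouse hole 5 87 32",
       "Ferret 3 56 87")
pat <- "^(.*) +(\\d+) +(\\d+) +(\\d+) *$"

# regexec() returns the full match plus one entry per capture group;
# regmatches() extracts the corresponding substrings.
m <- regmatches(r, regexec(pat, r, perl = TRUE))

# Drop the full match (first element) and keep the four captured groups,
# giving a character matrix with one row per input line.
fields <- t(sapply(m, function(x) x[-1]))
print(fields)
```

The greedy (.*) takes as much as it can while still leaving three space-separated runs of digits at the end of the line, which is why multi-word names stay in the first group.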
1) strapplyc
tmp <- sapply(strapplyc(r, pat), paste, collapse = sep)
read.table(text = tmp, col.names = col.names, as.is = TRUE, sep = sep)
2) gsubfn. Alternatively, the same code but with the last two statements replaced with:
tmp <- gsubfn(pat, ... ~ paste(..., sep = sep), r)
read.table(text = tmp, col.names = col.names, as.is = TRUE, sep = sep)
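If installing gsubfn is not an option, the same idea of rewriting each line with an unambiguous separator can be sketched in base R with sub and backreferences. This is a swapped-in technique for illustration, not the gsubfn call above; the result name df is an assumption.

```r
# Setup repeated so the example is self-contained.
r <- c("Ant farm 45 67 89",
       "Cookie 5 43 21",
       "Mouse hole 5 87 32",
       "Ferret 3 56 87")
col.names <- c("group", "x1", "x2", "x3")
sep <- ","  # if comma can appear in fields use something else
pat <- "^(.*) +(\\d+) +(\\d+) +(\\d+) *$"

# Build the replacement "\1,\2,\3,\4" and rewrite each line so the
# four captured groups are joined by sep instead of spaces.
replacement <- paste("\\1", "\\2", "\\3", "\\4", sep = sep)
tmp <- sub(pat, replacement, r, perl = TRUE)

# The rewritten lines can now be parsed as ordinary delimited text.
df <- read.table(text = tmp, col.names = col.names, as.is = TRUE, sep = sep)
print(df)
```

As with the gsubfn versions, the separator must be a character that never occurs inside the first field.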
3) strapply. This one and the variation that follows do not require that sep be defined.
library(data.table)
tmp <- strapply(r, pat,
~ data.table(
group = group,
x1 = as.numeric(x1),
x2 = as.numeric(x2),
x3 = as.numeric(x3)
))
rbindlist(tmp)
3a) This one involves some extra manipulation, so we might favor one of the other solutions instead, but for completeness here it is. The combine=list prevents the individual outputs from being munged and the simplify=c removes the extra layer that combine=list added. Finally we rbind everything together.
tmp <- strapply(r, pat,
~ data.frame(
group = group,
x1 = as.numeric(x1),
x2 = as.numeric(x2),
x3 = as.numeric(x3),
stringsAsFactors = FALSE
), combine = list, simplify = c)
do.call(rbind, tmp)
4) read.pattern. The development version of the gsubfn package has a new function, read.pattern, that is particularly direct for this type of problem:
library(devtools) # source_url
source_url("https://gsubfn.googlecode.com/svn/trunk/R/read.pattern.R") # from dev repo
read.pattern(text = r, pattern = pat, col.names = col.names, as.is = TRUE)
Note: These approaches have a couple of advantages (though Ben's approach could be modified for these cases as well). They take everything before the last 3 numbers and use it as the first field, so they still work if the first field has 3 or more words or if one of the "words" is a set of digits (e.g. "17 inch ant farm").
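The claim in the note can be checked on inputs of that shape. The sketch below uses strcapture from the utils package (available in reasonably recent versions of R) as a base R stand-in for the gsubfn functions; the input rows other than "17 inch ant farm" and the names x, proto and df are made up for illustration.

```r
# Hypothetical rows: a first field starting with digits, one with
# more than two words, and a plain single-word one.
x <- c("17 inch ant farm 45 67 89",
       "Very large mouse hole 5 87 32",
       "Ferret 3 56 87")
pat <- "^(.*) +(\\d+) +(\\d+) +(\\d+) *$"

# proto describes the column names and types of the result.
proto <- data.frame(group = character(),
                    x1 = integer(),
                    x2 = integer(),
                    x3 = integer(),
                    stringsAsFactors = FALSE)

# strcapture() matches the pattern against each string and converts
# the captured groups to the column types given in proto.
df <- utils::strcapture(pat, x, proto, perl = TRUE)
print(df)
```

Because the pattern anchors on the last three numeric fields, the leading "17" stays part of the first column rather than being read as a number.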