Reading a space-delimited text file where the first column also has spaces

Problem description

I'm trying to read a text file into R that looks like this:

Ant farm 45 67 89
Cookie 5 43 21
Mouse hole 5 87 32
Ferret 3 56 87

My problem is that the file is space delimited, and some entries in the first variable contain a space. Reading it into R therefore fails, because some rows appear to have more columns than others. Does anyone know a way to read this in?

Recommended answer

Ben's approach works great, but here is another approach using strapplyc, gsubfn or strapply from the gsubfn package.

First read in the data and set col.names, the separator and the pattern to use:

r <- readLines(textConnection(
 "Ant farm 45 67 89
Cookie 5 43 21
Mouse hole 5 87 32
Ferret 3 56 87"))

library(gsubfn)

col.names <- c("group", "x1", "x2", "x3")
sep <- ","  # if comma can appear in fields use something else
pat <- "^(.*) +(\\d+) +(\\d+) +(\\d+) *$"
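Before applying any of the variants, it can help to see what the pattern captures. The following is a minimal sketch in base R only (regexec and regmatches are not part of the original answer); it assumes the same sample lines and the same pat as above, and the name fields is purely illustrative.

```r
# Sample lines and pattern copied from the answer above.
r <- c("Ant farm 45 67 89", "Cookie 5 43 21",
       "Mouse hole 5 87 32", "Ferret 3 56 87")
pat <- "^(.*) +(\\d+) +(\\d+) +(\\d+) *$"

# regexec() locates the full match and each capture group;
# regmatches() extracts them as one character vector per input line.
m <- regmatches(r, regexec(pat, r))

# Drop the first element of each vector (the full match) and stack
# the four capture groups into a character matrix.
fields <- do.call(rbind, lapply(m, `[`, -1))
colnames(fields) <- c("group", "x1", "x2", "x3")  # illustrative names
fields
```

Because the first group is greedy and the line must end with three numbers, everything before those three numbers ends up in the first column, spaces included.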

1) strapplyc

tmp <- sapply(strapplyc(r, pat), paste, collapse = sep)
read.table(text = tmp, col.names = col.names, as.is = TRUE, sep = sep)

2) gsubfn. Alternatively, use the same code but replace the last two statements with:

tmp <- gsubfn(pat, ... ~ paste(..., sep = sep), r)
read.table(text = tmp, col.names = col.names, as.is = TRUE, sep = sep)
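If installing gsubfn is not an option, the same idea (rewrite each line with an unambiguous separator, then hand it to read.table) can probably be done with base R's sub and backreferences. This is a sketch of an alternative, not part of the original answer; it reuses the col.names, sep and pat values from above, and dat is an illustrative name.

```r
# Sample lines and settings copied from the answer above.
r <- c("Ant farm 45 67 89", "Cookie 5 43 21",
       "Mouse hole 5 87 32", "Ferret 3 56 87")
col.names <- c("group", "x1", "x2", "x3")
sep <- ","  # if comma can appear in fields use something else
pat <- "^(.*) +(\\d+) +(\\d+) +(\\d+) *$"

# Build the replacement "\\1,\\2,\\3,\\4": each backreference stands for
# one capture group, so every line is rewritten as group,x1,x2,x3.
tmp <- sub(pat, paste("\\1", "\\2", "\\3", "\\4", sep = sep), r)

# The rewritten lines are now unambiguous and can be parsed as usual.
dat <- read.table(text = tmp, col.names = col.names, as.is = TRUE, sep = sep)
dat
```

As with the gsubfn variants, a line that does not match the pattern is passed through unchanged, so it is worth checking the input with grepl(pat, r) first if malformed lines are possible.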

3) strapply. This one and the variation that follows do not require that sep be defined.

library(data.table)
tmp <- strapply(r, pat,
  ~ data.table(
      group = group, 
      x1 = as.numeric(x1), 
      x2 = as.numeric(x2), 
      x3 = as.numeric(x3)
    ))
rbindlist(tmp)

3a) This one involves some extra manipulation, so we might favor one of the other solutions, but here it is for completeness. combine = list prevents the individual outputs from being munged, and simplify = c removes the extra layer that combine = list added. Finally, we rbind everything together.

tmp <- strapply(r, pat,
  ~ data.frame(
      group = group, 
      x1 = as.numeric(x1), 
      x2 = as.numeric(x2), 
      x3 = as.numeric(x3),
      stringsAsFactors = FALSE
    ), combine = list, simplify = c)
do.call(rbind, tmp)

4) read.pattern. The development version of the gsubfn package has a new function, read.pattern, that is particularly direct for this type of problem:

library(devtools) # source_url
source_url("https://gsubfn.googlecode.com/svn/trunk/R/read.pattern.R") # from dev repo

read.pattern(text = r, pattern = pat, col.names = col.names, as.is = TRUE)

Note: These approaches have a couple of advantages (though Ben's approach could be modified for these cases as well). They take anything before the last 3 numbers and use it as the first field, so they still work if the first field has 3 or more words, or if one of the "words" is a set of digits (e.g. "17 inch ant farm").
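The behavior described in the note can be checked quickly with base R. The sketch below uses two made-up input lines (they are not from the question) and extracts only the first capture group of the same pattern; first is an illustrative name.

```r
# Same pattern as in the answer above.
pat <- "^(.*) +(\\d+) +(\\d+) +(\\d+) *$"

# Hypothetical lines: a first field that starts with digits, and one
# that has more than two words.
x <- c("17 inch ant farm 45 67 89", "Big red ant farm 1 2 3")

# Keep only the first capture group, i.e. everything before the
# last three numbers on each line.
first <- sub(pat, "\\1", x)
first
```

In both cases the leading text is kept whole, because only the final three numeric tokens are claimed by the other groups.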
