How to read a text file into GNU R with a multiple-byte separator?


Question

I can use read.csv or read.csv2 to read data into R. The problem is that my separator is a multiple-byte string instead of a single character. How can I deal with this?

Recommended answer

Providing example data would help. However, you might be able to adapt the following to your needs.

I created an example data file, which is just a text file containing the following:

1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3
1sep2sep3

I saved it as 'test.csv'. The separator is the string 'sep'. I think read.csv() uses scan(), which only accepts a single character for sep. To get around that, consider the following:

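dat <- readLines('test.csv')       # read the raw lines
dat <- gsub("sep", " ", dat)       # swap the multi-character separator for a space
con <- textConnection(dat)         # treat the character vector as a file
dat <- read.table(con)             # parse using the default whitespace separator
close(con)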

readLines() just reads the lines in. gsub() substitutes a single ' ', or whatever is convenient for your data, for the multi-character separator string. textConnection() and read.table() then read everything back in conveniently. For smaller datasets, this should be fine. If you have very large data, consider preprocessing with something like AWK to replace the multi-character separator string. The above is from http://tolstoy.newcastle.edu.au/R/e4/help/08/04/9296.html .
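The steps above can be checked end to end with a throwaway file. This is only a minimal sketch: it assumes a temporary file created with tempfile() in place of test.csv, and it uses the text argument of read.table() rather than an explicit textConnection(). The names tmp, lines and dat are illustrative.

```r
# Write the sample data to a temporary file (a stand-in for 'test.csv').
tmp <- tempfile(fileext = ".csv")
writeLines(rep("1sep2sep3", 7), tmp)

# Read the raw lines, then swap the multi-character separator for a space.
lines <- readLines(tmp)
lines <- gsub("sep", " ", lines, fixed = TRUE)

# read.table() can parse a character vector directly through 'text'.
dat <- read.table(text = lines)
print(dim(dat))

unlink(tmp)
```

Passing fixed = TRUE makes gsub() match the separator literally instead of as a regular expression, which is probably what is wanted when the separator is a plain string.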

Update
Regarding your comment: if your data contains spaces, use a different replacement separator. Consider changing test.csv to:

1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3
1sep2 2sep3 

Then, use the following function:

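readMulti <- function(x, sep, replace, as.is = TRUE)
{
    dat <- readLines(x)
    # Note: gsub() treats 'sep' as a regular expression here.
    dat <- gsub(sep, replace, dat)
    con <- textConnection(dat)
    on.exit(close(con))    # release the connection when the function exits
    dat <- read.table(con, sep = replace, as.is = as.is)

    return(dat)
}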

Try:

readMulti('test.csv', sep = "sep", replace = "\t", as.is = TRUE)

Here, you replace the original separator with tabs (\t). The as.is argument is passed to read.table() to prevent strings from being read in as factors, but that's your call. If you have more complicated white space within your data, you might find the quote argument in read.table() helpful, or pre-process with AWK, perl, etc.
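One caveat: gsub() treats its pattern as a regular expression by default, so a separator containing metacharacters, such as "||", would not be matched literally. The sketch below is one possible way around that, not part of the original answer. It assumes a hypothetical helper named readMultiFixed and a made-up "||" separator, and passes fixed = TRUE so the separator is taken as a plain string.

```r
# Hypothetical variant of readMulti() that matches the separator literally.
readMultiFixed <- function(x, sep, replace = "\t", as.is = TRUE) {
    lines <- readLines(x)
    lines <- gsub(sep, replace, lines, fixed = TRUE)
    read.table(text = lines, sep = replace, as.is = as.is)
}

# Made-up sample data: fields contain spaces, separator is "||".
tmp <- tempfile()
writeLines(c("1||2 2||a", "4||5 5||b"), tmp)

dat <- readMultiFixed(tmp, sep = "||")
str(dat)

unlink(tmp)
```

Because the replacement separator is a tab, the embedded spaces in the middle field survive as part of the value.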

Something similar using crippledlambda's strsplit() approach is most likely equivalent for moderately sized data. If performance becomes an issue, try both and see which works better for you.
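The strsplit() answer referred to is not reproduced here, so the following is only a guess at what such an approach might look like: split each line on the literal separator and bind the pieces into a data frame. The names parts, mat and dat are hypothetical, and the sketch assumes every line has the same number of fields.

```r
# In practice these lines would come from readLines('test.csv').
lines <- c("1sep2 2sep3", "1sep2 2sep3", "1sep2 2sep3")

# Split every line on the literal multi-character separator.
parts <- strsplit(lines, "sep", fixed = TRUE)

# Stack the pieces row by row; assumes equal field counts per line.
mat <- do.call(rbind, parts)
dat <- as.data.frame(mat, stringsAsFactors = FALSE)

# Columns come back as character; convert them where possible.
dat[] <- lapply(dat, type.convert, as.is = TRUE)
str(dat)
```

Unlike the gsub() route, this needs no replacement character, so it sidesteps the question of which single character is safe to use.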
