fread(): reading table with \r\r\n as newline symbol

Published: 2017/3/12 11:44:51. Tags: r, performance, data.table, line-endings


Problem description

I have tab-delimited tables in text files where all lines end with \r\r\n (0x0D 0x0D 0x0A). If I try to read such a file with fread(), it says

Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

but I am not downloading these files, I already have them.
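
Before choosing a workaround, it can help to confirm that a file really uses \r\r\n by looking at its raw bytes. The following is a minimal sketch in base R; the helper name `has_crcrlf`, the 4096-byte window, and the demo file are assumptions made for illustration and are not part of data.table.

```r
# Hypothetical helper (not part of data.table): TRUE if the first line break
# within the first n bytes of the file is CR CR LF (0x0D 0x0D 0x0A).
has_crcrlf <- function(path, n = 4096L) {
    bytes <- as.integer(readBin(path, what = "raw", n = n))
    lf <- which(bytes == 0x0A)
    # no LF found, or the LF comes too early to be preceded by two bytes
    if (length(lf) == 0L || lf[1] < 3L) return(FALSE)
    all(bytes[lf[1] - 2:1] == 0x0D)
}

# Demo on a throwaway file written in binary mode, so the bytes are exact
demo_file <- tempfile(fileext = ".txt")
writeBin(charToRaw("a\tb\r\r\n1\t2\r\r\n"), demo_file)
has_crcrlf(demo_file)
```

The check only inspects the first line break, so it assumes the file uses one line-ending style throughout.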

So far, my workaround is to read the file with read.table() first (it treats the \r\r\n combination as a single end-of-line marker) and then convert the resulting data.frame with data.table():

mydt <- data.table(read.table(myfilename, header = T, sep = '\t', fill = T))

but I am wondering if there's any way to avoid slow read.table() and use fast fread() instead.

Recommended answer

I suggest using the GNU utility tr to get rid of those unnecessary \r characters. e.g.

library(data.table)

cat("a,b,c\r\r\n1, 2, 3\r\r\n4, 5, 6", file = "test.csv")
fread("test.csv")
## Error in fread("test.csv") : 
##  Line ending is \r\r\n. R's download.file() appears to add the extra \r in text mode on Windows. Please download again in binary mode (mode='wb') which might be faster too. Alternatively, pass the URL directly to fread and it will download the file in binary mode for you.

system("tr -d '\r' < test.csv > test2.csv")
fread("test2.csv")
##    a b c
## 1: 1 2 3
## 2: 4 5 6

If you are using Windows and do not have the tr utility, you can get it here.
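
If installing tr is not an option, the same byte-level cleanup can be sketched in base R. This is only an illustration under stated assumptions: `strip_cr` is a hypothetical helper name, the file is read into memory in one go, and, like `tr -d '\r'`, it drops every carriage return, including any inside quoted fields.

```r
# Hypothetical helper (not part of data.table): copy a file while dropping
# every carriage-return byte (0x0D), mimicking `tr -d '\r'`.
strip_cr <- function(infile, outfile) {
    size  <- file.info(infile)$size
    bytes <- readBin(infile, what = "raw", n = size)
    writeBin(bytes[bytes != as.raw(0x0D)], outfile)
    invisible(outfile)
}

# Write the sample in binary mode so that the bytes are exactly as given
con <- file("test.csv", open = "wb")
writeBin(charToRaw("a,b,c\r\r\n1, 2, 3\r\r\n4, 5, 6"), con)
close(con)

strip_cr("test.csv", "test2.csv")
# test2.csv now has plain \n line endings and can be passed to fread()
```

Reading and writing in binary mode avoids any newline translation by R itself, which matters on Windows where text-mode connections rewrite line endings.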

Added:

I did some comparisons of three methods, using a 100,000 x 5 sample CSV dataset.

OPcsv is the "slow" read.table method  
freadScan is a method that discards the extra \r characters in pure R  
freadtr calls GNU tr through the shell using fread() directly.  

The third method is the fastest.

# create a 100,000 x 5 sample dataset with lines ending in \r\r\n
delim <- "\r\r\n"
sample.txt <- paste0("a, b, c, d, e", delim)
for (i in 1:100000) {
    sample.txt <- paste0(sample.txt,
                        paste(round(runif(5)*100), collapse = ","),
                        delim)
}
cat(sample.txt, file = "sample.csv")


# function that translates the extra \r characters in R only
fread2 <- function(filename) {
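    # read whole lines: sep = "\n" stops scan() from splitting on the spaces
    # in the header; CR, LF and CRLF are all accepted as line ends
    tmp <- scan(file = filename, what = "character", sep = "\n", quiet = TRUE)
    # remove any empty lines caused by the extra \r
    tmp <- tmp[tmp != ""]
    # paste the lines back together with the \n character
    tmp <- paste(tmp, collapse = "\n")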
    fread(tmp)
}

# OP's method using read.table, which is slow
readcsvMethod <- function(myfilename)
    data.table(read.table(myfilename, header = TRUE, sep = ',', fill = TRUE))

library(microbenchmark)
microbenchmark(OPcsv = readcsvMethod("sample.csv"),
               freadScan = fread2("sample.csv"),
               freadtr = fread("tr -d \'\\r\' < sample.csv"),
               unit = "relative")
## Unit: relative
##           expr      min       lq     mean   median       uq      max neval
##          OPcsv 1.331462 1.336524 1.340037 1.365397 1.366041 1.249223   100
##      freadScan 1.532169 1.581195 1.624354 1.673691 1.676596 1.355434   100
##        freadtr 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100

