How to read unquoted extra \r with data.table::fread
Question
The data I have to process contains unquoted text with some additional \r characters. The files are big (500MB) and numerous (>600), and changing the export is not an option. The data might look like:
A,B,C
blah,a,1
bloo,a\r,b
blee,c,d
- How can this be handled with data.table's fread?
- Is there a better R function for reading CSV files that performs similarly?
Repro
library(data.table)
csv<-"A,B,C\r\n
blah,a,1\r\n
bloo,a\r,b\r\n
blee,c,d\r\n"
fread(csv)
Error in fread(csv) : Expected sep (',') but new line, EOF (or other non printing character) ends field 1 when detecting types from point 0: bloo,a
Advanced repro
The simple repro might be too trivial to give a sense of scale...
samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Naive approach
fread("sample.csv")
# Akrun's approach with needing text read first
fread(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#>Error in file.info(input) : file name conversion problem -- name too long?
# Julia's approach with needing text read first
readr::read_csv(gsub("\r\n|\r", "", paste0(randomcsv,collapse="\r\n")))
#> Error: C stack usage 48029706 is too close to the limit
Recommended answer
Further to @dirk-eddelbuettel and @nrussell's suggestions, one way of solving this is to pre-process the file. The processor could also be called within fread(), but here it is performed in separate steps:
samplerecs<-c("blah,a,1","bloo,a\r,b","blee,c,d")
randomcsv<-paste0(c("A,B,C",rep(samplerecs,2000000)))
write(randomcsv,file = "sample.csv")
# Remove errant `\r`'s with tr - shown here is the Windows R solution
shell("C:/Rtools/bin/tr.exe -d '\\r' < sample.csv > sampleNEW.csv")
fread("sampleNEW.csv")
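For the variant where the processor is called within fread() itself, a minimal sketch follows. It assumes a data.table version whose fread() accepts the `cmd` argument and that a `tr` binary is on the PATH (on Windows that could be the Rtools one used above). The tiny sample file is only a stand-in for the real export.

```r
library(data.table)

# Small stand-in for the real export: one field carries a stray "\r".
samplerecs <- c("blah,a,1", "bloo,a\r,b", "blee,c,d")
writeLines(c("A,B,C", samplerecs), "sample.csv")

# Assumption: `tr` is available on the PATH.
# fread() runs the command and parses its output, so the cleaned
# copy does not have to be named or managed by hand.
dt <- fread(cmd = "tr -d '\\r' < sample.csv")
print(dt)
```

This strips every \r, including those in \r\n line endings, which fread handles fine since the \n terminators remain. With more than 600 files, the same call could be wrapped in lapply() over the file names and combined with rbindlist().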