How to use fread() as readLines() without auto column detection?

Question

I have a 5 GB .dat file (more than 10 million lines). Each line looks like aaaa bb cccc0123 xxx kkkkkkkkkkkkkk or aaaaabbbcccc01234xxxkkkkkkkkkkkkkk, for example. Because readLines() performs poorly on large files, I chose fread() to read it, but an error occurred:

library("data.table")
x <- fread("test.DAT")
Error in fread("test.DAT") : 
  Expecting 5 cols, but line 5 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
In addition: Warning message:
In fread("test.DAT") :
  Unable to find 5 lines with expected number of columns (+ middle)

How can I use fread() like readLines(), without automatic column detection? Or is there another way to solve this problem?

Recommended answer

Here's a trick. You could use a sep value that you know is not in the file. Doing that forces fread() to read each whole line as a single column. Then we can extract that column as an atomic vector (shown as [[1L]] below). Here's an example on a csv where I use ? as the sep. This way it acts similar to readLines(), only a lot faster.

f <- fread("Batting.csv", sep= "?", header = FALSE)[[1L]]
head(f)
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"       
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"  
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,," 
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"

Other uncommon characters you can try in sep are \ ^ @ # = and others. We can see that this produces the same output as readLines(). It's just a matter of finding a sep value that is not present in the file.

head(readLines("Batting.csv"))
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"                                  
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"                             
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"                            
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"                           
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,," 
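
One way to look for a sep value that is not present in the file is to test a few candidates against a sample of lines. The sketch below is only an illustration: pick_unused_sep() is a hypothetical helper, not part of data.table, and the candidate list and sample size are assumptions. It inspects only the first n lines, so a character it returns could still appear later in a large file.

```r
# Hypothetical helper (not part of data.table): return the first candidate
# character that does not occur in the first `n` lines of the file.
pick_unused_sep <- function(path,
                            candidates = c("?", "^", "@", "#", "="),
                            n = 10000L) {
  # Read only a sample of lines so the check stays cheap on large files
  sample_lines <- readLines(path, n = n, warn = FALSE)
  for (ch in candidates) {
    # fixed = TRUE matches the character literally, not as a regex
    if (!any(grepl(ch, sample_lines, fixed = TRUE))) {
      return(ch)
    }
  }
  stop("every candidate separator appears in the sampled lines")
}

# Usage on a small temporary file (made-up contents)
tmp <- tempfile(fileext = ".dat")
writeLines(c("a?b", "c^d", "plain text"), tmp)
sep_choice <- pick_unused_sep(tmp)
```

The returned character can then be passed as the sep argument to fread().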

Note: As @Cath mentioned in the comments, you could also simply use the line break character \n as the sep value.
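
A minimal sketch of the \n variant, compared against readLines() on a small temporary file. The file contents are made up to resemble the lines in the question. The extra arguments (colClasses, quote, strip.white) are not in the original answer; they are assumptions added here so that fread() does not guess types, interpret quote characters, or trim whitespace, which could otherwise make the result differ from readLines().

```r
library(data.table)

# Made-up sample standing in for test.DAT
tmp <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
             "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk",
             "a,b;c|d"), tmp)

# Read every line as one character column, then extract it as a vector
via_fread <- fread(tmp, sep = "\n", header = FALSE,
                   colClasses = "character",  # keep lines as text
                   quote = "",                # treat quote characters literally
                   strip.white = FALSE)[[1L]] # keep leading/trailing spaces

via_readLines <- readLines(tmp)

# Compare the two results
same <- identical(via_fread, via_readLines)
```

Files with blank lines or lines consisting only of NA may still be handled differently by the two functions, so it is worth running a comparison like this on a sample of the real data first.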
