r - Error: Text after processing all cols in fread (data.table)

Question

I tried to import a text file into R (3.4.0). The file actually contains 4 columns, but the 4th column is mostly empty until row 200,000 or later. I used fread() from the data.table package (version 1.10.4):

fread("test.txt",fill = TRUE, sep = "\t", quote = "", header = FALSE)

I got this error message:

Error in fread("test.txt", fill = TRUE, sep = "\t", quote = "", header = FALSE) : 
Expecting 3 cols, but line 258088 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep='  ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.

I checked the file, and there is additional text ("8-4") in the 4th column of row 258088.

Nevertheless, fill = TRUE did not solve this as I expected. I thought fread() might be determining the number of columns incorrectly because the additional column appears very late in the file. So I tried this:

fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 250000)

The error persisted. On the other hand,

fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 258080)

This gives no error.

I thought I had found the reason, but something odd happened when I tested with a dummy file generated by:

write.table(matrix(c(1:990000), nrow = 330000), "test2.txt", sep = "\t", row.names = FALSE)

with "8-4" added in the 4th column of row 250000 using Excel. When read by fread():

fread("test2.txt", fill = TRUE, header = FALSE, sep = "\t")

It worked fine with no error message, which suggests that a late additional column does not necessarily trigger the error.

I also tried changing the encoding ("Latin-1" and "UTF-8") and the quote argument, but neither helped.

Now I feel clueless. Hopefully I have done my homework well enough and provided reproducible information. Thank you for helping.

For additional environment info, my sessionInfo() is:

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] dplyr_0.5.0            purrr_0.2.2.2          readr_1.1.1            tidyr_0.6.3           
  [5] tibble_1.3.3           ggplot2_2.2.1          tidyverse_1.1.1        stringr_1.2.0         
  [9] microbenchmark_1.4-2.1 data.table_1.10.4     

loaded via a namespace (and not attached):
[1] Rcpp_0.12.11     cellranger_1.1.0 compiler_3.4.0   plyr_1.8.4       forcats_0.2.0   
[6] tools_3.4.0      jsonlite_1.5     lubridate_1.6.0  nlme_3.1-131     gtable_0.2.0    
[11] lattice_0.20-35  rlang_0.1.1      psych_1.7.5      DBI_0.6-1        parallel_3.4.0  
[16] haven_1.0.0      xml2_1.1.1       httr_1.2.1       hms_0.3          grid_3.4.0      
[21] R6_2.2.1         readxl_1.0.0     foreign_0.8-68   reshape2_1.4.2   modelr_0.1.0    
[26] magrittr_1.5     scales_0.4.1     rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5    
[31] colorspace_1.3-2 stringi_1.1.5    lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2     

Recommended answer

Actually, there is a difference between the two files you provided, and I think this is why fread produces different output.

In the first file, each line ends right after the 3rd column, except line 258088, where there is a tab, a 4th column, and then the end of the line. (You can use an editor's 'show all characters' option to confirm this.)
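One way to locate such a line without opening the file in an editor is to count the fields on every line. The sketch below uses base R's count.fields() on a small stand-in file; the file contents and the variable names (path, n_fields) are made up for illustration and are not the asker's data.

```r
# Hypothetical stand-in for test.txt: 3 tab-separated fields per line,
# except line 10, which carries a 4th field.
path <- tempfile(fileext = ".txt")
writeLines(c(rep("1\t2\t3", 9), "1\t2\t3\t8-4", "1\t2\t3"), path)

# Count the fields on every line, with quoting and comment handling
# disabled so that they cannot affect the count.
n_fields <- count.fields(path, sep = "\t", quote = "", comment.char = "")

# Distribution of field counts, and the lines that deviate from 3 fields.
print(table(n_fields))
odd_lines <- which(n_fields != 3L)
print(odd_lines)
```

On the real file, the same call with "test.txt" in place of path should point to the line named in the error message, assuming the file is as described above.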

The second file, on the other hand, has an extra tab in every row, i.e. a new empty column. So in the first case fread expects 3 columns and then encounters a 4th. In the second file, by contrast, fread expects 4 columns from the start.
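The difference between the two layouts can be made visible by counting tab characters per line. This is a minimal sketch on two tiny made-up files; count_tabs() is a hypothetical helper written for this illustration, not part of data.table or base R.

```r
# Two hypothetical files: one ragged (like test.txt), one where every row
# carries a trailing tab (like test2.txt after being saved by Excel).
path_ragged <- tempfile(fileext = ".txt")
path_padded <- tempfile(fileext = ".txt")
writeLines(c("1\t2\t3", "4\t5\t6\t8-4", "7\t8\t9"), path_ragged)
writeLines(c("1\t2\t3\t", "4\t5\t6\t8-4", "7\t8\t9\t"), path_padded)

# Hypothetical helper: number of tab characters on each line of a file.
count_tabs <- function(path) {
  x <- readLines(path)
  lengths(regmatches(x, gregexpr("\t", x, fixed = TRUE)))
}

tabs_ragged <- count_tabs(path_ragged)  # tab count varies between lines
tabs_padded <- count_tabs(path_padded)  # same tab count on every line

# print() escapes tabs as \t, which works like "show all characters".
print(readLines(path_ragged))
print(readLines(path_padded))
```

A constant tab count across all lines means the column count can be read off the first rows, while a varying count means the extra column only becomes visible further down the file.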

I checked read.table with fill = TRUE and it worked with both files. So I think the fill option of fread does something different.

I would expect that, with fill = TRUE, all lines would be used to infer the number of columns (at a cost in computation time).
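A full-scan approach along these lines can be sketched in base R: scan every line once to find the maximum field count, then pass that many column names to read.table so that fill = TRUE pads the shorter rows. This is an illustrative workaround under stated assumptions (made-up file contents, base R rather than fread), not a description of what fread does internally.

```r
# Hypothetical ragged file: only line 10 has a 4th field.
path <- tempfile(fileext = ".txt")
writeLines(c(rep("1\t2\t3", 9), "1\t2\t3\t8-4", "1\t2\t3"), path)

# Pass 1: scan all lines to find the widest row.
n_cols <- max(count.fields(path, sep = "\t", quote = "", comment.char = ""))

# Pass 2: read with explicit column names so the column count is fixed up
# front instead of being inferred from the first few lines.
dat <- read.table(
  path,
  sep = "\t", quote = "", comment.char = "",
  header = FALSE, fill = TRUE,
  col.names = paste0("V", seq_len(n_cols)),
  stringsAsFactors = FALSE
)

print(dim(dat))
print(dat[10, ])
```

The extra pass over the file is the computational cost mentioned above; for very large files it roughly doubles the read time.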

There are some nice workarounds in the comments that you can use.
