fread() error and strange behaviour when reading csv

Question

I used fread() from the data.table library to try to read a 540MB csv file. It returned an error message saying:

' ends field 36 on line 4 when detecting types: 20.00,8/25/2006 0:00:00,"07:05:00 PM","CST",143.00,"OTTAWA","KS","HAIL",1.00,"S","MINNEAPOLIS",8/25/2006 0:00:00,"07:05:00 PM",0.00,,1.00,"S","MINNEAPOLIS",0.00,0.00,,88.00,0.00,0.00,0.00,,0.00,,"TOP","KANSAS, East",,3907.00,9743.00,3907.00,9743.00,"Dime to nickel sized hail.

I have no idea what caused the error, and I want to track down whether it's a bug or just a data formatting issue that I can tweak fread() to handle.
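One possible way to start tracking this down, sketched under the assumption that the culprit is a line break inside a quoted field (the record in the error message stops right after an opening quote, which hints at that): base R's count.fields() records NA for a physical line that starts a quoted field continuing onto the next line. The small file below is made up for illustration and is not the real data.

```r
# Hypothetical miniature of the problem file: the REMARKS field of the
# second record contains a line break inside its quotes.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "STATE__,REMARKS,REFNUM",
  "20,\".\",617129",
  "20,\"Dime to nickel sized hail.",
  ".\",617130",
  "20,\"\",617131"
), path)

# Count the fields on every physical line of the file.
n_fields <- count.fields(path, sep = ",", quote = "\"")

# Lines where a quoted field is still open at the end of the line are NA.
suspect_lines <- which(is.na(n_fields))
print(suspect_lines)
```

On the real file, the same call with the path to the main csv would list candidate lines to inspect by hand.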

I managed to read the csv using read.csv(), and decided to track down the row that triggered the error above (line 617174, not line 4 as the error message says). I then wrote that row, together with the rows immediately before and after it, to testout.csv using write.csv().

I was able to read testout.csv back using read.csv(), which created a data frame with 3 observations, as expected. Using fread() on testout.csv, however, resulted in a data table with only 1 observation, which is the last row.
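A minimal sketch of that round trip in base R, using a made-up three-row data frame in place of the rows taken from the main file. The line break inside REMARKS is an assumption about what the real field contains, and the column subset is chosen only for brevity.

```r
# Hypothetical stand-in for the three extracted rows.
rows <- data.frame(
  STATE__ = c(20, 20, 20),
  REMARKS = c(".", "Dime to nickel sized hail.\n.", ""),
  REFNUM  = c(617129, 617130, 617131),
  stringsAsFactors = FALSE
)

# Write them out the same way as in the question.
path <- tempfile(fileext = ".csv")
write.csv(rows, path, row.names = FALSE)

# read.csv() honours the quotes, so the line break stays inside one field.
back <- read.csv(path, stringsAsFactors = FALSE)
n_records <- nrow(back)

# The file has more physical lines than header plus records, because one
# record is spread over two lines.
n_lines <- length(readLines(path))

print(c(records = n_records, physical_lines = n_lines))
```

A reader that splits on every end-of-line before looking at quotes would see the last physical line as the first complete record, which would be consistent with getting only the last row back.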

The four lines in testout.csv are below (I start a new line for each entry for readability).

"STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM"

20,"8/25/2006 0:00:00","07:01:00 PM","CST",139,"OSAGE","KS","TSTM WIND",5,"WNW","OSAGE CITY","8/25/2006 0:00:00","07:01:00 PM",0,NA,5,"WNW","OSAGE CITY",0,0,NA,52,0,0,0,"",0,"","TOP","KANSAS, East","",3840,9554,3840,9554,".",617129

20,"8/25/2006 0:00:00","07:05:00 PM","CST",143,"OTTAWA","KS","HAIL",1,"S","MINNEAPOLIS","8/25/2006 0:00:00","07:05:00 PM",0,NA,1,"S","MINNEAPOLIS",0,0,NA,88,0,0,0,"",0,"","TOP","KANSAS, East","",3907,9743,3907,9743,"Dime to nickel sized hail. .",617130

20,"8/25/2006 0:00:00","07:07:00 PM","CST",125,"MONTGOMERY","KS","TSTM WIND",3,"N","COFFEYVILLE","8/25/2006 0:00:00","07:07:00 PM",0,NA,3,"N","COFFEYVILLE",0,0,NA,61,0,0,0,"",0,"","ICT","KANSAS, Southeast","",3705,9538,3705,9538,"",617131

When I run fread("testout.csv", sep=",", verbose=TRUE), the output is

Input contains no \n. Taking this to be a filename to open
File opened, filesize is  1.05E-06B
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 5 (the last non blank line in the first 'autostart') ... found ok
Found 37 columns
First row with 37 fields occurs on line 5 (either column names or first row of data)
Some fields on line 5 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 2
Subtracted 1 for last eol and any trailing empty lines, leaving 1 data rows
Type codes: 1444144414444111441111111414444111141 (first 5 rows)
Type codes: 1444144414444111441111111414444111141 (after applying colClasses and integer64)
Type codes: 1444144414444111441111111414444111141 (after applying drop or select (if supplied)

Any idea what may have caused the unexpected results, and the error in the first place? And is there any way around it? Just to be clear, my aim is to be able to use fread() to read the main file, even though read.csv() works so far.

Recommended answer

UPDATE: Now fixed in v1.9.3 on GitHub:


  • fread() now accepts line breaks inside quoted fields. Thanks to Clayton Stanley for highlighting.
    See: fread and a quoted multi-line column value

Windows users are reporting success with the latest version from GitHub.
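A quick check along these lines may help confirm that an installed data.table includes the fix. It is only a sketch: the file is a made-up miniature with a line break inside a quoted field, not the real data, and it assumes a data.table version at or after the fix.

```r
library(data.table)

# Hypothetical test file: the second record spans two physical lines
# because its quoted REMARKS field contains a line break.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "STATE__,REMARKS,REFNUM",
  "20,\".\",617129",
  "20,\"Dime to nickel sized hail.",
  ".\",617130",
  "20,\"\",617131"
), path)

dt <- fread(path, sep = ",")

# With the fix in place, all three records should come back.
print(nrow(dt))
print(dt$REFNUM)
```

If only one row comes back, the installed version most likely predates the fix.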
