Using fread to read data with double quotes and incorrect escape characters

Problem description

I am trying to load a large data file (some 20 million rows) using fread() from the data.table package. Some rows are causing great trouble, however.

Minimal example:

text.csv contains:

id, text
1,"""Oops"",\""The"",""Georgia"""        

fread("text.csv", sep=",")

Error in fread("text.csv", sep = ",") : 
  Not positioned correctly after testing format of header row. ch=','
In addition: Warning message:
In fread("text.csv", sep = ",") :
  Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: id, text

read.table() works somewhat better, but it is too slow and uses too much memory.

> read.table("text.csv", header = TRUE, sep=",")
  id                     text
1  1 "Oops",\\"The","Georgia"

I realize that my text file is not properly formatted, but it is too large to edit in a practical manner.

Any help is much appreciated.

A small sample of actual data records:

sample1.txt, a good record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music

> fread("sample1.txt", sep=",")
               materiale_id      dk5                      description         creator subject-phrase
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin             NA
                                           title  type
1: Koncert i Copenhagen Jazz House den 26.1.1995 music


sample2.txt, a good and a bad record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music
150012-leksikon:100019,,"Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...",,"[""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]",it - elementer i databehandling,article

> fread("sample2.txt", sep=",")
Empty data.table (0 rows) of 11 cols: 150012-leksikon:100019,V2,Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...,V4,[""Informatik","it"...

Updating to R version 3.2.3 and data.table 1.9.6 helps with the above, but creates issues with other records:

sample3.txt, a good and a bad record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000236595,,,Red Tampa Solist prf,"[""Tom"",""Georgia"",""1929-1930""]","Georgia Tom, 1929-1930",music
125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music

> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") : 
  Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.

Updating to the development version 1.9.7 of data.table breaks fread() altogether:

> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") : 
  showProgress is not type integer but type 'logical'. Please report.

It seems that the problem in my file occurs when records contain the string \\" (literally, not as a regular expression). Apparently there is one backslash too many, causing fread() to misinterpret a double quote as the end of a string where it should have been taken literally.

My best solution so far is to do this:

m1 <- readLines("data.csv", encoding="UTF-8")
m2 <- gsub("\\\\\"", "\\\"", m1)    
writeLines(m2, "data_new.csv", useBytes = TRUE)
m3 <- fread("data_new.csv", encoding="UTF-8", sep=",")

This seems to work.

I don't understand this 100%, though, so any clarifications are more than welcome.
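To make the gsub() call above easier to follow, here is a minimal base R sketch applied to the bad line from the minimal example (the variable names are illustrative, not from the original post). The pattern "\\\\\"" is the three-character string \\" which, read as a regular expression, matches one literal backslash followed by a double quote. The replacement "\\\"" is the two-character string \" and, because a backslash in a regex replacement escapes the character after it, only the double quote is written back. The net effect is that the stray backslash is dropped. With fixed = TRUE the same substitution needs fewer escapes:

```r
# The bad record from text.csv; in a single-quoted R string, \\ is one backslash.
bad_line <- '1,"""Oops"",\\""The"",""Georgia"""'

# Regex version, as used in the workaround above.
cleaned_regex <- gsub("\\\\\"", "\\\"", bad_line)

# Literal (non-regex) version: replace backslash + quote with a quote.
cleaned_fixed <- gsub('\\"', '"', bad_line, fixed = TRUE)

# Both approaches remove the backslash and leave the quotes balanced.
cat(cleaned_regex, "\n")
cat(cleaned_fixed, "\n")
```

After the substitution the field contains only doubled quotes, which is the escaping convention that fread() expects inside a quoted field.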

Recommended answer

Not a data.table solution, but you could try:

# read the file with 'readLines'
tmp <- readLines("trl.txt")

# create a column name vector of the first line
nms <- trimws(strsplit(tmp[1],',')[[1]])

# convert 'tmp' to a dataframe except the first line
tmp <- as.data.frame(tmp[-1])

# use 'separate' from 'tidyr' to split into two columns
library(tidyr)
df1 <- separate(tmp, "tmp[-1]", nms, sep=",", extra = "merge")

which gives:

> df1
  id                             text
1  1 """Oops"",\\""The"",""Georgia"""
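If tidyr is not available, the same idea (split each line at the first comma only and keep the remainder intact) can be sketched in base R. This is an illustrative alternative, not part of the original answer; the data is inlined here instead of being read from a file, and the helper variable names are assumptions:

```r
# Inline copy of the two lines of text.csv (\\ is a single backslash).
lines <- c('id, text', '1,"""Oops"",\\""The"",""Georgia"""')

# Column names come from the header line.
nms <- trimws(strsplit(lines[1], ",", fixed = TRUE)[[1]])
body <- lines[-1]

# Everything before the first comma, and everything after it.
first_field <- sub(",.*$", "", body)
remainder   <- sub("^[^,]*,", "", body)

df_base <- data.frame(first_field, remainder, stringsAsFactors = FALSE)
names(df_base) <- nms
print(df_base)
```

Like separate() with extra = "merge", this only works when every column except the last is free of embedded commas.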


Update for edit 1: With the new example data, fread seems to read the data normally:

> s1 <- fread("sample1.txt", sep=",")
> s1
               materiale_id      dk5                      description         creator subject-phrase                                         title  type
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin             NA Koncert i Copenhagen Jazz House den 26.1.1995 music


> s2 <- fread("sample2.txt", sep=",")
> s2
               materiale_id      dk5
1: 125030-katalog:000000003 [78.793]
2:   150012-leksikon:100019         
                                                                                                                                                                                                           description
1:                                                                                                                                                                                    Privatoptagelse. - Liveoptagelse
2: Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...
           creator                                                                                                                         subject-phrase
1: Frederik Lundin                                                                                                                                       
2:                 [""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]
                                           title    type
1: Koncert i Copenhagen Jazz House den 26.1.1995   music
2:               it - elementer i databehandling article


Update for edits 2 & 3:

When you look at the error message:

Error in fread("sample3.txt", sep = ",") : Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.

and then look at the second record of sample3.txt (line 3 of the file), you will see that the fifth column (subject-phrase) contains commas as well. You can solve this in three steps:

1: Read the file with readLines and replace the opening and closing characters of the fifth column (the "[ and ]" pairs) with another quote character:

r3 <- readLines("sample3.txt")
r3 <- gsub('\"[',"'",r3,fixed=TRUE)
r3 <- gsub(']\"',"'",r3,fixed=TRUE)

2: Write it back to a text file:

 writeLines(r3, "sample3-1.txt")

3: Now you can read it with fread (or read.table/read.csv). Because the number of column titles is not the same as the number of columns, you will have to use header = FALSE. Also explicitly set the quote character to the new quote character inserted in step 1:

s3 <- fread("sample3-1.txt", quote = "\'", header = FALSE, skip = 1)

which gives:

> s3
                         V1 V2 V3                   V4                                                        V5           V6                               V7    V8
1: 125030-katalog:000236595 NA NA Red Tampa Solist prf                         ""Tom"",""Georgia"",""1929-1930"" "Georgia Tom                       1929-1930" music
2: 125030-katalog:000236596 NA NA  Jane Lucas (Solist) ""1928-1931"",""Tom,\\""The"",""Georgia"",""Accompanist"" "Georgia Tom ""The Accompanist"" (1928-1931)" music

After that you can assign column names. The vector below is only a placeholder; the real one needs eight entries, one per column:

names(s3) <- c("character","vector","with","eight","column","names")
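To see why the quote swap in step 1 yields eight columns, the bad record can be checked in base R without touching any files. This is only a sketch: it uses scan() as a stand-in for fread's field splitting (the two parsers are not identical), and the variable names are illustrative:

```r
# The bad record from sample3.txt (\\ is a single backslash).
rec <- '125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music'

# Step 1 from above: swap the bracket delimiters for single quotes.
rec <- gsub('"[', "'", rec, fixed = TRUE)
rec <- gsub(']"', "'", rec, fixed = TRUE)

# Split on commas, treating only the single quote as a quote character.
fields <- scan(text = rec, what = "", sep = ",", quote = "'", quiet = TRUE)
length(fields)
fields
```

With only ' treated as a quote character, the bracketed list stays in one field, but the title is no longer protected by its double quotes and splits at its embedded comma. That is why the result has eight columns rather than the seven in the header.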

Note: I used the latest build of data.table v1.9.7 for this (two weeks old).
