Read CSV file with embedded double quotes and commas


Problem description

I am trying to read a dirty CSV file using the fread() function from the data.table package, but I have a problem with embedded double quotes and commas in the string values, that is, unescaped double quotes inside quoted fields. The following example data illustrates the problem. It consists of 3 lines and 6 columns, with the first line containing the column names:

"SA","SU","CC","CN","POC","PAC"
"NE","R","000","H "B", O","1","8"
"A","A","000","P","E,5","8"

The first problem is in line 2, where a field contains an embedded pair of double quotes and a comma: "H "B", O". The second problem is in line 3, where a field contains a comma within the double quotes: "E,5". I have tried the following:

Attempt 1

library(data.table)
x1 <- fread(file = "example.csv", quote = "\"")

Output:

> x1
     V1 "SA" "SU"   "CC" "CN" "POC" "PAC"
1: "NE"  "R"    0 "H "B"   O"   "1"     8
2:  "A"  "A"    0    "P"   "E    5"     8

Message:

Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning. Detected 6 column names but the data has 7 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

Conclusion: the result is incorrect because it adds a new column V1, and the quoted fields containing commas are split in two.

Attempt 2

x2 <- fread(file = "example.csv", quote = "")

Output:

> x2
     V1 "SA"  "SU"   "CC" "CN" "POC" "PAC"
1: "NE"  "R" "000" "H "B"   O"   "1"   "8"
2:  "A"  "A" "000"    "P"   "E    5"   "8"

Message:

Detected 6 column names but the data has 7 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

Conclusion: the result is incorrect because it adds a new column V1, and the double quotes are kept as part of the values.
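One way to see where the seventh column comes from is to count fields per line under two splitting rules. The sketch below is illustrative only and is not how fread() works internally; the variable names are made up for this example, and the three lines of the example file are held inline so nothing needs to be read from disk. Splitting on every comma gives 7 pieces for the two data lines, which matches the column count in the messages above, while splitting on the full "," delimiter between fields gives the expected 6.

```r
# The three lines of the example file, held inline so the sketch is self-contained.
example_lines <- c(
  '"SA","SU","CC","CN","POC","PAC"',
  '"NE","R","000","H "B", O","1","8"',
  '"A","A","000","P","E,5","8"'
)

# Naive count: split on every comma, ignoring quotes entirely.
naive_counts <- lengths(strsplit(example_lines, ",", fixed = TRUE))

# Delimiter-based count: split only on the three-character sequence ","
# that separates two quoted fields.
delimiter_counts <- lengths(strsplit(example_lines, '","', fixed = TRUE))

print(naive_counts)
print(delimiter_counts)
```

The header line has no embedded commas, so it yields 6 pieces under both rules; only the data lines disagree, which is why the header and the data end up with different column counts.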

Solution?

What I am looking for is a way to get an output similar to

> x3
   SA SU CC       CN POC PAC
1: NE  R  0 H 'B', O   1   8
2:  A  A  0        P E,5   8

preferably using fread(), but other suggestions are welcome.

Recommended answer

You could try cleaning your data beforehand by replacing the double quotes that delimit each field with single quotes, then telling fread() to use the single quote as the quote character.

library(data.table)

x = readLines('example.csv')  # the file from the question
y = gsub('","', "','", x) # replace the "," between fields with ','
y = gsub('^"|"$', "'", y) # replace the leading and trailing double quote of each line
z = paste(y, collapse='\n') # join the lines into one string that fread reads as text
df = fread(z, quote="'")
df

   SA SU CC       CN POC PAC
1: NE  R  0 H "B", O   1   8
2:  A  A  0        P E,5   8

I can't confirm that this is efficient since I don't know how big your file is, but it might be a worthwhile approach.
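The quote-swapping approach above would misparse any field that already contains a single quote. As a hedged alternative, here is a minimal base R sketch that skips quote handling altogether and splits each line on the "," delimiter. It is not part of the original answer, and it assumes that every field on every line is wrapped in double quotes and that no field contains the literal sequence "," itself. The variable names are hypothetical, and the example lines are held inline instead of being read with readLines().

```r
# The three lines of the example file, held inline so the sketch is self-contained.
example_lines <- c(
  '"SA","SU","CC","CN","POC","PAC"',
  '"NE","R","000","H "B", O","1","8"',
  '"A","A","000","P","E,5","8"'
)

# Strip the opening and closing double quote of each line.
trimmed <- sub('"$', "", sub('^"', "", example_lines))

# Split on the "," delimiter only; embedded quotes and commas stay inside the field.
fields <- strsplit(trimmed, '","', fixed = TRUE)

# Every line must yield the same number of fields, otherwise the assumption is violated.
stopifnot(length(unique(lengths(fields))) == 1L)

# The first line is the header; the remaining lines become rows of a character data frame.
rows <- do.call(rbind, fields[-1])
parsed <- as.data.frame(rows, stringsAsFactors = FALSE)
names(parsed) <- fields[[1]]

print(parsed)
```

All columns come back as character, so "000" keeps its leading zeros until the columns are converted explicitly. Note also that strsplit() drops a trailing empty string, so a line ending in an empty quoted field would fail the field-count check instead of being parsed silently. If a data.table is needed, the result can be passed to as.data.table().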
