Parsing a CSV with irregular quoting rules using readr
Question
I have a weird CSV that I can't parse with readr. Let's call it data.csv. It looks something like this:
name,info,amount_spent
John Doe,Is a good guy,5412030
Jane Doe,"Jan Doe" is cool,3159
Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451
If all of the rows were like the first one below the header row (two character columns followed by an integer column), this would be easy to parse with read_csv:
library(readr)
df <- read_csv("data.csv")
However, some rows are formatted like the second one: the second column ("info") contains a string that is only partly enclosed in double quotes. As a result, read_csv doesn't treat the comma after the word cool as a delimiter, and the entire following row gets appended to the offending cell.
A solution for this kind of problem is to pass FALSE to the escape_double argument of read_delim(), like so:
df <- read_delim("data.csv", delim = ",", escape_double = FALSE)
This works for the second row, but breaks on the third, where the second column contains a string enclosed in double quotes that itself contains nested double quotes and a comma.
I have read the readr documentation but have not yet found a solution that parses both types of rows.
Recommended answer
You could use a regular expression that splits only on the commas that act as delimiters, using (*SKIP)(*FAIL) to ignore commas inside quoted spans:
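# The three data rows from data.csv (header omitted)
input <- c('John Doe,Is a good guy,5412030', 'Jane Doe,"Jan Doe" is cool,3159',
'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451')
# Skip anything between a pair of double quotes, then split on the remaining commas
lst <- strsplit(input, '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)
# Bind the rows together and name the columns; the outer parentheses print the result
(df <- setNames(as.data.frame(do.call(rbind, lst)), c("name","info","amount_spent")))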
This yields
name info amount_spent
1 John Doe Is a good guy 5412030
2 Jane Doe "Jan Doe" is cool 3159
3 Senator Sally Doe "Sally "Sal" Doe is from New York, NY" 4451
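To see why the pattern copes with the nested quotes in the third row, it may help to look at which spans the `"[^"]*"` branch actually consumes. The sketch below is only an illustration (the variable names `x`, `quoted_spans` and `fields` are made up here): it shows that the quotes are paired naively from left to right, and that the split still works because the embedded comma happens to fall inside one of those pairs.

```r
# Third data row from the question.
x <- 'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'

# Spans matched by the first alternative. (*SKIP)(*FAIL) discards them,
# so a comma inside one of these spans never reaches the `,` branch.
quoted_spans <- regmatches(x, gregexpr('"[^"]*"', x, perl = TRUE))[[1]]
print(quoted_spans)

# Splitting with the full pattern leaves only the two delimiter commas.
fields <- strsplit(x, '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)[[1]]
print(fields)
```

Note that the quotes are paired as `"Sally "` and `" Doe is from New York, NY"`, not as an outer pair with an inner pair. A row whose embedded comma sits between two such pairs, rather than inside one, would therefore be split in the wrong place.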
See a demo of the expression on regex101.com.
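The answer works on a hard-coded character vector, while the question starts from a file. A minimal sketch of how the same split might be applied to data.csv follows, assuming the file has a plain comma-separated header line and that every data row should yield exactly three fields. The temporary file only stands in for the real data.csv so the example is self-contained, and the field-count check is an addition that is not part of the original answer.

```r
# Recreate the question's data.csv in a temporary file (stand-in for the real file).
path <- tempfile(fileext = ".csv")
writeLines(c(
  'name,info,amount_spent',
  'John Doe,Is a good guy,5412030',
  'Jane Doe,"Jan Doe" is cool,3159',
  'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'
), path)

lines  <- readLines(path)
header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]
fields <- strsplit(lines[-1], '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)

# rbind() only warns and recycles when rows differ in length,
# so stop early if any row did not split into the expected number of fields.
bad <- which(lengths(fields) != length(header))
if (length(bad) > 0) {
  stop("Unexpected field count on data row(s): ", paste(bad, collapse = ", "))
}

df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(df) <- header

# strsplit() returns character data, so convert the numeric column explicitly.
df$amount_spent <- as.integer(df$amount_spent)
str(df)
```

With a real file, replace `path` with "data.csv" and drop the writeLines() call. The explicit conversion matters because the data frame built in the answer keeps amount_spent as text rather than as an integer column.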