Parsing a CSV with irregular quoting rules using readr

Question

I have a weird CSV that I can't parse with readr. Let's call it data.csv. It looks something like this:

name,info,amount_spent
John Doe,Is a good guy,5412030
Jane Doe,"Jan Doe" is cool,3159
Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451

If all of the rows were like the first one below the header row (two character columns followed by an integer column), this would be easy to parse with read_csv:

library(readr)
df <- read_csv("data.csv")

However, some rows are formatted like the second one, in that the second column ("info") contains a string, part of which is enclosed by double quotes and part of which is not. This makes it so read_csv doesn't read the comma after the word cool as a delimiter, and the entire following row gets appended to the offending cell.

A solution for this kind of problem is to pass FALSE to the escape_double argument in read_delim(), like so:

df <- read_delim("data.csv", delim = ",", escape_double = FALSE)

This works for the second row, but gets killed by the third, where the second column contains a string enclosed by double quotes which itself contains nested double quotes and a comma.

I have read the readr documentation but have as yet found no solution that would parse both types of rows.

Recommended answer

You could use a regular expression which splits on the comma in question (using (*SKIP)(*FAIL)):

input <- c('John Doe,Is a good guy,5412030', 'Jane Doe,"Jan Doe" is cool,3159',
           'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451')

lst <- strsplit(input, '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)

(df <- setNames(as.data.frame(do.call(rbind, lst)), c("name","info","amount_spent")))

This yields

               name                                   info amount_spent
1          John Doe                          Is a good guy      5412030
2          Jane Doe                      "Jan Doe" is cool         3159
3 Senator Sally Doe "Sally "Sal" Doe is from New York, NY"         4451
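
To apply the same split to the file itself rather than to a hard-coded vector, something like the following might work. It is a minimal base R sketch, assuming the first line is the header and that every line splits into the same number of fields; the temporary file is only a stand-in for the real data.csv. Because the split produces character columns, the amount column is converted explicitly.

```r
# Write the sample rows to a temporary file (a stand-in for data.csv).
path <- tempfile(fileext = ".csv")
writeLines(c(
  'name,info,amount_spent',
  'John Doe,Is a good guy,5412030',
  'Jane Doe,"Jan Doe" is cool,3159',
  'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'
), path)

lines <- readLines(path)

# Split on commas that are not inside a "..." pair.
fields <- strsplit(lines, '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)

# Assumption: every line yields the same number of fields.
stopifnot(length(unique(lengths(fields))) == 1)

header <- fields[[1]]
rows   <- do.call(rbind, fields[-1])

df <- setNames(as.data.frame(rows, stringsAsFactors = FALSE), header)

# The split leaves every column as character; convert the numeric one.
df$amount_spent <- as.integer(df$amount_spent)

str(df)
```

The stopifnot() check is there so that a line which splits into an unexpected number of fields fails loudly instead of being silently recycled by rbind.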

See a demo for the expression on regex101.com.
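
As a rough illustration of how the pattern behaves: the first alternative, "[^"]*", consumes a run from one double quote to the next, and (*SKIP)(*FAIL) then discards that run without retrying inside it, so only commas outside a quote pair are left for the second alternative to match. The strings below are made up for illustration. Note that quotes are simply paired left to right, so the approach appears to depend on each delimiter-like comma landing inside a pair; the second example shows an input where it does not.

```r
pattern <- '"[^"]*"(*SKIP)(*FAIL)|,'

# The comma inside the quoted run is skipped; the other two are delimiters.
parts <- strsplit('a,"b,c",d', pattern, perl = TRUE)[[1]]
print(parts)

# Hypothetical problem case: the quotes pair up as "a " and " d",
# leaving the comma in `b, c` outside any pair, so it is treated as a delimiter.
n_fields <- length(strsplit('x,"a "b, c" d",1', pattern, perl = TRUE)[[1]])
print(n_fields)  # 4 fields rather than the intended 3
```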
