Parsing a CSV with irregular quoting rules using readr

Views: 117 Published: 2020/7/5 18:44:21 Tags: r regex tidyverse readr


Question

I have a weird CSV that I can't parse with readr. Let's call it data.csv. It looks something like this:

name,info,amount_spent
John Doe,Is a good guy,5412030
Jane Doe,"Jan Doe" is cool,3159
Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451
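
To reproduce the problem locally, the sample above can be written to a file first. This is only a minimal sketch in base R; the names `csv_lines` and `path` are illustrative assumptions and do not appear in the original post, and the file is placed in a temporary directory rather than the working directory.

```r
# Recreate the sample file shown above (illustrative names, not from the post).
csv_lines <- c(
  'name,info,amount_spent',
  'John Doe,Is a good guy,5412030',
  'Jane Doe,"Jan Doe" is cool,3159',
  'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'
)

# Write to a temporary location so nothing in the working directory is touched.
path <- file.path(tempdir(), "data.csv")
writeLines(csv_lines, path)

# Read the file back as raw text to confirm it round-trips unchanged.
raw <- readLines(path)
identical(raw, csv_lines)
```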

If all of the rows were like the first one below the header row (two character columns followed by an integer column), this would be easy to parse with read_csv:

library(readr)
df <- read_csv("data.csv")

However, some rows are formatted like the second one: the second column ("info") contains a string that is only partly enclosed in double quotes. As a result, read_csv does not treat the comma after the word cool as a delimiter, and the entire following row gets appended to the offending cell.

One solution for this kind of problem is to pass FALSE to the escape_double argument of read_delim(), like so:

df <- read_delim("data.csv", delim = ",", escape_double = FALSE)

This works for the second row but fails on the third, where the second column contains a string enclosed in double quotes that itself contains nested double quotes and a comma.

I have read the readr documentation but have not yet found a solution that parses both types of rows.

Recommended answer

You could use a regular expression that splits only on the delimiting commas, using (*SKIP)(*FAIL) to ignore any comma inside a double-quoted run:

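# Sample rows from data.csv (header line omitted)
input <- c('John Doe,Is a good guy,5412030', 'Jane Doe,"Jan Doe" is cool,3159',
           'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451')

# Split on commas, skipping any comma that sits inside a "..." run
lst <- strsplit(input, '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)

# Bind the pieces into rows and name the columns
(df <- setNames(as.data.frame(do.call(rbind, lst)), c("name","info","amount_spent")))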

This yields

               name                                   info amount_spent
1          John Doe                          Is a good guy      5412030
2          Jane Doe                      "Jan Doe" is cool         3159
3 Senator Sally Doe "Sally "Sal" Doe is from New York, NY"         4451
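
To see why the pattern leaves the comma in "New York, NY" alone, it may help to compare the positions of all commas in the third row with the positions where the full pattern matches. The left alternative `"[^"]*"` consumes each quote-to-quote run and `(*SKIP)(*FAIL)` discards it, so only commas outside such runs remain as split points. The sketch below uses base R only; the variable names are illustrative, and it assumes the double quotes in each line pair up the way they do in the sample, so a row with a stray unmatched quote could still split incorrectly.

```r
pattern <- '"[^"]*"(*SKIP)(*FAIL)|,'
line <- 'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'

# Positions of every comma in the line, quoted or not.
all_commas <- as.integer(gregexpr(",", line, fixed = TRUE)[[1]])

# Positions where the full pattern matches: commas outside quote-to-quote runs.
split_commas <- as.integer(gregexpr(pattern, line, perl = TRUE)[[1]])

# The comma that is skipped is the one inside the quoted text.
skipped <- setdiff(all_commas, split_commas)
substr(line, skipped - 8, skipped + 3)
```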

See a demo for the expression on regex101.com.
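
The same approach could be wrapped in a small helper that also takes the column names from the header line and converts the last column to integers, since the split pieces all come back as character strings. This is a sketch under stated assumptions, not part of the original answer: `parse_irregular_csv` is a hypothetical name, the header is assumed to contain no quoted commas, and every row is assumed to split into as many fields as the header has.

```r
# Hypothetical helper (not from the original answer): split each line on
# commas that fall outside double-quoted runs and build a data frame.
parse_irregular_csv <- function(lines) {
  # Assumption: the header line has no quoted commas, so a plain split is enough.
  header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]
  fields <- strsplit(lines[-1], '"[^"]*"(*SKIP)(*FAIL)|,', perl = TRUE)
  # Fail early if any row did not split into the expected number of fields.
  stopifnot(all(lengths(fields) == length(header)))
  df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(df) <- header
  df
}

# In practice the lines would come from readLines("data.csv");
# a literal vector keeps this example self-contained.
lines <- c(
  'name,info,amount_spent',
  'John Doe,Is a good guy,5412030',
  'Jane Doe,"Jan Doe" is cool,3159',
  'Senator Sally Doe,"Sally "Sal" Doe is from New York, NY",4451'
)

parsed <- parse_irregular_csv(lines)
# All columns are character after splitting; convert the numeric one explicitly.
parsed$amount_spent <- as.integer(parsed$amount_spent)
str(parsed)
```

The early `stopifnot()` check is a deliberate choice: with irregular quoting, a row that splits into the wrong number of fields is better reported as an error than silently recycled by `rbind`.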
