CSV parsing for embedded double quotes
Question
I've written a simple CSV file parser. But after looking at the wiki page on CSV formats, I noticed some "extensions" to the basic format, specifically commas embedded in a field via double quotes. I've managed to parse those; however, there is a second issue: embedded double quotes.
Example:
12345,"ABC, ""IJK"" XYZ" -> [12345] and [ABC, "IJK" XYZ]
I can't seem to find the correct way to distinguish an embedded double quote from one that ends the field. So my question is: what is the correct way/algorithm to parse CSV formats such as the one above?
Recommended answer
The way I normally think about this is to look at a value as either a single unquoted value or a sequence of double-quoted segments that form one value when joined by quotes. That is,
- to parse the next atom in the row:
  - read up to the first non-whitespace character
  - if the current character is not a quote:
    - mark the current spot
    - read up to the next comma or newline
    - return the text between the mark and the character before the comma (strip spaces if appropriate)
  - if the current character is a quote:
    - create an empty string buffer
    - while the current character is a quote:
      - mark the current position + 1 (skip the quote character)
      - read up to the next quote
      - if the buffer is not empty, append a quote to it
      - append to the buffer the text between the mark and the character before the current position (to strip both quotes)
      - advance one character (past the quote just read)
    - read up to the next comma or newline
    - return the buffer
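The steps above can be sketched in Python roughly as follows. This is only an illustration of the idea, not a complete CSV parser: the names `parse_atom` and `parse_row` are made up for this example, leading whitespace is skipped and unquoted values are stripped (which a strict reader might not want), and any text between a closing quote and the next comma is simply ignored.

```python
def parse_atom(line, pos=0):
    """Parse one field starting at pos.

    Returns (value, new_pos), where new_pos is the index of the comma or
    newline that ended the field, or len(line) at end of input.
    """
    n = len(line)
    # Read up to the first non-whitespace character (assumption: leading
    # blanks are not significant).
    while pos < n and line[pos] in " \t":
        pos += 1

    if pos >= n or line[pos] != '"':
        # Unquoted value: everything up to the next comma or newline.
        mark = pos
        while pos < n and line[pos] not in ",\n":
            pos += 1
        return line[mark:pos].strip(), pos

    # Quoted value: collect the quote-delimited segments one by one.
    segments = []
    while pos < n and line[pos] == '"':
        mark = pos + 1                  # skip the opening quote
        end = line.find('"', mark)      # read up to the next quote
        if end == -1:
            raise ValueError("unterminated quoted field")
        segments.append(line[mark:end])
        pos = end + 1                   # advance past the quote just read

    # Read up to the next comma or newline.
    while pos < n and line[pos] not in ",\n":
        pos += 1
    # Joining with a quote is the same as appending a quote to the buffer
    # before every segment except the first.
    return '"'.join(segments), pos


def parse_row(line):
    """Split one CSV row into a list of field values."""
    fields = []
    pos = 0
    while True:
        value, pos = parse_atom(line, pos)
        fields.append(value)
        if pos >= len(line) or line[pos] == "\n":
            return fields
        pos += 1  # step over the comma


print(parse_row('12345,"ABC, ""IJK"" XYZ"'))
```

A doubled quote inside a quoted field is handled naturally here: the first quote of the pair ends one segment, and the second starts the next, so the loop continues instead of ending the field.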
Essentially, split the quoted string into its double-quoted segments and then concatenate them back together with a quote character between each. Thus:
"ABC, ""IJK"" XYZ"
becomes the segments [ABC, ], [IJK], and [ XYZ], which in turn become ABC, "IJK" XYZ
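As a small illustration of the split-and-join view, the sketch here strips the outer quotes from the example field, splits on the doubled quote, and joins the pieces with a single quote. It assumes the field is well formed (it starts and ends with a quote, and every inner quote is doubled). The last lines compare the result with Python's standard `csv` module, which treats a doubled quote inside a quoted field as one literal quote by default.

```python
import csv
import io

raw = '"ABC, ""IJK"" XYZ"'

# Drop the enclosing quotes, then split on each doubled quote.
inner = raw[1:-1]
segments = inner.split('""')

# Join the segments back together with a single quote between them.
value = '"'.join(segments)
print(segments)
print(value)

# Cross-check against the standard library's CSV reader.
row = next(csv.reader(io.StringIO('12345,"ABC, ""IJK"" XYZ"\n')))
print(row)
```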