CSV parsing for embedded double quotes


Question

I've written a simple CSV file parser. After looking at the wiki page on CSV formats, though, I noticed some "extensions" to the basic format, specifically commas embedded in a field by wrapping it in double quotes. I've managed to parse those, but there is a second issue: embedded double quotes.

Example:

12345,"ABC, ""IJK"" XYZ" -> [12345] and [ABC, "IJK" XYZ]

I can't seem to find the correct way to tell an embedded double quote apart from one that merely delimits the field. So my question is: what is the correct way/algorithm to parse CSV formats such as the one above?

Recommended answer

The way I normally think about this is to look at a value as either a single unquoted value or a sequence of double-quoted segments that are joined by quote characters to form one value. That is,


  • to parse the next atom in the row:
    • read up to the first non-whitespace character
    • if the current character is not a quote:
      • mark the current spot
      • read up to the next comma or newline
      • return the text between the mark and the character before the comma (strip spaces if appropriate)
    • otherwise:
      • create an empty string buffer
      • while the current character is a quote:
        • mark the current position +1 (skip the quote character)
        • read up to the next quote
        • if the buffer is not empty, append a quote to it
        • append to the buffer the text between the mark and the character before the current position (to strip both quotes)
        • advance one character (past the just-read quote)
      • read up to the next comma or newline
      • return the buffer

Essentially, split the quoted string into its double-quoted segments and then concatenate them with a quote character between each pair. Thus "ABC, ""IJK"" XYZ" becomes the segments [ABC, ], [IJK] and [ XYZ], which in turn become ABC, "IJK" XYZ.
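The steps above could be sketched in Python roughly as follows. This is only an illustration of the segment-splitting idea, not a complete CSV reader: the names parse_atom and parse_row are made up for this example, and the segments are rejoined with '"'.join(...) rather than with the "if the buffer is not empty" check. That is an assumption on my part, chosen so that a field beginning with an escaped quote (such as """a""") also comes out right.

```python
def parse_atom(line, pos=0):
    """Parse one field starting at pos; return (value, next_pos).

    Hypothetical helper for illustration only.
    """
    n = len(line)
    # Read up to the first non-whitespace character.
    while pos < n and line[pos] in " \t":
        pos += 1
    if pos >= n or line[pos] != '"':
        # Unquoted field: take everything up to the next comma or newline.
        mark = pos
        while pos < n and line[pos] not in ",\n":
            pos += 1
        return line[mark:pos].strip(), pos
    # Quoted field: collect the quote-delimited segments.
    segments = []
    while pos < n and line[pos] == '"':
        mark = pos + 1               # skip the quote that opens this segment
        pos = line.find('"', mark)   # read up to the next quote
        if pos == -1:
            raise ValueError("unterminated quoted field")
        segments.append(line[mark:pos])
        pos += 1                     # advance past the quote just read
    # Ignore anything left before the next comma or newline.
    while pos < n and line[pos] not in ",\n":
        pos += 1
    # Rejoin the segments with one literal quote between each pair.
    return '"'.join(segments), pos


def parse_row(line):
    """Split one CSV row into a list of field values."""
    fields, pos = [], 0
    while True:
        value, pos = parse_atom(line, pos)
        fields.append(value)
        if pos >= len(line) or line[pos] == "\n":
            return fields
        pos += 1                     # step over the comma


print(parse_row('12345,"ABC, ""IJK"" XYZ"'))
# prints ['12345', 'ABC, "IJK" XYZ']
```

For real data, an existing parser such as Python's standard csv module is probably the safer choice; the sketch is only meant to make the algorithm concrete.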
