How can I parse a CSV string with JavaScript, which contains commas in the data?


Question

I have the following type of string:

var string = "'string, duppi, du', 23, lala"

I want to split the string into an array on each comma, but only on the commas outside the single quotation marks.

I can't figure out the right regex for the split...

string.split(/,/)

will give me

["'string", " duppi", " du'", " 23", " lala"]

but the result should be:

["string, duppi, du", "23", "lala"]

Is there any cross-browser solution?

Recommended answer

Disclaimer

2014-12-01 Update: The answer below works only for one very specific format of CSV. As correctly pointed out by DG in the comments, this solution does NOT fit the RFC 4180 definition of CSV and it also does NOT fit the MS Excel format. This solution simply demonstrates how one can parse one (non-standard) CSV line of input which contains a mix of string types, where the strings may contain escaped quotes and commas.

As austincheney correctly points out, you really need to parse the string from start to finish if you wish to properly handle quoted strings that may contain escaped characters. Also, the OP does not clearly define what a "CSV string" really is. First we must define what constitutes a valid CSV string and its individual values.

For the purpose of this discussion, a "CSV string" consists of zero or more values, where multiple values are separated by a comma. Each value may consist of:

  1. A double quoted string. (May contain unescaped single quotes.)
  2. A single quoted string. (May contain unescaped double quotes.)
  3. A non-quoted string. (May NOT contain quotes, commas or backslashes.)
  4. An empty value. (An all-whitespace value is considered empty.)

Rules/Notes:

  • Quoted values may contain commas.
  • Quoted values may contain escaped-anything, e.g. 'that\'s cool'.
  • Values containing quotes, commas, or backslashes must be quoted.
  • Values containing leading or trailing whitespace must be quoted.
  • The backslash is removed from all: \' in single quoted values.
  • The backslash is removed from all: \" in double quoted values.
  • Non-quoted strings are trimmed of any leading and trailing spaces.
  • The comma separator may have adjacent whitespace (which is ignored).

What follows is a JavaScript function which converts a valid CSV string (as defined above) into an array of string values.

The regular expressions used by this solution are complex. And (IMHO) all non-trivial regexes should be presented in free-spacing mode with lots of comments and indentation. Unfortunately, JavaScript does not allow free-spacing mode. Thus, the regular expressions implemented by this solution are first presented in native regex syntax (expressed using Python's handy r"""...""" raw multi-line string syntax).

First, here is a regular expression which validates that a CSV string meets the above requirements:

re_valid = r"""
# Validate a CSV string having single, double or un-quoted values.
^                                   # Anchor to start of string.
\s*                                 # Allow whitespace before value.
(?:                                 # Group for value alternatives.
  '[^'\\]*(?:\\[\S\s][^'\\]*)*'     # Either Single quoted string,
| "[^"\\]*(?:\\[\S\s][^"\\]*)*"     # or Double quoted string,
| [^,'"\s\\]*(?:\s+[^,'"\s\\]+)*    # or Non-comma, non-quote stuff.
)                                   # End group of value alternatives.
\s*                                 # Allow whitespace after value.
(?:                                 # Zero or more additional values
  ,                                 # Values separated by a comma.
  \s*                               # Allow whitespace before value.
  (?:                               # Group for value alternatives.
    '[^'\\]*(?:\\[\S\s][^'\\]*)*'   # Either Single quoted string,
  | "[^"\\]*(?:\\[\S\s][^"\\]*)*"   # or Double quoted string,
  | [^,'"\s\\]*(?:\s+[^,'"\s\\]+)*  # or Non-comma, non-quote stuff.
  )                                 # End group of value alternatives.
  \s*                               # Allow whitespace after value.
)*                                  # Zero or more additional values
$                                   # Anchor to end of string.
"""
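Although JavaScript has no free-spacing flag, a similar effect can be approximated by assembling the pattern from small, commented pieces. The following is only a sketch: it assumes an environment with ES2015 template literals and String.raw, and the names SINGLE, DOUBLE, BARE, VALUE and reValid are illustrative, not part of the original answer.

```javascript
// Assemble the validation regex from commented parts.
// String.raw keeps every backslash literal, so the pieces read like regex source.
const SINGLE = String.raw`'[^'\\]*(?:\\[\S\s][^'\\]*)*'`;   // single quoted string
const DOUBLE = String.raw`"[^"\\]*(?:\\[\S\s][^"\\]*)*"`;   // double quoted string
const BARE   = String.raw`[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*`;  // non-comma, non-quote stuff

// One value: optional whitespace, one of the three alternatives, optional whitespace.
const VALUE = String.raw`\s*(?:${SINGLE}|${DOUBLE}|${BARE})\s*`;

// Whole string: one value followed by zero or more ",value" groups.
const reValid = new RegExp(`^${VALUE}(?:,${VALUE})*$`);

console.log(reValid.test("'string, duppi, du', 23, lala")); // true
console.log(reValid.test("one, that's me!"));               // false
```

The assembled pattern is intended to be the same as the single-line regex literal used in the function further down; the only difference is how it is written.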

If a string matches the above regex, then that string is a valid CSV string (according to the rules previously stated) and may be parsed with the regex below, which matches one value from the CSV string. It is applied repeatedly until no more matches are found (and all values have been parsed).

re_value = r"""
# Match one value in valid CSV string.
(?!\s*$)                            # Don't match empty last value.
\s*                                 # Strip whitespace before value.
(?:                                 # Group for value alternatives.
  '([^'\\]*(?:\\[\S\s][^'\\]*)*)'   # Either $1: Single quoted string,
| "([^"\\]*(?:\\[\S\s][^"\\]*)*)"   # or $2: Double quoted string,
| ([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)  # or $3: Non-comma, non-quote stuff.
)                                   # End group of value alternatives.
\s*                                 # Strip whitespace after value.
(?:,|$)                             # Field ends on comma or EOS.
"""

Note that there is one special case value that this regex does not match: the very last value when that value is empty. This special "empty last value" case is tested for and handled by the JS function which follows.

// Return array of string values, or NULL if CSV string not well formed.
function CSVtoArray(text) {
    var re_valid = /^\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*(?:,\s*(?:'[^'\\]*(?:\\[\S\s][^'\\]*)*'|"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^,'"\s\\]*(?:\s+[^,'"\s\\]+)*)\s*)*$/;
    var re_value = /(?!\s*$)\s*(?:'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*))\s*(?:,|$)/g;
    // Return NULL if input string is not well formed CSV string.
    if (!re_valid.test(text)) return null;
    var a = [];                     // Initialize array to receive values.
    text.replace(re_value, // "Walk" the string using replace with callback.
        function(m0, m1, m2, m3) {
            // Remove backslash from \' in single quoted values.
            if      (m1 !== undefined) a.push(m1.replace(/\\'/g, "'"));
            // Remove backslash from \" in double quoted values.
            else if (m2 !== undefined) a.push(m2.replace(/\\"/g, '"'));
            else if (m3 !== undefined) a.push(m3);
            return ''; // Return empty string.
        });
    // Handle special case of empty last value.
    if (/,\s*$/.test(text)) a.push('');
    return a;
};
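The "empty last value" special case can be seen in isolation with a small sketch. It reuses the same value regex as above; the helper names countValueMatches and endsWithEmptyValue are hypothetical, and String.prototype.matchAll is assumed to be available (ES2020).

```javascript
// Same pattern as re_value in CSVtoArray (global flag is required by matchAll).
const reValue = /(?!\s*$)\s*(?:'([^'\\]*(?:\\[\S\s][^'\\]*)*)'|"([^"\\]*(?:\\[\S\s][^"\\]*)*)"|([^,'"\s\\]*(?:\s+[^,'"\s\\]+)*))\s*(?:,|$)/g;

// Count how many values the regex alone finds in a (valid) CSV string.
function countValueMatches(text) {
    return Array.from(text.matchAll(reValue)).length;
}

// The extra check CSVtoArray performs: does the string end with a comma
// (optionally followed by whitespace)? If so, one more empty value is implied.
function endsWithEmptyValue(text) {
    return /,\s*$/.test(text);
}

console.log(countValueMatches("a,b"));   // 2
console.log(countValueMatches("a,b,"));  // 2 (the trailing empty value is not matched)
console.log(endsWithEmptyValue("a,b,")); // true, so a third, empty value is added
```

The negative lookahead at the start of the pattern is what stops the regex from matching at the end of the string, which is why the trailing comma has to be checked separately.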



Example input and output:

In the following examples, curly braces are used to delimit the {result strings}. (This is to help visualize leading/trailing spaces and zero-length strings.)

// Test 1: Test string from original question.
var test = "'string, duppi, du', 23, lala";
var a = CSVtoArray(test);
/* Array has 3 elements:
    a[0] = {string, duppi, du}
    a[1] = {23}
    a[2] = {lala} */



// Test 2: Empty CSV string.
var test = "";
var a = CSVtoArray(test);
/* Array has 0 elements: */



// Test 3: CSV string with two empty values.
var test = ",";
var a = CSVtoArray(test);
/* Array has 2 elements:
    a[0] = {}
    a[1] = {} */



// Test 4: Double quoted CSV string having single quoted values.
// Note: \\' in the JS literal puts a real backslash before the quote in the CSV text.
var test = "'one','two with escaped \\' single quote', 'three, with, commas'";
var a = CSVtoArray(test);
/* Array has 3 elements:
    a[0] = {one}
    a[1] = {two with escaped ' single quote}
    a[2] = {three, with, commas} */



// Test 5: Single quoted CSV string having double quoted values.
// Note: \\" in the JS literal puts a real backslash before the quote in the CSV text.
var test = '"one","two with escaped \\" double quote", "three, with, commas"';
var a = CSVtoArray(test);
/* Array has 3 elements:
    a[0] = {one}
    a[1] = {two with escaped " double quote}
    a[2] = {three, with, commas} */



// Test 6: CSV string with whitespace in and around empty and non-empty values.
var test = "   one  ,  'two'  ,  , ' four' ,, 'six ', ' seven ' ,  ";
var a = CSVtoArray(test);
/* Array has 8 elements:
    a[0] = {one}
    a[1] = {two}
    a[2] = {}
    a[3] = { four}
    a[4] = {}
    a[5] = {six }
    a[6] = { seven }
    a[7] = {} */
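For comparison, the rules listed earlier can also be followed by scanning the string once from start to finish, without any regex for the values. This is a different technique from the regex-based function above, offered only as a sketch: the function name parseCsvLine is hypothetical, and it is meant to mirror the same non-standard CSV definition (returning null for invalid input), not RFC 4180.

```javascript
// Single-pass scanner for the CSV definition used in this answer.
// Returns an array of strings, or null if the input is not well formed.
function parseCsvLine(text) {
    const values = [];
    if (text.trim() === "") return values;          // empty input: zero values
    const n = text.length;
    const isSpace = (ch) => /\s/.test(ch);
    let i = 0;
    while (true) {
        while (i < n && isSpace(text[i])) i++;      // skip whitespace before value
        let value = "";
        const quote = text[i];
        if (quote === "'" || quote === '"') {
            i++;                                    // skip opening quote
            while (i < n && text[i] !== quote) {
                if (text[i] === "\\") {
                    if (i + 1 >= n) return null;    // dangling backslash
                    // Only an escaped quote of the enclosing kind loses its
                    // backslash; every other escape pair is kept unchanged.
                    if (text[i + 1] !== quote) value += "\\";
                    value += text[i + 1];
                    i += 2;
                } else {
                    value += text[i++];
                }
            }
            if (i >= n) return null;                // unterminated quoted value
            i++;                                    // skip closing quote
        } else {
            const start = i;
            while (i < n && text[i] !== ",") {
                const ch = text[i];
                // Unquoted values may not contain quotes or backslashes.
                if (ch === "'" || ch === '"' || ch === "\\") return null;
                i++;
            }
            value = text.slice(start, i).trim();    // trim unquoted values
        }
        values.push(value);
        while (i < n && isSpace(text[i])) i++;      // skip whitespace after value
        if (i >= n) return values;                  // end of input
        if (text[i] !== ",") return null;           // junk after a quoted value
        i++;                                        // consume the separator
    }
}

console.log(parseCsvLine("'string, duppi, du', 23, lala"));
// [ 'string, duppi, du', '23', 'lala' ]
```

A scanner like this trades the compactness of the regexes for code that is easier to step through, and it handles the empty last value in the main loop rather than as a special case.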



Additional notes:

This solution requires that the CSV string be "valid". For example, unquoted values may not contain backslashes or quotes, e.g. the following CSV string is NOT valid:

var invalid1 = "one, that's me!, escaped \\, comma";

This is not really a limitation because any sub-string may be represented as either a single or double quoted value. Note also that this solution represents only one possible definition for "Comma Separated Values".
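To illustrate how another definition differs, here is a minimal sketch of an RFC 4180-style reader for a single line. In that format only double quotes enclose fields, a double quote inside a quoted field is escaped by doubling it, and whitespace is part of the field. The function name parseRfc4180Line is hypothetical, and the sketch assumes one record with no embedded line breaks and performs no validation.

```javascript
// Lenient single-line reader in the style of RFC 4180.
// Assumptions: one record, no embedded newlines, malformed input is not rejected.
function parseRfc4180Line(line) {
    const fields = [];
    let field = "";
    let inQuotes = false;
    for (let i = 0; i < line.length; i++) {
        const ch = line[i];
        if (inQuotes) {
            if (ch === '"' && line[i + 1] === '"') {
                field += '"';                      // "" inside quotes is one literal quote
                i++;
            } else if (ch === '"') {
                inQuotes = false;                  // closing quote
            } else {
                field += ch;                       // commas are ordinary characters here
            }
        } else if (ch === '"') {
            inQuotes = true;                       // opening quote
        } else if (ch === ",") {
            fields.push(field);                    // separator ends the field
            field = "";
        } else {
            field += ch;                           // whitespace is kept, not trimmed
        }
    }
    fields.push(field);                            // last field
    return fields;
}

console.log(parseRfc4180Line('"string, duppi, du",23,"say ""hi"""'));
// [ 'string, duppi, du', '23', 'say "hi"' ]
```

Note that the question's original input, which uses single quotes and backslash escapes, would not be split as desired by a reader of this kind; that is the sense in which the two definitions are incompatible.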

Edits: 2014-05-19: Added disclaimer. 2014-12-01: Moved disclaimer to top.
