Split a CSV string by line skipping newlines contained between quotes


Problem description


The following regex can split a CSV string by line:

var lines = csv.split(/\r|\r?\n/g);

How could this be adapted to skip newline characters that are contained within a CSV value (i.e. between quotes/double quotes)?

Example:

2,"Evans & Sutherland","230-132-111AA",,"Visual","P
CB",,1,"Offsite",

If you don't see it, here's a version with the newlines visible:

2,"Evans & Sutherland","230-132-111AA",,"Visual","P\r\nCB",,1,"Offsite",\r\n 

The part I'm trying to skip over is the newline contained in the middle of the "PCB" entry.
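
To make the failure concrete, here is a minimal sketch that runs the naive split on the sample record. It uses a plain string literal for the data and no jquery-csv code; the variable names are only illustrative.

```javascript
// Sample record from the question: the "P\r\nCB" field contains a line break.
var csv = '2,"Evans & Sutherland","230-132-111AA",,"Visual","P\r\nCB",,1,"Offsite",\r\n';

// Naive split: every line break counts, whether it is quoted or not.
var pieces = csv.split(/\r\n|\r|\n/);

console.log(pieces.length); // 3
console.log(pieces[0]);     // 2,"Evans & Sutherland","230-132-111AA",,"Visual","P
console.log(pieces[1]);     // CB",,1,"Offsite",
```

The single record comes back as two broken halves plus an empty string for the trailing line break, so neither half can pass validation in the entry parser.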

Update:

I probably should've mentioned this before but this is a part of a dedicated CSV parsing library called jquery-csv. To provide a better context I have added the current parser implementation below.

Here's the code for validating and parsing an entry (i.e. one line):

$.csvEntry2Array = function(csv, meta) {
  var meta = (meta !== undefined ? meta : {});
  var separator = 'separator' in meta ? meta.separator : $.csvDefaults.separator;
  var delimiter = 'delimiter' in meta ? meta.delimiter : $.csvDefaults.delimiter;

  // build the CSV validator regex
  var reValid = /^\s*(?:D[^D\\]*(?:\\[\S\s][^D\\]*)*D|[^SD\s\\]*(?:\s+[^SD\s\\]+)*)\s*(?:S\s*(?:D[^D\\]*(?:\\[\S\s][^D\\]*)*D|[^SD\s\\]*(?:\s+[^SD\s\\]+)*)\s*)*$/;
  reValid = RegExp(reValid.source.replace(/S/g, separator));
  reValid = RegExp(reValid.source.replace(/D/g, delimiter));

  // build the CSV line parser regex
  var reValue = /(?!\s*$)\s*(?:D([^D\\]*(?:\\[\S\s][^D\\]*)*)D|([^SD\s\\]*(?:\s+[^SD\s\\]+)*))\s*(?:S|$)/g;
  reValue = RegExp(reValue.source.replace(/S/g, separator), 'g');
  reValue = RegExp(reValue.source.replace(/D/g, delimiter), 'g');

  // Return NULL if input string is not well formed CSV string.
  if (!reValid.test(csv)) {
    return null;
  }

  // "Walk" the string using replace with callback.
  var output = [];
  csv.replace(reValue, function(m0, m1, m2) {
    // Remove backslash from any delimiters in the value
    if (m1 !== undefined) {
      var reDelimiterUnescape = /\\D/g;              
      reDelimiterUnescape = RegExp(reDelimiterUnescape.source.replace(/D/, delimiter), 'g');
      output.push(m1.replace(reDelimiterUnescape, delimiter));
    } else if (m2 !== undefined) { 
      output.push(m2);
    }
    return '';
  });

  // Handle special case of empty last value.
  var reEmptyLast = /S\s*$/;
  reEmptyLast = RegExp(reEmptyLast.source.replace(/S/, separator));
  if (reEmptyLast.test(csv)) {
    output.push('');
  }

  return output;
};

Note: I haven't tested yet but I think I could probably incorporate the last match into the main split/callback.

This is the code that does the split-by-line part:

$.csv2Array = function(csv, meta) {
  var meta = (meta !== undefined ? meta : {});
  var separator = 'separator' in meta ? meta.separator : $.csvDefaults.separator;
  var delimiter = 'delimiter' in meta ? meta.delimiter : $.csvDefaults.delimiter;
  var skip = 'skip' in meta ? meta.skip : $.csvDefaults.skip;

  // process by line
  var lines = csv.split(/\r\n|\r|\n/g);
  var output = [];
  for (var i = 0; i < lines.length; i++) {
    if(i < skip) {
      continue;
    }
    // process each value
    var line = $.csvEntry2Array(lines[i], {
      delimiter: delimiter,
      separator: separator
    });
    output.push(line);
  }

  return output;
};

For a breakdown of how that regex works, take a look at this answer. Mine is a slightly adapted version. I consolidated the single and double quote matching to match just one text delimiter and made the delimiter/separators dynamic. It does a great job of validating entries, but the line-splitting solution I added on top is pretty frail and breaks on the edge case I described above.

I'm just looking for a solution that walks the string extracting valid entries (to pass on to the entry parser) or fails on bad data, returning an error that indicates the line the parsing failed on.

Update:

splitLines: function(csv, delimiter) {
  var state = 0;
  var value = "";
  var lines = [];
  function endOfRow() {
    lines.push(value);
    value = "";
    state = 0;
  }
  csv.replace(/(\"|,|\n|\r|[^\",\r\n]+)/gm, function (m0){
    switch (state) {
      // the start of an entry
      case 0:
        if (m0 === "\"") {
          // keep the quote so the entry parser still sees a quoted value
          value += m0;
          state = 1;
        } else if (m0 === ",") {
          // empty entry: stay at the start of the next entry
          value += m0;
        } else if (m0 === "\n") {
          endOfRow();
        } else if (/^\r$/.test(m0)) {
          // carriage returns are ignored
        } else {
          value += m0;
          state = 3;
        }
        break;
      // delimited input
      case 1:
        // everything is kept here, including line breaks inside the quotes
        value += m0;
        if (m0 === "\"") {
          state = 2;
        }
        break;
      // delimiter found in delimited input
      case 2:
        // is the delimiter escaped?
        if (m0 === "\"" && value.substr(value.length - 1) === "\"") {
          value += m0;
          state = 1;
        } else if (m0 === ",") {
          value += m0;
          state = 0;
        } else if (m0 === "\n") {
          endOfRow();
        } else if (m0 === "\r") {
          // Ignore
        } else {
          throw new Error("Illegal state");
        }
        break;
      // un-delimited input
      case 3:
        if (m0 === ",") {
          value += m0;
          state = 0;
        } else if (m0 === "\"") {
          throw new Error("Unquoted delimiter found");
        } else if (m0 === "\n") {
          endOfRow();
        } else if (m0 === "\r") {
          // Ignore
        } else {
          throw new Error("Illegal data");
        }
        break;
      default:
        throw new Error("Unknown state");
    }
    return "";
  });
  // flush a last row that has no trailing newline
  if (value !== "") {
    endOfRow();
  }
  return lines;
}

All it took was four states for a line splitter:

  • 0: the start of an entry
  • 1: the following is quoted
  • 2: a second quote has been encountered
  • 3: the following isn't quoted

It's almost a complete parser. For my use case, I just wanted a line splitter so I could provide a more granular approach to processing CSV data.

Note: Credit for this approach goes to another dev whom I won't name publicly without his permission. All I did was adapt it from a complete parser to a line-splitter.

Update:

Discovered a few broken edge cases in the previous lineSplitter implementation. The one provided should be fully RFC 4180 compliant.

Solution

As I noted in a comment, there is no complete solution that uses just a single regex.

A novel method using several regexps, which splits on commas and then joins back the strings with embedded commas, is described here.
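
The linked write-up is not reproduced here, so the following is only an analogous sketch of the split-then-rejoin idea, applied to line breaks rather than commas. `splitThenRejoin` is a hypothetical helper, not part of jquery-csv. It assumes `"` is the text delimiter and that embedded quotes are escaped by doubling (`""`), so that a complete record always contains an even number of quotes. Backslash-escaped quotes would break that assumption.

```javascript
// Hypothetical helper (not part of jquery-csv): split on every line break,
// then glue pieces back together while a quoted field is still open.
function splitThenRejoin(csv) {
  var pieces = csv.split(/\r\n|\r|\n/);
  var lines = [];
  var pending = null; // partial record whose quote count is still odd

  pieces.forEach(function (piece) {
    var candidate = pending === null ? piece : pending + '\n' + piece;
    var quoteCount = (candidate.match(/"/g) || []).length;
    if (quoteCount % 2 === 0) {
      lines.push(candidate); // quotes are balanced: the record is complete
      pending = null;
    } else {
      pending = candidate;   // still inside a quoted field: keep collecting
    }
  });

  if (pending !== null) {
    throw new Error('Unterminated quoted field');
  }
  // Drop the empty piece produced by a trailing line break.
  if (lines.length > 0 && lines[lines.length - 1] === '') {
    lines.pop();
  }
  return lines;
}

var rows = splitThenRejoin('a,"b\r\nc",d\r\ne,f\r\n');
console.log(rows.length); // 2
```

One side effect of this sketch is that a line break inside a quoted value is normalized to `\n` when the pieces are rejoined, which may or may not matter for a given data set.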

Personally, I would use a simple finite state machine as described here.

The state machine has more code, but the code is cleaner and it's clear what each piece of code is doing. In the longer term this will be much more reliable and maintainable.
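
As a rough illustration of the state machine approach, here is a minimal sketch of a quote-aware line splitter. It is not the linked implementation and not jquery-csv's API: `splitCsvLines` is a hypothetical name, and the sketch assumes `"` as the text delimiter with RFC 4180 style doubled quotes (`""`) as the escape. It tracks only two states (inside or outside a quoted field), keeps each record's raw text intact for a separate entry parser, and reports the physical line on which an unterminated quote was opened.

```javascript
// Hypothetical helper: returns the raw text of each record, quotes included.
function splitCsvLines(csv) {
  var lines = [];
  var current = '';      // raw text of the record being collected
  var inQuotes = false;  // state: inside a quoted field or not
  var physicalLine = 1;  // 1-based line counter, used for error messages
  var quoteOpenedOn = 0; // physical line where the open quote was seen

  for (var i = 0; i < csv.length; i++) {
    var ch = csv.charAt(i);

    if (ch === '"') {
      if (inQuotes && csv.charAt(i + 1) === '"') {
        current += '""'; // escaped quote inside a quoted field
        i++;
      } else {
        inQuotes = !inQuotes;
        if (inQuotes) { quoteOpenedOn = physicalLine; }
        current += ch;
      }
    } else if (ch === '\r' || ch === '\n') {
      var isCrLf = ch === '\r' && csv.charAt(i + 1) === '\n';
      var lineBreak = isCrLf ? '\r\n' : ch;
      if (isCrLf) { i++; } // consume the \n of a \r\n pair
      physicalLine++;
      if (inQuotes) {
        current += lineBreak; // the break belongs to the quoted value
      } else {
        lines.push(current);  // the break ends the record
        current = '';
      }
    } else {
      current += ch;
    }
  }

  if (inQuotes) {
    throw new Error('Unterminated quoted field starting on line ' + quoteOpenedOn);
  }
  if (current !== '') {
    lines.push(current); // last record without a trailing line break
  }
  return lines;
}

var records = splitCsvLines('2,"Evans & Sutherland","P\r\nCB",,1,"Offsite",\r\n');
console.log(records.length); // 1
```

The sketch deliberately does not validate the entries themselves (for example, a stray quote in the middle of an unquoted value); that job is left to whatever entry parser receives each returned string.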
