How can I split by commas while ignoring any comma that's inside quotes?


Question

I have a TypeScript file that takes a CSV file and splits each row using the following code:

var cells = rows[i].split(",");

I now need to fix this so that a comma inside quotes does not cause a split. For example, The,"quick, brown fox", jumped should split into three cells (The / quick, brown fox / jumped) instead of also splitting between quick and brown fox. What is the proper way to do this?

Recommended answer

Update:

I think the final one-line version should be:

var cells = (rows[i] + ',').split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/).slice(1).reduce((a, b) => (a.length > 0 && a[a.length - 1].length < 4) ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]] : [...a, [b]], []).map(e => e.reduce((a, b) => a !== undefined ? a : b, undefined))

Or, formatted more readably:

var cells = (rows[i] + ',')
  .split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/)
  .slice(1)
  .reduce(
    (a, b) => (a.length > 0 && a[a.length - 1].length < 4)
      ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
      : [...a, [b]],
    [],
  )
  .map(
    e => e.reduce(
      (a, b) => a !== undefined ? a : b, undefined,
    ),
  )
;

This is rather long, but it is still purely functional. Let me explain it:

First, the regular expression part. A segment you want falls into one of 3 cases:

  1. / *?([^",]+?) *?,/ which is a string containing neither " nor , optionally surrounded by spaces and followed by a , character.
  2. /" *?(.+?)" *?,/ which is a string enclosed in a pair of quotes, with any number of spaces after the closing quote, followed by a , character.
  3. /( *?),/ which is any number of spaces (possibly none) followed by a , character.

So splitting by a non-capturing group that is the union of these three patterns essentially gets us to the answer.

Recall that when splitting with a regular expression, the resulting array consists of:

  1. the strings separated by the separator (the regular expression)
  2. all the capturing groups in the separator
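
As a small illustration of that rule, here is a sketch with a made-up sample string ("a,b" and the variable names are assumptions for this example, not from the answer). A separator with a capturing group shows up in the result; one without does not:

```typescript
// Without a capturing group, only the pieces between separators are returned.
const plainPieces: string[] = "a,b".split(/,/);
// plainPieces is ["a", "b"]

// With a capturing group, the captured text is inserted between the pieces.
const piecesWithCaptures: string[] = "a,b".split(/(,)/);
// piecesWithCaptures is ["a", ",", "b"]
```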

In our case, the separators cover the whole string, so the separated strings are all empty, except for the last desired part, which is left over because no , follows it. The resulting array therefore looks like:

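  1. an empty string
  2. three strings, the 3 capturing groups of the first separator matched
  3. an empty string
  4. three strings, the 3 capturing groups of the second separator matched
  5. ...
  6. an empty string
  7. the last desired part, left on its own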

So why not simply add a , at the end so that we get a perfect pattern? This is where (rows[i] + ',') comes from.

In this case the resulting array consists of capturing groups separated by empty strings. After removing the first empty string, the elements come in groups of 4: [ 1st capturing group, 2nd capturing group, 3rd capturing group, empty string ].
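
The effect of the trailing comma can be checked on a short row. This is only a sketch: rawParts is a hypothetical helper name and "a,b" is a made-up input; the regular expression is the one from the answer.

```typescript
const separatorPattern = /(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/;

// split() is declared as returning string[], but capturing groups that did
// not take part in a match are undefined at runtime.
function rawParts(row: string): (string | undefined)[] {
  return row.split(separatorPattern);
}

const withoutComma = rawParts("a,b");
// ["", "a", undefined, undefined, "b"]
// The last cell is left over because no comma follows it.

const withComma = rawParts("a,b" + ",");
// ["", "a", undefined, undefined, "", "b", undefined, undefined, ""]
// After .slice(1) this is two groups of 4, one group per cell.
```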

What the reduce block does is group them into chunks of exactly 4:

  .reduce(
    (a, b) => (a.length > 0 && a[a.length - 1].length < 4)
      ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
      : [...a, [b]],
    [],
  )
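
The same grouping can also be written as a plain loop, which some readers may find easier to follow. A minimal sketch, assuming a positive chunk size; chunk is a hypothetical helper and not part of the answer:

```typescript
// Split an array into consecutive groups of `size` elements.
// The last group may be shorter if the length is not a multiple of `size`.
function chunk<T>(items: T[], size: number): T[][] {
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

const grouped = chunk(["a", undefined, undefined, "", "b", undefined, undefined, ""], 4);
// [["a", undefined, undefined, ""], ["b", undefined, undefined, ""]]
```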

Finally, take the first non-undefined element in each group. An unmatched capturing group appears as undefined, and the three patterns are mutually exclusive (no two of them can match at the same time), so exactly one of the three capturing groups in each group of 4 is defined, and it is precisely the desired part:

  .map(
    e => e.reduce(
      (a, b) => a !== undefined ? a : b, undefined,
    ),
  )

This completes the solution.
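
Putting the steps together, the pipeline can be wrapped in a typed function. This is a sketch rather than a full CSV parser: splitCsvRow and CELL_SEPARATOR are hypothetical names, the loops replace the reduce/map chain, and it assumes a well-formed row whose separators cover the whole string.

```typescript
// Same regular expression as in the answer above.
const CELL_SEPARATOR = /(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/;

function splitCsvRow(row: string): string[] {
  // Append a comma so every cell is followed by one, then drop the leading
  // empty string. Unmatched capturing groups are undefined at runtime.
  const parts: (string | undefined)[] = (row + ",").split(CELL_SEPARATOR).slice(1);
  const cells: string[] = [];
  // Each complete group of 4 is [group 1, group 2, group 3, empty string].
  for (let i = 0; i + 3 < parts.length; i += 4) {
    let cell: string | undefined = undefined;
    for (let j = 0; j < 3; j++) {
      if (cell === undefined) {
        cell = parts[i + j];
      }
    }
    cells.push(cell === undefined ? "" : cell);
  }
  return cells;
}

const exampleCells = splitCsvRow('The,"quick, brown fox",jumped');
// ["The", "quick, brown fox", "jumped"]
```

As far as I can tell from the lazy quantifiers, a space between a comma and an unquoted cell ends up inside the captured cell, so calling trim() on each cell afterwards may be needed for input such as the example in the question.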

I think the following should suffice:

var cells = rows[i].split(/([^",]+?|".+?") *, */).filter(e => e)

Or, if you don't want the quotes:

var cells = rows[i].split(/(?:([^",]+?)|"(.+?)") *, */).filter(e => e)
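
A quick check of how this shorter version behaves, as a sketch with made-up inputs (splitShort and shortPattern are hypothetical names). It handles the example from the question, but filter(e => e) also removes empty strings, and the pattern has no case for an empty cell:

```typescript
const shortPattern = /(?:([^",]+?)|"(.+?)") *, */;

function splitShort(row: string): string[] {
  // filter(e => e) drops undefined capturing groups and empty strings alike.
  return row.split(shortPattern).filter(e => e);
}

const questionExample = splitShort('The,"quick, brown fox", jumped');
// ["The", "quick, brown fox", "jumped"]

const rowWithEmptyCell = splitShort("a,,b");
// ["a", ",b"]
// The empty cell is not recognised, so the rest of the row stays unsplit.
```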
