How can I split by commas while ignoring any comma that's inside quotes?
Question
I have a TypeScript file that takes a CSV file and splits each row using the following code:
var cells = rows[i].split(",");
I now need to fix this so that a comma inside quotes does not cause a split. For example, The,"quick, brown fox", jumped should split into three cells:
- The
- quick, brown fox
- jumped
instead of also splitting between quick and brown fox. What is the proper way to do this?
Recommended answer
Update:
I think the final version, written on one line, should be:
var cells = (rows[i] + ',').split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/).slice(1).reduce((a, b) => (a.length > 0 && a[a.length - 1].length < 4) ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]] : [...a, [b]], []).map(e => e.reduce((a, b) => a !== undefined ? a : b, undefined))
Or, formatted more readably:
var cells = (rows[i] + ',')
  .split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/)
  .slice(1)
  .reduce(
    (a, b) => (a.length > 0 && a[a.length - 1].length < 4)
      ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
      : [...a, [b]],
    [],
  )
  .map(
    e => e.reduce(
      (a, b) => a !== undefined ? a : b, undefined,
    ),
  )
;
This is rather long, but it is still purely functional. Let me explain it.
First, the regular expression part. Basically, each segment you want falls into one of 3 possibilities:
- / *?([^",]+?) *?,/ which is a string containing no " or , optionally surrounded by spaces and followed by a comma.
- /" *?(.+?)" *?,/ which is a string enclosed in a pair of quotes, with any number of spaces after the closing quote, followed by a comma.
- /( *?),/ which is any number of spaces followed by a comma.
Note that [^",] also matches a space and the leading " *?" is lazy, so in practice only trailing spaces are dropped from an unquoted cell; leading spaces stay inside the captured group.
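To make the three cases concrete, here is a small sketch that tries each alternative on its own. The ^ anchors and the firstGroup helper are my additions for the demonstration and are not part of the regular expression used in the answer.

```typescript
// The three alternatives from the answer, anchored with ^ for this demo only.
const unquoted = /^ *?([^",]+?) *?,/;
const quoted = /^" *?(.+?)" *?,/;
const blank = /^( *?),/;

// Hypothetical helper: returns capture group 1, or undefined when nothing matches.
function firstGroup(pattern: RegExp, input: string): string | undefined {
  const match = pattern.exec(input);
  return match === null ? undefined : match[1];
}

const plainCell = firstGroup(unquoted, "The ,rest"); // "The" (trailing space dropped)
const quotedCell = firstGroup(quoted, '"quick, brown fox" ,rest'); // "quick, brown fox"
const blankCell = firstGroup(blank, ",rest"); // ""
const leadingSpace = firstGroup(unquoted, " jumped,"); // " jumped" (leading space kept)
const notUnquoted = firstGroup(unquoted, '"quoted",rest'); // undefined
```

The last two lines show the edge cases: a leading space ends up inside the capture, and the unquoted alternative refuses a cell that starts with a quote.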
So splitting on a non-capturing group that is the union of these three alternatives basically gets us to the answer.
Recall that when splitting with a regular expression, the resulting array consists of:
- the strings separated by the separators (the regular expression matches)
- all the capturing groups of each separator
In our case, the separators fill the whole string, so the separated strings are all empty, except for the last desired part, which is left over because there is no , following it. Thus the resulting array should look like:
- an empty string
- three entries, one for each capturing group of the first separator matched
- an empty string
- three entries, one for each capturing group of the second separator matched
- ...
- an empty string
- the last desired part, on its own
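The shape described above can be checked directly. This is only an illustrative sketch using the example row from the question; the variable names are mine.

```typescript
// Capture groups in the separator are spliced into the result of split.
const simple = "a,b".split(/(,)/); // ["a", ",", "b"]

// Separator regex from the answer, applied WITHOUT the trailing comma.
const separator = /(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/;
const raw = 'The,"quick, brown fox", jumped'.split(separator);
// raw has 9 entries:
//   "", "The", undefined, undefined,
//   "", undefined, "quick, brown fox", undefined,
//   " jumped"   <- the last cell is left over, outside any group
```

Unmatched capturing groups show up as undefined at run time, even though TypeScript types the result of split as string[].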
So why not simply add a , at the end so that we get a perfect pattern? This is where (rows[i] + ',') comes from.
In this case the resulting array becomes capturing groups separated by empty strings. After removing the first empty string, the entries appear in groups of 4 as [ 1st capturing group, 2nd capturing group, 3rd capturing group, empty string ].
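A quick sketch of what the appended comma does to the example row from the question (variable names are mine):

```typescript
const separator = /(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/;
const row = 'The,"quick, brown fox", jumped';

// Append a comma so the last cell is also consumed by a separator,
// then drop the leading empty string.
const padded = (row + ",").split(separator).slice(1);
// padded has 12 entries, i.e. three groups of 4:
//   "The", undefined, undefined, "",
//   undefined, "quick, brown fox", undefined, "",
//   " jumped", undefined, undefined, ""
```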
What the reduce block does is exactly to collect them into groups of 4:
  .reduce(
    (a, b) => (a.length > 0 && a[a.length - 1].length < 4)
      ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
      : [...a, [b]],
    [],
  )
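The same grouping step can be written as a standalone, typed function, which may be easier to read and to type-check. This is a sketch: chunk is a hypothetical name of mine, and the explicit type argument on reduce is there because an untyped [] seed can be inferred too narrowly under strict TypeScript settings.

```typescript
// Hypothetical helper: split items into consecutive groups of at most `size`.
function chunk<T>(items: T[], size: number): T[][] {
  return items.reduce<T[][]>((groups, item) => {
    const last = groups[groups.length - 1];
    // Extend the last group while it has room, otherwise start a new group.
    return last !== undefined && last.length < size
      ? [...groups.slice(0, -1), [...last, item]]
      : [...groups, [item]];
  }, []);
}

const even = chunk([1, 2, 3, 4, 5, 6, 7, 8], 4); // [[1, 2, 3, 4], [5, 6, 7, 8]]
const uneven = chunk([1, 2, 3, 4, 5], 4); // [[1, 2, 3, 4], [5]]
```

Like the reduce block in the answer, this builds new arrays at every step instead of mutating the accumulator.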
And finally, find the first non-undefined element in each group. An unmatched capturing group appears as undefined, and our three patterns are mutually exclusive in that no two of them can match at the same time, so exactly one of the three capturing groups in each group of 4 is defined. Those elements are precisely the desired parts:
  .map(
    e => e.reduce(
      (a, b) => a !== undefined ? a : b, undefined,
    ),
  )
This completes the solution.
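Putting the steps together, here is a loop-based sketch of the same idea with explicit types. The function name splitCsvLine is hypothetical, and the loop replaces the reduce and map calls rather than reproducing them literally.

```typescript
// Hypothetical wrapper around the technique described above.
function splitCsvLine(row: string): string[] {
  const separator = /(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/;
  // split leaves unmatched capturing groups as undefined at run time,
  // so widen the element type that TypeScript reports as string.
  const pieces = (row + ",").split(separator).slice(1) as (string | undefined)[];

  const cells: string[] = [];
  for (let i = 0; i < pieces.length; i += 4) {
    // pieces[i] to pieces[i + 2] are the three capturing groups;
    // pieces[i + 3] is the empty string between two separators.
    cells.push(pieces[i] ?? pieces[i + 1] ?? pieces[i + 2] ?? "");
  }
  return cells;
}

const example = splitCsvLine('The,"quick, brown fox", jumped');
// ["The", "quick, brown fox", " jumped"]
const withEmptyCell = splitCsvLine("a,,b");
// ["a", "", "b"]
```

Note the leading space that survives in " jumped"; calling trim() on each cell is one way to remove it if that matters for your data.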
I think the following should suffice:
var cells = rows[i].split(/([^",]+?|".+?") *, */).filter(e => e)
Or if you don't want the quotes:
var cells = rows[i].split(/(?:([^",]+?)|"(.+?)") *, */).filter(e => e)
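For comparison, here is a sketch that runs both shorter variants on the example row and on a row with an empty cell. The variable names are mine; the regular expressions are the ones given above.

```typescript
const keepQuotes = /([^",]+?|".+?") *, */;
const dropQuotes = /(?:([^",]+?)|"(.+?)") *, */;
const row = 'The,"quick, brown fox", jumped';

const withQuotes = row.split(keepQuotes).filter((e) => e);
// ["The", "\"quick, brown fox\"", "jumped"]
const withoutQuotes = row.split(dropQuotes).filter((e) => e);
// ["The", "quick, brown fox", "jumped"]

// An empty cell is not handled by these shorter versions:
const emptyCell = "a,,b".split(keepQuotes).filter((e) => e);
// ["a", ",b"]
```

Both variants handle the example row, but the filter removes empty strings and the pattern never matches an empty cell, so rows that contain empty cells are better served by the longer version.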