Using regexes, how to efficiently match strings between double quotes with embedded double quotes?

Question

Let us have a text in which we want to match all strings between double quotes; but within these double quotes, there can be quoted double quotes. Example:

"He said \"Hello\" to me for the first time"

Using regexes, how do you match this efficiently?

Recommended answer

A very efficient solution to match such inputs is to use the normal* (special normal*)* pattern; this name is quoted from the excellent book by Jeffrey Friedl, Mastering Regular Expressions.

It is a pattern useful in general to match inputs consisting of regular entries (the normal part) with separators in between (the special part).

Note that, like all things regex, it should be used only when there is no better choice. For instance, while one could use this pattern to parse CSV data, if you use Java you are better off using OpenCSV instead.

Also note that while the quantifiers in the pattern name are stars (i.e., zero or more), you can vary them to suit your needs.
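
As a rough illustration of how the quantifiers can vary, here is a minimal Python sketch. The helper name build_pattern and its parameters first, inner and outer are hypothetical, introduced only for this example; they are not part of the book or of any library.

```python
import re

def build_pattern(normal, special, first="*", inner="*", outer="*"):
    # Hypothetical helper: assembles normal<first>(special normal<inner>)<outer>
    # using non-capturing groups so each quantifier applies to a whole part.
    return f"(?:{normal}){first}(?:(?:{special})(?:{normal}){inner}){outer}"

# Canonical form: every quantifier is left as a star.
canonical = re.compile(build_pattern("[a-z]", "-"))

# Variant: the first and inner stars are tightened to pluses.
strict = re.compile(build_pattern("[a-z]", "-", first="+", inner="+"))

# The canonical form tolerates empty normal parts; the variant does not.
canonical_accepts = bool(canonical.fullmatch("-a--b"))
strict_accepts = bool(strict.fullmatch("-a--b"))
strict_accepts_clean = bool(strict.fullmatch("a-b"))
```

The only point of the sketch is that the shape of the pattern stays the same while the three quantifiers change.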

Let us take the above example again; and please consider that this text sample may be anywhere in your input:

"He said \"Hello\" to me for the first time"

No matter how hard you try, no amount of "dot plus greedy/lazy quantifiers" magic will help you solve it. Instead, categorize the input between quotes as normal and special:

  • normal is anything but a backslash or a double quote: [^\\"];
  • special is the sequence of a backslash followed by a double quote: \\".

Substituting this into the normal* (special normal*)* pattern, this gives the following regex:

[^\\"]*(\\"[^\\"]*)*

Adding the surrounding double quotes to match the full text gives the final regex:

"[^\\"]*(\\"[^\\"]*)*"

You will note that this will also match empty quoted strings.
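
A short Python sketch of the regex above, for illustration only. The sample text and the variable names are assumptions made for this example; the regex itself is the one given in the answer, written as a raw string.

```python
import re

# The final regex from the answer, as a Python raw string.
QUOTED = re.compile(r'"[^\\"]*(\\"[^\\"]*)*"')

# Assumed sample input: the quoted sentence sits in the middle of other text,
# followed by an empty quoted string.
text = r'prefix "He said \"Hello\" to me for the first time" suffix ""'

# group(0) is the whole match, including the surrounding quotes.
matches = [m.group(0) for m in QUOTED.finditer(text)]

# For comparison, the "dot plus lazy/greedy quantifier" attempts.
lazy = re.search(r'".*?"', text).group(0)    # stops at the first escaped quote
greedy = re.search(r'".*"', text).group(0)   # runs on to the last quote in the text
```

Here matches holds the full quoted sentence and then the empty string "", whereas the lazy attempt is cut short at the first \" and the greedy one swallows everything up to the final quote. As written, special only covers \", so a quoted string containing a backslash followed by anything other than a double quote will not match.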

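Now consider a second example: matching words made of letters separated by dashes. Here we will have to use a variant on the quantifiers, since:

  • we don't want empty words,
  • we don't want words starting with a dash,
  • when a dash appears, it must be followed by at least one letter before the next dash (if any).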

For simplicity, we will also suppose that only lowercase, ASCII letters are allowed.

Sample input:

the-word-to-match

Let us decompose again into normal and special:

  • normal: a lowercase ASCII letter: [a-z];
  • special: the dash: -

The canonical form of the pattern would be:

[a-z]*(-[a-z]*)*

But as we said:

  • we don't want words starting with a dash: the first * should become +;
  • when a dash is found, there should be at least one letter after it: the second * should become +.

We end up with:

[a-z]+(-[a-z]+)*

Adding word anchors around it to obtain the final result:

\b[a-z]+(-[a-z]+)*\b
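
To see the three constraints at work, here is a small Python sketch. The sample strings other than the-word-to-match are made up for this example.

```python
import re

# The final regex from the answer, as a Python raw string.
WORD = re.compile(r"\b[a-z]+(-[a-z]+)*\b")

# Whole-string checks: only the first sample satisfies all three constraints.
samples = ["the-word-to-match", "-leading", "trailing-", "double--dash"]
accepted = [s for s in samples if WORD.fullmatch(s)]

# Searching inside a longer text returns each word, dashed or not.
found = [m.group(0) for m in WORD.finditer("match the-word-to-match here")]
```

Note the use of fullmatch for the whole-string checks: with a plain search, the regex would still find the letters-only part of an input such as -leading, since the word anchor is satisfied right after the dash.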

Other operator variations

The examples above limit themselves to replacing * with +, but of course you can have as many variations as you wish. One ultra classical example would be an IP address:

  • normal is up to three digits (\d{1,3}),
  • special is the dot: (\.),
  • the first normal appears only once, therefore no quantifier,
  • the normal inside the (special normal*) also appears only once, therefore no quantifier,
  • finally the (special normal*) part appears exactly three times, therefore {3}.

Which gives the expression (decorated with word anchors):

\b\d{1,3}(\.\d{1,3}){3}\b
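
A minimal Python sketch of this expression in use; the sample strings are assumptions chosen for the example.

```python
import re

# The expression from the answer, as a Python raw string.
IP = re.compile(r"\b\d{1,3}(\.\d{1,3}){3}\b")

# Finds a dotted quad embedded in other text.
hit = IP.search("host at 192.168.0.1 today")
address = hit.group(0) if hit else None

# Three groups of digits are not enough: {3} requires exactly three
# (dot, digits) repetitions after the first group of digits.
too_short = IP.search("1.2.3")

# The pattern checks shape only, not numeric range.
out_of_range = IP.search("999.999.999.999")
```

Since \d{1,3} accepts any one to three digits, out_of_range is still a match; checking that each part is at most 255 would have to be done separately, for example after splitting on the dots.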

Conclusion

This pattern's flexibility makes it one of the most useful tools in your regex toolbox. There are many problems for which you should not use regexes when a library exists, but in some situations you have to use them. And this pattern will become one of your best friends once you have practiced with it a bit!

  • It is more than likely that you don't need (or want) to capture the repeated part (the (special normal*) part); it is therefore recommended that you use a non-capturing group. For instance, use "[^\\"]*(?:\\"[^\\"]*)*" for quoted strings. In fact, had you wanted it, capturing would almost never lead to the desired results in this case, because repeating a capturing group will only ever give you the last capture (all previous repetitions will be overwritten), unless you are using this pattern in .NET. (thanks @ohaal)
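
The "last capture only" behavior can be observed with Python's re module, as in the sketch below. The short sample string and the third pattern, which wraps the whole content in a single capturing group, are additions made for this example and not part of the original note.

```python
import re

# Assumed sample: a quoted string with two escaped quotes inside.
text = r'"a\"b\"c"'

capturing = re.compile(r'"[^\\"]*(\\"[^\\"]*)*"')
non_capturing = re.compile(r'"[^\\"]*(?:\\"[^\\"]*)*"')

# The repeated capturing group keeps only its last repetition.
last_repetition = capturing.fullmatch(text).group(1)

# The non-capturing version matches the same text and exposes no groups.
no_groups = non_capturing.fullmatch(text).groups()

# If the content between the quotes is wanted, one option is a single
# capturing group around the whole non-capturing pattern.
content_re = re.compile(r'"([^\\"]*(?:\\"[^\\"]*)*)"')
content = content_re.fullmatch(text).group(1)
```

Here last_repetition is \"c: the earlier repetition \"b has been overwritten, as the note describes.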
