Matching unique groups while maintaining their order
Question
Is there a way to match unique groups of characters (words in the case below) in order of occurrence, purely in regex? If so, how does that expression compare in efficiency to a non-regex solution? I'm working with Python's flavor, but I would be interested in a solution for any other flavor as well.
Here's an example case:
string = 'the floodwaters are rising along the coast'
unique = ['the', 'floodwaters', 'are', 'rising', 'along', 'coast']
In a Python-regex hybrid solution I could match the groups I want and use a list comprehension to remove the duplicates while maintaining the order.
import re

groups = re.findall(r'[a-zA-Z]+', string)
# keep a word only if it has not appeared earlier in the list
unique = [g for i, g in enumerate(groups) if g not in groups[:i]]
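As a point of comparison for the efficiency question, the list comprehension above rescans the earlier part of the list for every word, which is quadratic in the number of words. A minimal sketch of a linear-time alternative follows; the function name unique_in_order is hypothetical, and it relies on dicts preserving insertion order (Python 3.7+).

```python
import re

def unique_in_order(text):
    """Return the distinct words of text in order of first occurrence."""
    words = re.findall(r'[a-zA-Z]+', text)
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so this drops later duplicates in a single pass
    return list(dict.fromkeys(words))

print(unique_in_order('the floodwaters are rising along the coast'))
# ['the', 'floodwaters', 'are', 'rising', 'along', 'coast']
```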
There are similar questions across the site, such as one that addresses matching unique words. The expression from the accepted answer, however, matches the rightmost occurrence of a given group, while I want to match the first occurrence. Here's that expression:
(\w+\b)(?!.*\1\b)
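To illustrate the difference, here is a small sketch that runs this expression with Python's built-in re module on the example string (the variable names are mine). The duplicate 'the' is kept at its last position rather than its first.

```python
import re

# Expression from the linked answer: keep a word only if it does not
# occur again later in the string.
LAST_OCCURRENCE = r'(\w+\b)(?!.*\1\b)'

text = 'the floodwaters are rising along the coast'
last_kept = re.findall(LAST_OCCURRENCE, text)
print(last_kept)
# ['floodwaters', 'are', 'rising', 'along', 'the', 'coast']
```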
Recommended answer
A regex-only solution for this kind of task is only possible with an infinite-width lookbehind.
However, a regex solution like this should only be considered when the input is relatively short: more than 100 words in the input string will make it very slow because of the backtracking that is inevitable in this case. So, purely for learning purposes, I will share a regex that works only in flavors with infinite-width lookbehind, such as .NET and the PyPI regex library for Python (it is also possible in Vim, since its lookbehind is infinite-width too, but I guess there are even simpler ways with that powerful tool).
(?s)\b(\w+)\b(?<!^.*\b\1\b.*\b\1\b)
See the regex demo.
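A minimal usage sketch for Python follows. It assumes the third-party regex package is installed (pip install regex), because the built-in re module rejects variable-width lookbehinds. The helper name first_occurrences_regex is hypothetical, and it returns None when the package is missing.

```python
try:
    import regex  # third-party package, not the standard library's re
except ImportError:
    regex = None

# Pattern from the answer: keep a word only if it has not occurred earlier.
FIRST_OCCURRENCE = r'(?s)\b(\w+)\b(?<!^.*\b\1\b.*\b\1\b)'

def first_occurrences_regex(text):
    """Return the words kept by the pattern, or None if regex is unavailable."""
    if regex is None:
        return None
    return regex.findall(FIRST_OCCURRENCE, text)

print(first_occurrences_regex('the floodwaters are rising along the coast'))
```

For the example string, the intended result is the list of words in order of first occurrence, with the second 'the' skipped.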
The (?s) part is an inline modifier that makes . match any character, including line breaks. You may use regex.DOTALL instead in Python regex.
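The effect of the modifier can be seen with a short sketch. It uses re.DOTALL from the standard library's re module, on the assumption that regex.DOTALL in the PyPI package behaves the same way; the variable names are illustrative.

```python
import re

text = 'first line\nsecond line'

# Without DOTALL, . stops at the line break.
default_match = re.match(r'.*', text).group()
# The inline modifier and the flag both let . cross the line break.
inline_match = re.match(r'(?s).*', text).group()
flag_match = re.match(r'.*', text, flags=re.DOTALL).group()

print(repr(default_match))  # 'first line'
print(inline_match == text, flag_match == text)  # True True
```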
Details
\b - initial word boundary
(\w+) - Group 1: one or more word chars
\b - trailing word boundary
(?<!^.*\b\1\b.*\b\1\b) - an infinite-width negative lookbehind that fails the match if the word captured in Group 1 appears at least once before itself, i.e. if, immediately to the left of the current location (that is, right after the captured word), there is this sequence of patterns:
  ^ - start of string
  .*\b\1\b - any zero or more chars, as many as possible, and then the same value as in Group 1 as a whole word
  .*\b\1\b - same as above (needed to match the captured word itself, since the lookbehind is checked after the word has been consumed)
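The check performed by the lookbehind can also be spelled out step by step with the built-in re module. This is only an illustrative emulation of the logic described above, not the regex-only solution itself, and the function name first_occurrences is hypothetical.

```python
import re

def first_occurrences(text):
    """Keep each word only if it does not occur, as a whole word, earlier in text."""
    result = []
    for m in re.finditer(r'\b(\w+)\b', text):
        word = m.group(1)
        # Same condition the negative lookbehind tests: is there an earlier
        # whole-word occurrence somewhere to the left of this match?
        earlier = re.search(r'\b' + re.escape(word) + r'\b', text[:m.start()])
        if earlier is None:
            result.append(word)
    return result

print(first_occurrences('the floodwaters are rising along the coast'))
# ['the', 'floodwaters', 'are', 'rising', 'along', 'coast']
```

Like the lookbehind, this rescans the text to the left for every word, so it shares the pattern's poor scaling on long inputs.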
The .* in the lookbehind causes a lot of backtracking, so the pattern will run rather slowly in general, very slowly with large inputs, and may eventually cause timeouts.