RegEx - Order of OR'd values in capture group changes results

Question

Visual Studio / XPath / RegEx:

Given the expression:

(?<TheObject>(Car|Car Blue)) +(?<OldState>.+) +---> +(?<NewState>.+)

Given the search string:

Car Blue Flying ---> Crashed

What I expect:

TheObject = "Car Blue"
OldState = "Flying"
NewState = "Crashed"

What I get:

TheObject = "Car"
OldState = "Blue Flying"
NewState = "Crashed"

Given the new expression:

(?<TheObject>(Car Blue|Car)) +(?<OldState>.+) +---> +(?<NewState>.+)

The results are (what I want):

TheObject = "Car Blue"
OldState = "Flying"
NewState = "Crashed"

I conceptually get what's happening under the hood: the RegEx puts the first (left-to-right) match it finds in the OR'd list into the <TheObject> group and then moves on.
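
The behaviour can be reproduced with a minimal sketch. It is written in JavaScript on the assumption that its named-group syntax and ordered alternation behave like the .NET engine for this particular pattern; the helper name `parse` is made up for illustration.

```javascript
// Hypothetical helper: run a pattern and return a copy of its named groups.
function parse(pattern, text) {
  const match = pattern.exec(text);
  return match ? { ...match.groups } : null;
}

const input = "Car Blue Flying ---> Crashed";

// Shorter alternative listed first: "Car" is accepted and "Blue" leaks into OldState.
const shortFirst = /(?<TheObject>(Car|Car Blue)) +(?<OldState>.+) +---> +(?<NewState>.+)/;
// Longer alternative listed first: "Car Blue" is tried, and accepted, first.
const longFirst = /(?<TheObject>(Car Blue|Car)) +(?<OldState>.+) +---> +(?<NewState>.+)/;

const a = parse(shortFirst, input);
const b = parse(longFirst, input);
console.log(a.TheObject, "|", a.OldState, "|", a.NewState);
console.log(b.TheObject, "|", b.OldState, "|", b.NewState);
```

The engine only moves on to a later alternative when the rest of the pattern fails after an earlier one, which is why the order of the list changes the result.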

The OR'd list is built at run time, and I cannot guarantee the order in which "Car" or "Car Blue" is added to the OR'd list in the <TheObject> group. (This is a dramatically simplified OR'd list.)

I could brute-force it by sorting the OR'd list from longest to shortest, but I was looking for something a little more elegant.

Is there a way to make the <TheObject> group capture the longest alternative it can find in the OR'd list instead of the first one it finds? (Without my having to worry about the order.)

Thanks,

Recommended answer

I would normally automatically agree with an answer like ltux's, but not in this case.

You say the alternation group is generated dynamically. How frequently is it generated dynamically? If it's every user request, it's probably faster to do a quick sort (either by longest length first, or reverse-alphabetically) on the object the expression is built from than to write something that turns (Car|Car Red|Car Blue) into (Car( Red| Blue)?).

The regex may take a bit longer (you probably won't even notice a difference in the speed of the regex) but the assembly operation may be much faster (depending on the architecture of the source of your data for the alternation list).

In a simple test of an alternation with 702 options, the three methods gave comparable results with an option set like the one below. None of these results account for the time needed to build the pattern string, which grows as the complexity of the string grows.

The options are all the same, just in different formats:

  • zap
  • yes
  • xerox
  • ...
  • apple

Using Google Chrome and JavaScript, I tried three (edit: four) different formats and saw consistent results for all of them, between 0 and 2 ms.

  • 'Optimized factoring': a(?:4|3|2|1)?
  • Reverse alphabetical sorting: (?:a4|a3|a2|a1|a)
  • Factoring: a(?:4)?|a(?:3)?|a(?:2)?|a(?:1)?. All are consistently coming in at 0 to 2 ms (the difference being what else my machine might be doing at the moment, I suppose).
  • Update: I found a way that you may be able to do this in regular expressions without sorting, using a lookahead like this: (?=a|a1|a2|a3|a4|a5)(.{15}|.{14}|.{13}|...|.{2}|.) where 15 is the upper bound, counting all the way down to the lower bound.
    • Without some restraints on this method, I feel like it can lead to a lot of problems and false positives. It would be my least preferred option. If the lookahead matches, the capture group (.{15}|...) will capture more than you want whenever it can. In other words, it will reach ahead past the match.
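
The overshoot described in the last bullet can be seen in a small JavaScript sketch. The pattern is a scaled-down version of the lookahead above (upper bound 3 instead of 15), and the sample input is an assumption chosen for illustration.

```javascript
// The lookahead only checks that one of the options starts here; the capture
// group then grabs as many characters as it can, trying 3, then 2, then 1.
const lookaheadPattern = /(?=a|a1|a2)(.{3}|.{2}|.)/;

const sample = "a1 then more text";
const captured = lookaheadPattern.exec(sample)[1];

// The intended option is "a1", but the group takes three characters,
// so the trailing space is captured as well.
console.log(JSON.stringify(captured));
```

Nothing ties the length of the capture to the alternative that satisfied the lookahead, which is the source of the false positives mentioned above.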

Though I made up the term "Optimized Factoring" in comparison to my Factoring example, I can't recommend my Factoring example syntax for any reason. Sorting would be the most logical choice, and it is easier to read and maintain than exploiting a lookahead.

You haven't given much insight into your data, but you may still need to sort the subgroups or factor further if the sub-options can contain spaces and may overlap, further diminishing the value of "Optimized Factoring".

To be clear, I am providing a thorough examination of why no form of factoring is a gain here, at least not in any way that I can see. A simple Array.Sort().Reverse().Join("|") gives exactly what anyone in this situation would need.
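
As a rough sketch of that one-liner in JavaScript: the Array.Sort().Reverse().Join("|") above reads as shorthand rather than a literal API, and `escapeRegExp`, `buildAlternation`, and the sample option list are hypothetical names introduced here for illustration.

```javascript
// Escape regex metacharacters so each option is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Sort longest-first so a longer option is always tried before any prefix of it.
function buildAlternation(options) {
  return [...new Set(options)]
    .sort((x, y) => y.length - x.length || x.localeCompare(y))
    .map(escapeRegExp)
    .join("|");
}

const options = ["Car", "Car Blue", "Car Red"]; // assumed to arrive in arbitrary order
const alternation = buildAlternation(options);
const pattern = new RegExp(
  `(?<TheObject>(${alternation})) +(?<OldState>.+) +---> +(?<NewState>.+)`
);

const result = pattern.exec("Car Blue Flying ---> Crashed").groups;
console.log(alternation);
console.log(result.TheObject, "|", result.OldState, "|", result.NewState);
```

Sorting by length is the explicit form of the longest-first ordering. A reverse alphabetical sort also places every option before any of its own prefixes, which is all the alternation needs.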
