Greedy/Non-Greedy pattern matching and optional suffixes in Lua


Problem description

In Lua, I'm trying to pattern match and capture:

+384 Critical Strike (Reforged from Parry Chance)

as

(+384) (Critical Strike)

where the suffix (Reforged from %s) is optional.

I'm trying to match a string in Lua using patterns (i.e. strfind).

Note: In Lua they don't call them regular expressions; they call them patterns, because they're not regular.

Example strings:

+384 Critical Strike
+1128 Hit

Each is broken down into two parts that I want to capture:

  • The number, with the leading positive or negative indicator; in this case +384
  • The string; in this case Critical Strike

I can capture these using a fairly simple pattern, and this pattern works in Lua:

local text = "+384 Critical Strike";
local pattern = "([%+%-]%d+) (.+)";
local _, _, value, stat = strfind(text, pattern);

  • value = +384
  • stat = Critical Strike
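
For readers on a stock Lua 5.x interpreter, where the global strfind used in the snippet above may not be defined, the same capture can be sketched with string.find. The variable names follow the snippet; the switch to string.find is an assumption about the environment, not something the question states.

```lua
-- Minimal sketch, assuming stock Lua 5.x (string.find instead of the global strfind).
local text = "+384 Critical Strike"
local pattern = "([%+%-]%d+) (.+)"

-- string.find returns the start and end indices first, then the captures.
local first, last, value, stat = string.find(text, pattern)

print(value) -- +384
print(stat)  -- Critical Strike
```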

Now I need to expand that pattern to include an optional suffix:

      +384 Critical Strike (Reforged from Parry Chance)
      

This breaks down into the signed number (+384), the stat name (Critical Strike), and the optional suffix.

Note: I don't particularly care about the optional trailing suffix; I have no requirement to capture it, although capturing it would be handy.

This is where I start to run into issues with greedy capturing. Right away, the pattern I already have does what I don't want it to:

  • pattern = ([%+%-]%d+) (.+)
  • value = +384
  • stat = Critical Strike (Reforged from Parry Chance)

But let's try to include the suffix, with the following pattern:

      pattern = "([%+%-]%d+) (.+)( %(Reforged from .+%))?"
      

Here I'm using the ? operator to indicate 0 or 1 appearances of the suffix, but that matches nothing.
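
One likely reason this matches nothing: in Lua patterns the quantifiers (*, +, -, ?) apply to a single character class, not to a parenthesized group, so a ? that follows ) is treated as a literal question mark. A small sketch of that behavior, assuming Lua 5.1 or later for string.match (the subject strings are made up for illustration):

```lua
-- Sketch, assuming Lua 5.1+ (string.match). A '?' after a capture group is a literal '?'.
local pattern = "([%+%-]%d+) (.+)( %(Reforged from .+%))?"

local plain  = "+384 Critical Strike (Reforged from Parry Chance)"
local with_q = plain .. "?"   -- same text with a literal question mark appended

-- No literal '?' in the subject, so the whole match fails.
local v1 = string.match(plain, pattern)

-- With a trailing '?', the pattern matches and the three captures are filled.
local v2, stat2, suffix2 = string.match(with_q, pattern)

print(v1)                  -- nil
print(v2, stat2, suffix2)
```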

I blindly tried changing the optional suffix group from parentheses ( to brackets [:

      pattern = "([%+%-]%d+) (.+)[ %(Reforged from .+%)]?"
      

But now the match is greedy again:

  • value = +384
  • stat = Critical Strike (Reforged from Parry Chance)
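
The brackets probably do not group anything here: [ ... ] builds a character set, so [ %(Reforged from .+%)]? reads as "optionally one character out of space, (, ), ., + and the letters R e f o r g d m". A sketch that probes the set one character at a time, assuming Lua 5.1+ (the anchored probe pattern is an illustration, not from the question):

```lua
-- Sketch, assuming Lua 5.1+. The bracket expression is a set of single characters.
local set = "^[ %(Reforged from .+%)]$"   -- anchored so exactly one character must match

print(string.match("R", set))   -- R    (a member of the set)
print(string.match("(", set))   -- (    (the escaped parenthesis is a member)
print(string.match("P", set))   -- nil  (not a member)
print(string.match("Re", set))  -- nil  (two characters never fit a single set item)

-- Because the set is optional and (.+) is greedy, (.+) still swallows the suffix.
local value, stat = string.match(
  "+384 Critical Strike (Reforged from Parry Chance)",
  "([%+%-]%d+) (.+)[ %(Reforged from .+%)]?")
print(stat) -- Critical Strike (Reforged from Parry Chance)
```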

Based on the Lua pattern reference:

  • x: (where x is not one of the magic characters ^$()%.[]*+-?) represents the character x itself.
  • .: (a dot) represents all characters.
  • %a: represents all letters.
  • %c: represents all control characters.
  • %d: represents all digits.
  • %l: represents all lowercase letters.
  • %p: represents all punctuation characters.
  • %s: represents all space characters.
  • %u: represents all uppercase letters.
  • %w: represents all alphanumeric characters.
  • %x: represents all hexadecimal digits.
  • %z: represents the character with representation 0.
  • %x: (where x is any non-alphanumeric character) represents the character x. This is the standard way to escape the magic characters. Any punctuation character (even the non-magic) can be preceded by a '%' when used to represent itself in a pattern.
  • [set]: represents the class which is the union of all characters in set. A range of characters can be specified by separating the end characters of the range with a '-'. All classes %x described above can also be used as components in set. All other characters in set represent themselves. For example, [%w_] (or [_%w]) represents all alphanumeric characters plus the underscore, [0-7] represents the octal digits, and [0-7%l%-] represents the octal digits plus the lowercase letters plus the '-' character. The interaction between ranges and classes is not defined. Therefore, patterns like [%a-z] or [a-%%] have no meaning.
  • [^set]: represents the complement of set, where set is interpreted as above.

For all classes represented by single letters (%a, %c, etc.), the corresponding uppercase letter represents the complement of the class. For instance, %S represents all non-space characters.

The definitions of letter, space, and other character groups depend on the current locale. In particular, the class [a-z] may not be equivalent to %l.

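and the magic matchers:

  • *, which matches 0 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
  • +, which matches 1 or more repetitions of characters in the class. These repetition items will always match the longest possible sequence;
  • -, which also matches 0 or more repetitions of characters in the class. Unlike '*', these repetition items will always match the shortest possible sequence;
  • ?, which matches 0 or 1 occurrence of a character in the class;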
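
The difference between the longest-match and shortest-match items can be seen on a tiny subject string. This is a sketch assuming Lua 5.1+; the subject strings are invented for illustration.

```lua
-- Sketch, assuming Lua 5.1+. Greedy '*' versus lazy '-' on the same subject.
local subject = "<a><b>"

local greedy = string.match(subject, "<(.*)>")  -- longest run before a closing '>'
local lazy   = string.match(subject, "<(.-)>")  -- shortest run before a closing '>'

print(greedy) -- a><b
print(lazy)   -- a

-- With nothing after it, a lazy item is satisfied by zero characters.
local empty = string.match("abc", "(.-)")
print(#empty) -- 0
```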

I noticed that there's a greedy * modifier and a non-greedy - modifier. Since my middle string matcher:

      (%d) (%s) (%s)
      

seems to be absorbing text until the end, perhaps I should try to make it non-greedy by changing the * to a -:

      oldPattern = "([%+%-]%d+) (.*)[ %(Reforged from .+%)]?"
      newPattern = "([%+%-]%d+) (.-)[ %(Reforged from .+%)]?"
      

Except now it fails to match:

  • value = +384
  • stat = nil
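
A possible explanation: after the lazy (.-) the only remaining item is the optional set, which is satisfied by matching nothing, so (.-) stops at zero characters. On a stock Lua 5.x interpreter the capture then comes back as an empty string (the nil reported above may reflect how the value was displayed). Anchoring the pattern at the end with $ gives the lazy item something it has to reach. A sketch assuming Lua 5.1+; the anchored pattern is a hypothetical variant, not taken from the question:

```lua
local text = "+384 Critical Strike"

-- The pattern from the question: the lazy capture stops immediately.
local value, stat = string.match(text, "([%+%-]%d+) (.-)[ %(Reforged from .+%)]?")
print(value, #stat)   -- +384 and 0

-- Hypothetical variant: '$' forces the lazy capture to expand to the end of the string.
local value2, stat2 = string.match(text, "^([%+%-]%d+) (.-)$")
print(value2, stat2)  -- +384 and Critical Strike
```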

Rather than the middle group capturing "any" character (i.e. .), I tried a set that contains everything except (:

      pattern = "([%+%-]%d+) ([^%(]*)( %(Reforged from .+%))?"
      

and from there the wheels came off the wagon:

      local pattern = "([%+%-]%d+) ([^%(]*)( %(Reforged from .+%))?"
      local pattern = "([%+%-]%d+) ((^%()*)( %(Reforged from .+%))?"
      local pattern = "([%+%-]%d+) (%a )+)[ %(Reforged from .+%)]?"
      

I thought I was close with:

      local pattern = "([%+%-]%d+) ([%a ]+)[ %(Reforged from .+%)]?"
      

which captures:

      - value = "+384"
      - stat = "Critical Strike "  (notice the trailing space)
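
The trailing space appears because [%a ]+ is greedy and the space before ( is itself a member of the set. One way to drop it, sketched here under the assumption that stat names consist only of letters and spaces and contain at least two letters, is to force the capture to begin and end on a letter. The pattern below is an illustration, not one of the patterns tried above; it assumes Lua 5.1+.

```lua
-- Sketch, assuming Lua 5.1+ and letter-only stat names of at least two letters.
-- %a[%a ]*%a: a letter, then letters or spaces, then a final letter (no trailing space).
local pattern = "^([%+%-]%d+) (%a[%a ]*%a)"

local value, stat = string.match("+384 Critical Strike (Reforged from Parry Chance)", pattern)
print(value, "[" .. stat .. "]")   -- +384 and [Critical Strike]

local value2, stat2 = string.match("+1128 Hit", pattern)
print(value2, "[" .. stat2 .. "]") -- +1128 and [Hit]
```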
      

So this is where I bang my head against the pillow and go to sleep; I can't believe I've spent four hours on this regex... pattern.

@NicolBolas The set of all possible strings, defined using a pseudo-regular expression language, is:

      +%d %s (Reforged from %s)
      

where

  • + represents either the plus sign (+) or the minus sign (-)
  • %d represents any Latin digit character (e.g. 0..9)
  • %s represents any Latin uppercase or lowercase letter, or an embedded space (e.g. A-Za-z)
  • the remaining characters are literals.

If I had to write a regular expression that obviously tries to do what I want:

      [+\-]\d+ [\w\s]+( \(Reforged from [\w\s]+\))?
      

But in case I didn't explain it well enough, I can give you a nearly complete list of the values I'm likely to encounter in the wild:

  • +123 Parry: positive number, single word
  • +123 Critical Strike: positive number, two words
  • -123 Parry: negative number, single word
  • -123 Critical Strike: negative number, two words
  • +123 Parry (Reforged from Dodge): positive number, single word, optional suffix present with single word
  • +123 Critical Strike (Reforged from Dodge): positive number, two words, optional suffix present with single word
  • -123 Parry (Reforged from Hit Chance): negative number, single word, optional suffix present with two words
  • -123 Critical Strike (Reforged from Hit Chance): negative number, two words, optional suffix present with two words

There are bonus strings that the pattern should obviously also match:

  • +1234 Critical Strike Chance: four digit number, three words
  • +12345 Mount and run speed increase: five digit number, five words
  • +123456 Mount and run speed increase: six digit number, five words
  • -1 MoUnT aNd RuN sPeEd InCrEaSe: one digit number, five words
  • -1 HiT (Reforged from CrItIcAl StRiKe ChAnCe): negative one digit number, one word, optional suffix present with three words

And while the ideal pattern should match the above bonus entries, it does not have to.

In reality, all "numbers" I am attempting to parse out will be localized, e.g.:

  • +123,456 in English (en-US)
  • +123.456 in German (de-DE)
  • +123'456 in French (fr-CA)
  • +123 456 in Estonian (et-EE)
  • +1,23,456 in Assamese (as-IN)

Any answer must not attempt to account for these localization issues. You do not know the locale a number will be presented in, which is why number localization has been removed from the question. You must strictly assume that numbers contain a plus sign or hyphen-minus, and Latin digits 0 through 9. I already know how to parse localized numbers. This question is about trying to match the optional suffix with a greedy pattern parser.

Edit: You really didn't have to try to handle localized numbers. At some level, trying to handle them without knowing the locale is wrong. For one thing, I didn't include all possible localizations of numbers. For another, I don't know what localizations might exist in the future.

Recommended answer

Hmm, I don't have Lua4 installed, but this pattern works under Lua5. I would expect it to work for Lua4 as well.

Update 1: Since additional requirements have been specified (localization), I've adapted the pattern and the tests to reflect these.

Update 2: Updated the pattern and tests to deal with an additional class of text containing a number, as mentioned by @IanBoyd in the comments. Added an explanation of the string pattern.

Update 3: Added a variation for the case where the localized number is dealt with separately, as mentioned in the last update to the question.

Try:

      "(([%+%-][',%.%d%s]-[%d]+)%s*([%a]+[^%(^%)]+[%a]+)%s*(%(?[%a%s]*%)?))"
      

or, with no attempt to validate number localization tokens, just take anything which is not a letter, with a digit sentinel at the end of the number sub-pattern:

      "(([%+%-][^%a]-[%d]+)%s*([%a]+[^%(^%)]+[%a]+)%s*(%(?[%a%s]*%)?))"
      

Neither of the patterns above is meant to deal with numbers in scientific notation (e.g. 1.23e+10).

Lua5 test (edited to clean up; the tests were getting cluttered):

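       -- Run the pattern against every string in tab and print the four captures.
       local function test(tab, pattern)
         for i, v in ipairs(tab) do
           local f1, f2, f3, f4 = v:match(pattern)
           -- tostring() avoids a string.format error on Lua 5.1 when a match fails (nil captures)
           print(string.format("Test{%d} - Whole:{%s}\nFirst:{%s}\nSecond:{%s}\nThird:{%s}\n",
             i, tostring(f1), tostring(f2), tostring(f3), tostring(f4)))
         end
       end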
      
       local pattern = "(([%+%-][',%.%d%s]-[%d]+)%s*([%a]+[^%(^%)]+[%a]+)%s*(%(?[%a%s]*%)?))"
       local testing = {"+123 Parry",
         "+123 Critical Strike",
         "-123 Parry",
         "-123 Critical Strike",
         "+123 Parry (Reforged from Dodge)",
         "+123 Critical Strike (Reforged from Dodge)",
         "-123 Parry (Reforged from Hit Chance)",
         "-123 Critical Strike (Reforged from Hit Chance)",
         "+122384    Critical    Strike      (Reforged from parry chance)",
         "+384 Critical Strike ",
         "+384Critical Strike (Reforged from parry chance)",
         "+1234 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
         "+12345 Mount and run speed increase (Reforged from CrItIcAl StRiKe ChAnCe)",
         "+123456 Mount and run speed increase (Reforged from CrItIcAl StRiKe ChAnCe)",
         "-1 MoUnT aNd RuN sPeEd InCrEaSe (Reforged from CrItIcAl StRiKe ChAnCe)",
         "-1 HiT (Reforged from CrItIcAl StRiKe ChAnCe)",
         "+123,456 +1234 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
         "+123.456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
         "+123'456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
         "+123 456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
         "+1,23,456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
         "+9 mana every 5 sec",
         "-9 mana every 20 min (Does not occurr in data but gets captured if there)"}
       test(testing, pattern)
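
For a quick check that does not rely on reading printed output, the captures can also be asserted directly. This is a sketch assuming Lua 5.1+; the helper name check is hypothetical, and the expected values were worked out by hand from the first pattern above.

```lua
local pattern = "(([%+%-][',%.%d%s]-[%d]+)%s*([%a]+[^%(^%)]+[%a]+)%s*(%(?[%a%s]*%)?))"

-- Hypothetical helper: assert the number, stat and suffix captures for one input.
local function check(input, number, stat, suffix)
  local _, n, s, x = string.match(input, pattern)
  assert(n == number, input)
  assert(s == stat, input)
  assert(x == suffix, input)
end

check("+123 Parry", "+123", "Parry", "")
check("+123 Critical Strike (Reforged from Dodge)",
      "+123", "Critical Strike", "(Reforged from Dodge)")
check("+1,23,456 Critical Strike Chance (Reforged from CrItIcAl StRiKe ChAnCe)",
      "+1,23,456", "Critical Strike Chance", "(Reforged from CrItIcAl StRiKe ChAnCe)")
check("+9 mana every 5 sec", "+9", "mana every 5 sec", "")
print("all checks passed")
```

Note that inside a set only a leading ^ negates; the second ^ in [^%(^%)] is a literal caret, so that set excludes (, ) and ^.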
      

Here is a breakdown of the pattern:

      local explainPattern =  
         "(" -- start whole string capture
         ..
         --[[
         capture localized number with sign - 
         take at first as few digits and separators as you can 
         ensuring the capture ends with at least 1 digit
         (the last digit is our sentinel enforcing the boundary)]]
         "([%+%-][',%.%d%s]-[%d]+)" 
         ..
         --[[
         gobble as much space as you can]]
         "%s*"
         ..
         --[[
         capture start with letters, followed by anything which is not a bracket 
         ending with at least 1 letter]]
         "([%a]+[^%(^%)]+[%a]+)"
         ..
         --[[
         gobble as much space as you can]]
         "%s*"
         ..
         --[[
         capture an optional bracket
         followed by 0 or more letters and spaces
         ending with an optional bracket]]
         "(%(?[%a%s]*%)?)"
         .. 
         ")" -- end whole string capture
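
As an alternative to a single all-in-one pattern, a common workaround for the lack of optional groups in Lua patterns is to try a pattern that requires the suffix first and fall back to one without it. This is a sketch and not part of the answer above; it assumes Lua 5.1+, plain (non-localized) numbers as the question stipulates, and the function name parseStat is made up for illustration.

```lua
-- Hypothetical helper: returns value, stat, and the reforge source (nil when absent).
local function parseStat(text)
  -- First try the form with the suffix; the lazy captures are bounded by literal text and '$'.
  local value, stat, from = string.match(text, "^([%+%-]%d+) (.-) %(Reforged from (.-)%)$")
  if value then
    return value, stat, from
  end
  -- Fall back to the plain form: everything after the number is the stat.
  value, stat = string.match(text, "^([%+%-]%d+) (.+)$")
  return value, stat, nil
end

print(parseStat("+384 Critical Strike (Reforged from Parry Chance)"))
print(parseStat("+1128 Hit"))
```

Two simple patterns are easier to reason about than one pattern that has to emulate an optional group, at the cost of a second match attempt when the suffix is absent.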
      
