How can I use Regex to find a string of characters in alphabetical order using Python?


Question

So I have a challenge I'm working on: find the longest run of consecutive alphabetical characters in a string. For example, "abcghiijkyxz" should result in "ghiijk" (yes, the i is doubled).

I've been doing quite a bit with loops to solve the problem: iterating over the entire string, then, for each character, starting a second loop using lower and ord. No help needed writing that loop.

However, it was suggested to me that regex would be great for this sort of thing. My regex is weak (I know how to grab a static set; my knowledge of lookaheads extends to knowing they exist). How would I write a regex that looks ahead and checks whether the following characters are next in alphabetical order? Or is the suggestion to use regex not practical for this type of thing?

The general consensus seems to be that regex is indeed terrible for this type of thing.
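
For comparison with the regex discussed in the answer, here is a minimal single-pass sketch of the loop-based approach. It is only one possible formulation, not the asker's code: the function name longest_alphabetical_run is hypothetical, and it assumes the input consists of ASCII letters and that a run continues when the next character is the same letter or its immediate successor (case-insensitively, via lower and ord).

```python
def longest_alphabetical_run(text):
    """Return the first longest run in which each character repeats
    or directly follows the previous one in the alphabet."""
    best = ""
    start = 0
    for i in range(1, len(text) + 1):
        # A run ends at the end of the string, or when the next character
        # is neither a repeat nor the immediate successor of the current one.
        at_end = i == len(text)
        if at_end or ord(text[i].lower()) - ord(text[i - 1].lower()) not in (0, 1):
            if i - start > len(best):
                best = text[start:i]
            start = i
    return best


result = longest_alphabetical_run("abcghiijkyxz")
```

Because ties are broken with a strict comparison, the earliest of several equally long runs is the one returned.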

Recommended answer

Just to demonstrate why regex is not practical for this sort of thing, here is a regex that matches ghiijk in your example string abcghiijkyxz. Note that it also matches abc, y, x, and z, since each of them is technically a run of alphabetical characters in order. Unfortunately, regex alone cannot tell you which match is the longest, but this does give you all the candidates. Please note that this regex works with PCRE and will not work with Python's re module! Also, Python's regex library does not currently support (*ACCEPT). Although I haven't tested it, the pyre2 package (a Python wrapper for Google's RE2, built with Cython) claims to support the (*ACCEPT) control verb, so this may be possible from Python. Bear in mind, though, that RE2 itself does not support lookahead assertions, which this pattern relies on, so that route is doubtful.

The regex in use:

((?:a+(?(?!b)(*ACCEPT))|b+(?(?!c)(*ACCEPT))|c+(?(?!d)(*ACCEPT))|d+(?(?!e)(*ACCEPT))|e+(?(?!f)(*ACCEPT))|f+(?(?!g)(*ACCEPT))|g+(?(?!h)(*ACCEPT))|h+(?(?!i)(*ACCEPT))|i+(?(?!j)(*ACCEPT))|j+(?(?!k)(*ACCEPT))|k+(?(?!l)(*ACCEPT))|l+(?(?!m)(*ACCEPT))|m+(?(?!n)(*ACCEPT))|n+(?(?!o)(*ACCEPT))|o+(?(?!p)(*ACCEPT))|p+(?(?!q)(*ACCEPT))|q+(?(?!r)(*ACCEPT))|r+(?(?!s)(*ACCEPT))|s+(?(?!t)(*ACCEPT))|t+(?(?!u)(*ACCEPT))|u+(?(?!v)(*ACCEPT))|v+(?(?!w)(*ACCEPT))|w+(?(?!x)(*ACCEPT))|x+(?(?!y)(*ACCEPT))|y+(?(?!z)(*ACCEPT))|z+(?(?!$)(*ACCEPT)))+)

Result:

abc
ghiijk
y
x
z
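
The pattern is long but entirely mechanical, so it may be easier to generate than to type. The following is a minimal Python sketch that only builds the pattern text; it does not execute it, since Python's re module cannot compile this PCRE syntax. The helper name build_pcre_pattern and its wrap parameter are hypothetical, introduced here for illustration.

```python
import string


def build_pcre_pattern(wrap=False):
    """Build the PCRE pattern text; wrap=True lets z continue into a."""
    letters = string.ascii_lowercase
    # Pair each letter with its successor. z is paired with "$",
    # or with "a" for the circular variant.
    successors = letters[1:] + ("a" if wrap else "$")
    options = "|".join(
        f"{c}+(?(?!{n})(*ACCEPT))" for c, n in zip(letters, successors)
    )
    return f"((?:{options})+)"


pattern = build_pcre_pattern()
```

The resulting string can then be pasted into a PCRE tester or passed to any engine that supports conditionals and (*ACCEPT).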

Explanation of a single option, i.e. a+(?(?!b)(*ACCEPT)):

  • a+ Matches a (literally) one or more times. This catches instances where several of the same character appear in sequence, such as aa.
  • (?(?!b)(*ACCEPT)) If clause evaluating the condition.
    • (?!b) Condition for the if clause. Negative lookahead ensuring that what follows is not b. If it is not b, we want the following control verb to take effect.
    • (*ACCEPT) If the condition above is met, we accept the current solution. This control verb causes the regex to end successfully, skipping the rest of the pattern. Since this token is inside a capturing group, only that capturing group ends successfully at that particular location, while the parent pattern continues to execute.

So what happens if the condition is not met? Well, that means that (?!b) evaluated to false, so the following character is, in fact, b, and we allow the matching (or rather capturing, in this instance) to continue. Note that the entire set of options is wrapped in (?:...)+, which allows us to match consecutive options until the (*ACCEPT) control verb or the end of the line is reached.

The only exception in this whole regular expression is z. Since it is the last character of the English alphabet (which I presume is the target of this question), no letter can continue the run after it, so we can simply use z+(?(?!$)(*ACCEPT)), which ensures the match ends after z. If you instead want z to be followed by a (circular alphabetical order matching; I don't know if that is the proper terminology, but it sounds right to me), you can use z+(?(?!a)(*ACCEPT)) as the final option.
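
Since Python's re module has no (*ACCEPT), one way to get the same list of runs from Python is to let a plain lookahead decide whether each letter may be followed by the next character. The sketch below is not the PCRE pattern above but a different formulation of the same idea. The names RUN and alphabetical_runs are made up for illustration, and it assumes lowercase ASCII input with no wrap-around from z to a. Picking the longest run still happens outside the regex.

```python
import re
import string

letters = string.ascii_lowercase

# One alternative per letter: consume the letter only while the next
# character is the same letter or its successor (z can only repeat).
steps = "|".join(
    f"{c}(?=[{c}{n}])" for c, n in zip(letters, letters[1:] + "z")
)
# The trailing [a-z] consumes the final letter of each run.
RUN = re.compile(f"(?:{steps})*[a-z]")


def alphabetical_runs(text):
    """Return every maximal run of repeated or consecutive letters."""
    return RUN.findall(text)


runs = alphabetical_runs("abcghiijkyxz")
longest = max(runs, key=len)  # regex alone cannot choose the longest
```

For the example string, this yields the same five runs as the PCRE version, and max with key=len then selects ghiijk.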
