When to use re.compile

Question

Bear with me: I can't include my 1,000+ line program, and there are a couple of questions in the description.

I have a few types of patterns that I am searching for:

#literally just a regular word
re.search("Word", arg)

#Varying complex pattern
re.search("[0-9]{2,6}-[0-9]{2}-[0-9]{1}", arg)

#Words with varying cases and the possibility of ending special characters 
re.search("Supplier [Aa]ddress:?|Supplier [Ii]dentification:?|Supplier [Nn]ame:?", arg)

#I also use re.findall with the patterns above
re.findall("uses patterns above", arg)

I have about 75 of these in total, and some need to be moved into deeply nested functions.

When and where should I compile the patterns?

Right now I am trying to improve my program by compiling everything in main and then passing the appropriate list of compiled RegexObjects to the function that uses it. Would this improve performance?

Would doing something like the following make my program faster?

re.compile("pattern").search(arg)

Do the compiled patterns stay in memory, so that if a function containing this line is called multiple times it skips the compiling part? That way I wouldn't have to pass data from function to function.

Is it even worth compiling all the patterns if I have to move the data around this much?

Is there a better way to match regular words without regex?

A short example of my code:

import re

def foo(arg, allWords):
   #Does some things with arg, then puts the result into a variable, 
   # this function does not use allWords

   data = arg #This is the manipulated version of arg

   return(bar(data, allWords))


def bar(data, allWords):
   if allWords[0].search(data) is not None:
      temp = data.split("word1", 1)[1]
      return(temp)

   elif allWords[1].search(data) is not None:
      temp = data.split("word2", 1)[1]
      return(temp)


def main():

   allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]

   arg = "This is a very long string from a text document input, the provided patterns might not be word1 in this string but I need to check for them, and if they are there do some cool things word3"

   #This loop runs a couple million times 
   # because it loops through a couple million text documents
   while True:
      data = foo(arg, allWords)

Recommended answer

Let's say that word1, word2 ... are regexes.

Let's rewrite these parts:

allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]

I would create one single regex for all the patterns:

allWords = re.compile("|".join(["word1", "word2", "word3"]))

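To keep each expression self-contained when it contains | itself, wrap every expression in a group. A non-capturing group (?:...) is the safer choice here: with capturing groups, split also returns the captured text, which would break the split-based version of bar:

allWords = re.compile("|".join("(?:{})".format(x) for x in ["word1", "word2", "word3"]))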

(This also works with plain words, of course, and it is still worth using a regex because of the | part.)
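As a rough sketch of how the combined pattern could be built, the snippet below joins a few sub-patterns with non-capturing groups. The pattern lists are made up for illustration, and the use of re.escape for plain words is an addition that assumes the words should be matched literally.

```python
import re

# Hypothetical sub-patterns; the second one contains its own "|".
patterns = ["word1", "Supplier [Nn]ame:?|Supplier [Aa]ddress:?", "word3"]

# (?:...) groups each sub-pattern without capturing it.
combined = re.compile("|".join("(?:{})".format(p) for p in patterns))

# For plain words rather than regexes, re.escape keeps any
# special characters (".", "+", ...) literal.
words = ["word1", "a.b", "c+d"]
literal = re.compile("|".join(re.escape(w) for w in words))

print(combined.search("see Supplier name: ACME").group())  # Supplier name:
print(literal.search("xx a.b yy").group())                 # a.b
print(literal.search("xx aXb yy"))                         # None
```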

Now, this is a loop in disguise, with each term hardcoded:

def bar(data, allWords):
   if allWords[0].search(data):
      temp = data.split("word1", 1)[1]  # that works only on non-regexes BTW
      return(temp)

   elif allWords[1].search(data):
      temp = data.split("word2", 1)[1]
      return(temp)

It can be rewritten simply as:

def bar(data, allWords):
   return allWords.split(data,maxsplit=1)[1]
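Two details of the one-line version are worth checking, and the sketch below illustrates them with made-up input strings. If nothing matches, split returns a one-element list, so indexing [1] raises IndexError where the original bar returned None. And if the sub-patterns are wrapped in capturing groups, the captured text ends up in the result list. The guard and the names ALL_WORDS and capturing are illustrative, not part of the original answer.

```python
import re

# Hypothetical word list; non-capturing groups keep the matched
# text out of the list returned by split().
ALL_WORDS = re.compile(
    "|".join("(?:{})".format(w) for w in ["word1", "word2", "word3"])
)

def bar(data, all_words):
    # Split once, at the first match of any alternative.
    parts = all_words.split(data, maxsplit=1)
    # No match: split() returns [data], so there is no tail to return.
    if len(parts) < 2:
        return None
    return parts[1]

print(repr(bar("before word2 after", ALL_WORDS)))  # ' after'
print(bar("nothing to find here", ALL_WORDS))      # None

# With capturing groups, every group is added to the result,
# and groups that did not take part in the match show up as None.
capturing = re.compile("(word1)|(word2)")
print(capturing.split("before word2 after", maxsplit=1))
# ['before ', None, 'word2', ' after']
```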

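In terms of performance:

  • the regex is compiled once at the start, so it is as fast as it can be
  • there is no loop and no pasted expressions; the "or" part is done by the regex engine, which is compiled code most of the time: you can't beat that in pure Python
  • the match and the split are done in one operation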
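One possible way to get "compiled once at the start" without passing pattern objects through every function is a module-level constant, sketched below. The names WORDS, contains_word and process are hypothetical. Note also that the re module caches recently compiled patterns, so module-level calls such as re.search("pattern", arg) do not recompile on every call, although each call still pays for the cache lookup.

```python
import re

# Compiled once, when the module is imported.
WORDS = re.compile("word1|word2|word3")

def contains_word(data):
    # Uses the module-level pattern; nothing is compiled per call.
    return WORDS.search(data) is not None

def process(documents):
    # Keep only the documents that contain one of the words.
    return [doc for doc in documents if contains_word(doc)]

docs = ["has word3 in it", "has nothing", "word1 first"]
print(process(docs))  # ['has word3 in it', 'word1 first']
```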

The last hiccup is that internally the regex engine tries the alternatives one after another, so the cost grows linearly with the number of patterns, which makes it an O(n) algorithm. To make it faster, you would have to predict which pattern is the most frequent and put it first. (My assumption is that the regexes are "disjoint", meaning that a piece of text cannot be matched by more than one of them; otherwise the longer pattern would have to come before the shorter one.)
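The ordering point can be seen in a small sketch. It assumes the alternatives are literal words (the list is made up), in which case sorting them longest first is enough to make the longer match win.

```python
import re

words = ["ab", "abc", "b"]

# Alternatives are tried left to right at each position, and the first
# one that matches wins, even if a later one would match more text.
naive = re.compile("|".join(re.escape(w) for w in words))

# Assumption for this sketch: the alternatives are literal words, so
# sorting by length (longest first) is enough to prefer the longest match.
ordered = re.compile(
    "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
)

print(naive.search("xabcx").group())    # ab
print(ordered.search("xabcx").group())  # abc
```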
