Python regexp for a few thousand words

Problem description

I'm trying to find certain keywords in a string with Python. The string looks something like this:

A was changed from B to C

All I am trying to find is the "to C" part, where C is one of several thousand words.

This code builds the regexp string:

pre_pad = 'to '
regex_string = None
for i in words:
    if regex_string is None:
        regex_string = '\\b%s%s(?!-)(?!_)\\b' %(pre_pad, i)
    else:
        regex_string = regex_string + '|\\b%s%s(?!-)(?!_)\\b' %(pre_pad, i)

Later I do:

matches = []
for match in re.finditer(r"%s" %regex_string, text):
    matches.append([match, MATCH_TYPE])

This code works on Linux but crashes on macOS with "Caught OverflowError while rendering: regular expression code size limit exceeded".

I realize that regex_string is very long and that this is the cause of the problem:

print regex_string.__len__()
63574

How can I fix this so that it always works, regardless of the number of words?

I forgot to mention that pre_pad is sometimes empty (pre_pad = ''), so searching for pre_pad first is not always possible.

In addition, the reason I build the entire regex_string first and then match against it is that I have to run this matching on many thousands of entries. If I had to rebuild regex_string for every entry, performance would be very poor.

Oh, and I need to know which word matched.

Recommended answer

This is not a task you should expect to solve with one huge regexp, and you should not expect such a regexp to perform better than this:

import re

pre_pad = 'to '
matches = []

# Search for one word at a time so that no single pattern grows too large.
for word in words:
    regex_string = r'\b%s%s(?!-)(?!_)\b' % (pre_pad, word)
    for match in re.finditer(regex_string, text):
        matches.append([match, MATCH_TYPE])
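
Since the question mentions running this against many thousands of entries, one way to keep the per-word loop above from rebuilding patterns each time is to compile them once and reuse them. The following is only a minimal sketch: the helper names `compile_word_patterns` and `find_matches` are hypothetical, and the use of `re.escape` is an added assumption (it guards against words that contain regex metacharacters), not part of the original answer.

```python
import re

def compile_word_patterns(words, pre_pad='to '):
    """Compile one pattern per word, once, for reuse on every text."""
    patterns = {}
    for word in words:
        # Assumption: words are literal text, so escape them before embedding.
        patterns[word] = re.compile(
            r'\b%s%s(?!-)(?!_)\b' % (re.escape(pre_pad), re.escape(word))
        )
    return patterns

def find_matches(patterns, text):
    """Return (word, match) pairs, so the caller knows which word matched."""
    found = []
    for word, pattern in patterns.items():
        for match in pattern.finditer(text):
            found.append((word, match))
    return found
```

Here `compile_word_patterns(words)` would be called once up front, and `find_matches(patterns, text)` once per entry. Pairing each match with its word also covers the requirement to know which word matched.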

Also, if profiling your code shows that the chained regexp (all words joined with "|") is faster, keep track of the regexp string length while building it and split the full task into 2, 3, or 10 parts to avoid the overflow.
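
The splitting idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's code: the function names are hypothetical, and `max_len` is an arbitrary placeholder for a per-pattern length budget, because the actual size limit depends on the platform and Python build. The sketch writes the prefix and lookaheads once per chunk and wraps the words in a capturing group so the matched word can be read back.

```python
import re

def build_chunked_patterns(words, pre_pad='to ', max_len=10000):
    """Compile several alternation patterns, each within a length budget.

    max_len is a hypothetical budget on the joined alternation text,
    not a documented limit of the re module.
    """
    patterns, chunk, chunk_len = [], [], 0
    for word in words:
        escaped = re.escape(word)
        # Start a new chunk when adding this word would exceed the budget.
        if chunk and chunk_len + len(escaped) + 1 > max_len:
            patterns.append(_compile_chunk(chunk, pre_pad))
            chunk, chunk_len = [], 0
        chunk.append(escaped)
        chunk_len += len(escaped) + 1  # +1 for the '|' separator
    if chunk:
        patterns.append(_compile_chunk(chunk, pre_pad))
    return patterns

def _compile_chunk(escaped_words, pre_pad):
    # Group 1 captures whichever word of this chunk matched.
    return re.compile(
        r'\b%s(%s)(?!-)(?!_)\b' % (re.escape(pre_pad), '|'.join(escaped_words))
    )

def find_words(patterns, text):
    """Return the matched words found by any chunk pattern."""
    return [m.group(1) for p in patterns for m in p.finditer(text)]
```

The patterns are built once and then reused for every entry, which addresses the concern about rebuilding the regexp each time. Whether this is actually faster than the one-word-per-pattern loop should be checked by profiling, as the answer suggests.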

P.S.

print len(regex_string)

is more Pythonic than calling regex_string.__len__() directly.
