Python regex pattern max length in re.compile?

Problem description

I am trying to compile a large pattern with re.compile in Python 3.

The pattern is composed of 500 short words that I want to remove from a text. The problem is that the compiled pattern seems to stop after about 18 words.

Python doesn't raise any error.

Here is what I do:

import re

# stoplist is the list of about 500 words to remove
stoplist = map(lambda s: "\\b" + s + "\\b", stoplist)
stopstring = '|'.join(stoplist)
stopword_pattern = re.compile(stopstring)

The stopstring is fine (it contains all the words), but the compiled pattern appears to be much shorter. It even stops in the middle of a word!

Is there a maximum length for a regex pattern?

Solution

Consider this example:

import re
stop_list = map(lambda s: "\\b" + str(s) + "\\b", range(1000, 2000))
stopstring = "|".join(stop_list)
stopword_pattern = re.compile(stopstring)

If you try to print the pattern, you'll see something like

>>> print(stopword_pattern)
re.compile('\\b1000\\b|\\b1001\\b|\\b1002\\b|\\b1003\\b|\\b1004\\b|\\b1005\\b|\\b1006\\b|\\b1007\\b|\\b1008\\b|\\b1009\\b|\\b1010\\b|\\b1011\\b|\\b1012\\b|\\b1013\\b|\\b1014\\b|\\b1015\\b|\\b1016\\b|\\b1017\\b|\)

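which seems to indicate that the pattern is incomplete. However, this is only a limitation of the __repr__ of compiled pattern objects, which print() falls back on: CPython cuts the displayed pattern off after about 200 characters, which is why the output can end in the middle of a word. The compiled pattern itself is complete, and the full source string is still available as stopword_pattern.pattern. If you try to perform a match against the "missing" part of the pattern, you'll see that it still succeeds: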

>>> stopword_pattern.match("1999")
<_sre.SRE_Match object; span=(0, 4), match='1999'>
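
To check this without relying on the printed form, the compiled object can be compared against the string it was built from. The following is a minimal sketch, not part of the original answer: the words list is a hypothetical stand-in for the real stoplist, and re.escape is added as a precaution in case a stop word contains regex metacharacters.

```python
import re

# Hypothetical stand-in for the 500-word stoplist from the question.
words = [str(n) for n in range(1000, 2000)]

# re.escape is an added precaution (not in the original code) for words
# that contain regex metacharacters.
stopstring = "|".join(r"\b" + re.escape(w) + r"\b" for w in words)
stopword_pattern = re.compile(stopstring)

# The compiled object keeps the full source string in its .pattern attribute.
full_pattern_kept = stopword_pattern.pattern == stopstring

# The repr, by contrast, is much shorter than the pattern it describes.
repr_is_shorter = len(repr(stopword_pattern)) < len(stopstring)

# Alternatives near the end of the pattern still take part in matching.
cleaned = stopword_pattern.sub("", "keep 1999 this")

print(full_pattern_kept, repr_is_shorter, repr(cleaned))
```

Removing a word this way leaves the surrounding whitespace in place, so the cleaned text may contain doubled spaces that need a separate pass to collapse.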
