regex.sub() gives different results to re.sub()


Problem description


I work with Czech accented text in Python 3.4.

Calling re.sub() to perform substitution by regex on an accented sentence works well, but using a regex compiled with re.compile() and then calling regex.sub() fails.

Here is the case, where I use the same arguments for re.sub() and regex.sub()

import re

pattern = r'(?<!\*)(Poplatn[ií]\w+ da[nň]\w+)'
flags = re.I|re.L
compiled = re.compile(pattern, flags)
text = 'Poplatníkem daně z pozemků je vlastník pozemku'
mark = r'**\1**' # wrap 1st matching group in double stars

print(re.sub(pattern, mark, text, flags))
# outputs: **Poplatníkem daně** z pozemků je vlastník pozemku
# substitution works

print(compiled.sub(mark, text))
# outputs: Poplatníkem daně z pozemků je vlastník pozemku
# substitution fails

I believe that the reason is accents, because for a non-accented sentence re.sub() and regex.sub() work identically.

But it seems like a bug to me, because passing the same arguments returns different results, which should not happen. The topic is complicated by different platforms and locales, so it may not be reproducible on your system. Here is a screenshot of my console.

Do you see any fault in my code, or should I report it as a bug?

Solution

As Padraic Cunningham figured out, this is not actually a bug.

However, it is related to a bug which you didn't run into, and to you using a flag you probably shouldn't be using, so I'll leave my earlier answer below, even though his is the right answer to your problem.
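For readers who have not seen the other answer: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a fourth positional argument is taken as count, not flags. The call re.sub(pattern, mark, text, flags) in the question therefore runs with no flags at all, while the compiled pattern really does use them. Below is a minimal sketch of that binding. It is not the question's exact setup: it uses only re.I and a lowercased pattern (both assumptions for illustration) so that the difference shows up on any current Python without involving re.L.

```python
import re

# Assumption for illustration: the pattern starts with a lowercase "p",
# so it only matches the capitalised text when re.I is really applied.
pattern = r'(?<!\*)(poplatn[ií]\w+ da[nň]\w+)'
text = 'Poplatníkem daně z pozemků je vlastník pozemku'
mark = r'**\1**'  # wrap 1st matching group in double stars
flags = re.I

# What re.sub(pattern, mark, text, flags) actually means: the fourth
# positional slot is count, so the flags value is used as a replacement
# limit and the pattern is matched with no flags.
misplaced = re.sub(pattern, mark, text, count=int(flags))

# Passing the flags by keyword applies them as intended.
by_keyword = re.sub(pattern, mark, text, flags=flags)

# The compiled pattern has the flags baked in, so it agrees with by_keyword.
compiled_result = re.compile(pattern, flags).sub(mark, text)

print(misplaced)
print(by_keyword)
print(compiled_result)
```

With the flags passed by keyword, re.sub() and the compiled pattern's sub() give the same result.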


There's a recent-ish change (somewhere between 3.4.1 and 3.4.3, and between 2.7.3 and 2.7.8) that affects this. Before that change, you couldn't even compile that pattern without raising an OverflowError.

More importantly, why are you using re.L? The re.L mechanism does not mean "use the Unicode rules for my locale", it means "use some unspecified non-Unicode rules that only really make sense for Latin-1-derived locales and may not work right on Windows". Or, as the docs put it:

Make \w, \W, \b, \B, \s and \S dependent on the current locale. The use of this flag is discouraged as the locale mechanism is very unreliable, and it only handles one "culture" at a time anyway; you should use Unicode matching instead, which is the default in Python 3 for Unicode (str) patterns.

See bug #22407 and the linked python-dev thread for some recent discussion of this.
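To illustrate the point in the docs quote, here is a small sketch showing that str patterns in Python 3 already match accented Czech letters with \w and no locale flag. The re.ASCII comparison is my addition, included only to make the contrast visible.

```python
import re

text = 'Poplatníkem daně z pozemků'

# Default for str patterns in Python 3: Unicode matching,
# so \w covers accented letters such as í, ě and ů.
unicode_words = re.findall(r'\w+', text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the words
# break apart at every accented character.
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```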

And if I remove the re.L flag, the code now compiles just fine on 3.4.1. (I also get the "right" results on both 3.4.1 and 3.4.3, but that's just a coincidence; I'm now intentionally not passing the screwy flag and screwing it up in the first version, and still accidentally not passing the screwy flag and screwing it up in the second, so they match…)

So, even if this were a bug, there's a good chance it would be closed WONTFIX. The resolution for #22407 was to deprecate re.L for non-bytes patterns in 3.5 and remove it in 3.6, so I doubt anyone's going to care about fixing bugs with it now. (Not to mention that re itself is theoretically going away in favor of regex one of these decades… and IIRC, regex also deprecated the L flag unless you're using a bytes pattern and re-compatible mode.)
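As a rough check of where that resolution ended up, the sketch below assumes Python 3.6 or later, where re.L is rejected for str patterns with a ValueError and still accepted for bytes patterns. The helper name compiles_with_locale is hypothetical, introduced only for this example.

```python
import re

def compiles_with_locale(pattern):
    """Return True if the pattern compiles with re.L, False on ValueError."""
    try:
        re.compile(pattern, re.L)
    except ValueError:
        # Python 3.6+ refuses the LOCALE flag on str patterns.
        return False
    return True

str_ok = compiles_with_locale(r'\w+')     # str pattern
bytes_ok = compiles_with_locale(rb'\w+')  # bytes pattern

print('str pattern with re.L compiles:', str_ok)
print('bytes pattern with re.L compiles:', bytes_ok)
```

On older interpreters such as the 3.4 releases discussed here, the str case may still compile, which is why the result is printed rather than assumed.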
