Regex [A-Z] Does Not Recognize Local Characters


Problem description

I've checked other questions and read their solutions, but they do not work for me. I've tested the regular expression, and it works on non-locale characters. The code simply finds any capital letter in a string and applies some processing to it. For example, minikŞeker bir kedi should match kŞe, but my code does not recognize Ş as a letter within [A-Z]. When I try re.LOCALE, as some people suggest, I get ValueError: cannot use LOCALE flag with a str pattern. With re.UNICODE, as in the code below, Ş is still not matched:

import re
corp = "minikŞeker bir kedi"
pattern = re.compile(r"([\w]{1})()([A-Z]{1})", re.U)
corp = re.sub(pattern, r"\1 \3", corp)
print(corp)

This works for minikSeker bir kedi but not for minikŞeker bir kedi, and it raises an error with re.L. The error I'm getting is ValueError: cannot use LOCALE flag with a str pattern. Searching for it turned up some Git discussions, but nothing useful.

Recommended answer

The problem is that Ş is not in the range [A-Z]. That range is the class of all characters whose code points lie between U+0041 and U+005A (inclusive). (If you were using bytes mode, it would be all bytes between 0x41 and 0x5A.) Ş is U+015E (or, e.g., the byte 0xAA, assuming latin2), which isn't in that range.
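As a quick check of the code points involved, the following sketch prints them and tests the range by hand. The variable names are illustrative only; the comparison mirrors what a str pattern's [A-Z] class does.

```python
import unicodedata

s_cedilla = "\u015E"  # Ş, LATIN CAPITAL LETTER S WITH CEDILLA

# Print the code point and Unicode name of each character of interest.
for ch in ("A", "Z", "S", s_cedilla):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# [A-Z] in a str pattern is a plain code point range check.
in_range = {ch: ord("A") <= ord(ch) <= ord("Z") for ch in ("S", s_cedilla)}
print(in_range)

# In ISO 8859-2 (latin2), the same letter is the single byte 0xAA.
latin2_byte = s_cedilla.encode("iso8859_2")
print(latin2_byte)
```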

And using a locale won't change that. As the documentation for re.LOCALE explains, all it does is:

Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale.
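The ValueError from the question can be reproduced with a short sketch like the one below, assuming Python 3.6 or later, where re.LOCALE is only accepted with bytes patterns. It also shows that, even in bytes mode with re.LOCALE, [A-Z] remains a byte range and does not match the latin2 encoding of Ş.

```python
import re

# A str pattern combined with re.LOCALE is rejected at compile time.
try:
    re.compile(r"[A-Z]", re.LOCALE)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)

# A bytes pattern is accepted, but [A-Z] is still the byte range 0x41-0x5A.
bytes_pattern = re.compile(rb"[A-Z]", re.LOCALE)
matched = bytes_pattern.search("\u015E".encode("iso8859_2"))  # Ş is byte 0xAA
print(matched)
```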

Also, you almost never want to use re.LOCALE. As the docs say:

The use of this flag is discouraged as the locale mechanism is very unreliable, it only handles one "culture" at a time, and it only works with 8-bit locales.

If you only care about a single script, you can build a class of the appropriate ranges for that script.
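For instance, if Turkish is the only language that matters, a minimal sketch could extend [A-Z] with the six extra Turkish uppercase letters. The name TR_UPPER and the simplified two-group pattern are illustrative choices, not part of the original code.

```python
import re

# Hypothetical class for Turkish: ASCII A-Z plus Ç, Ğ, İ, Ö, Ş, Ü.
TR_UPPER = "[A-ZÇĞİÖŞÜ]"

# A word character followed by an uppercase letter; insert a space between them.
pattern = re.compile(r"(\w)(" + TR_UPPER + ")")
result = pattern.sub(r"\1 \2", "minikŞeker bir kedi")
print(result)
```

This stays readable and fast, but it has to be maintained by hand for every language you want to support.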

If you want to work with all scripts, you need to build a class out of a Unicode character class like Lu for "all uppercase letters". Unfortunately, Python's re doesn't have a mechanism for doing this directly. You can build a giant class out of the information in unicodedata, but that's pretty annoying:

import unicodedata
# Character class listing every code point whose general category is Lu.
Lu = '[' + ''.join(chr(c) for c in range(0x110000)
                   if unicodedata.category(chr(c)) == 'Lu') + ']'

Then:

pattern = re.compile(r"([\w]{1})()(" + Lu + r"{1})", re.U)

…or maybe:

pattern = re.compile(rf"([\w]{{1}})()({Lu}{{1}})", re.U)
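Put together, a self-contained sketch of this standard-library approach might look like the following. The two-group pattern is a simplification of the original three-group one, and building the class scans every code point once, so it is worth doing only at import time.

```python
import re
import sys
import unicodedata

# Build a character class from every code point in category Lu.
# No Lu character is a regex metacharacter, so no escaping is needed.
Lu = "[" + "".join(chr(c) for c in range(sys.maxunicode + 1)
                   if unicodedata.category(chr(c)) == "Lu") + "]"

# A word character followed by any uppercase letter; insert a space between them.
pattern = re.compile(r"(\w)(" + Lu + ")")
result = pattern.sub(r"\1 \2", "minikŞeker bir kedi")
print(result)
```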


Part of the reason re has no way to specify Unicode classes is that, for a long time, the plan was to replace re with a new module, so many suggested new features for re were rejected. The good news is that the intended new module is available as a third-party library, regex. It works just fine and is a near drop-in replacement for re; it was just improving too quickly to lock it down to the slower Python release schedule. If you install it, you can write your code this way:

import regex  # third-party module: pip install regex
corp = "minikŞeker bir kedi"
# \p{Lu} matches any uppercase letter in any script
pattern = regex.compile(r"([\w]{1})()(\p{Lu}{1})", regex.U)
corp = regex.sub(pattern, r"\1 \3", corp)
print(corp)

The only changes I made were to replace re with regex and to use \p{Lu} instead of [A-Z].

There are, of course, lots of other regex engines out there, and many of them also support Unicode character classes. Most of those that do follow some variation on the same \p syntax. (They all copied it from Perl, but the details differ: for example, regex's idea of Unicode classes comes from the unicodedata module, while PCRE and PCRE2 attempt to be as close to Perl as possible.)
