Umlauts in regexp matching (via locale?)

Question

I'm surprised that I'm not able to match a German umlaut in a regexp. I tried several approaches, most involving setting locales, but up to now to no avail.

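import locale
import re

# Assumes Python 2 (byte-string literals and the u'' prefix); each call tries a different form of "ü".
locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
re.findall(r'\w+', 'abc def g\xfci jkl', re.L)      # single Latin-1 byte
re.findall(r'\w+', 'abc def g\xc3\xbci jkl', re.L)  # two UTF-8 bytes
re.findall(r'\w+', 'abc def güi jkl', re.L)         # literal ü in a byte string
re.findall(r'\w+', u'abc def güi jkl', re.L)        # unicode string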

None of these versions matches the umlaut-u (ü) correctly with \w+. Also removing the re.L flag or prefixing the pattern string with u (to make it unicode) did not help me.

Any ideas? How is the flag re.L used correctly?

Recommended answer

Have you tried using the re.UNICODE flag, as described in the docs?

>>> re.findall(r'\w+', 'abc def güi jkl', re.UNICODE)
['abc', 'def', 'g\xc3\xbci', 'jkl']
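
For comparison, here is a minimal sketch of the same match in Python 3. This is an assumption on my part, since the session above looks like Python 2. In Python 3, str patterns use Unicode matching by default, so re.UNICODE is redundant there, and re.ASCII shows what happens when \w is limited to ASCII. The variable names are only illustrative.

```python
import re

text = "abc def g\u00fci jkl"  # "\u00fc" is the character ü

# Python 3: str patterns match Unicode word characters by default,
# so passing re.UNICODE changes nothing here.
default_words = re.findall(r"\w+", text)
unicode_words = re.findall(r"\w+", text, re.UNICODE)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the word is split at ü.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(default_words)  # ['abc', 'def', 'güi', 'jkl']
print(ascii_words)    # ['abc', 'def', 'g', 'i', 'jkl']
```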

A quick search points to this thread that gives some explanation:

re.LOCALE just passes the character to the underlying C library. It really only works on bytestrings which have 1 byte per character. UTF-8 encodes codepoints outside the ASCII range to multiple bytes per codepoint, and the re module will treat each of those bytes as a separate character.
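
The byte-level point in that quote can be checked with a short sketch. It assumes Python 3, where bytes and str are separate types. It shows that ü becomes two bytes in UTF-8, that a bytes pattern without any flag sees those bytes as non-word characters, and that decoding to str first keeps the word intact.

```python
import re

text = "abc def g\u00fci jkl"  # "\u00fc" is the character ü
encoded = text.encode("utf-8")

# One code point, two bytes: 0xC3 0xBC.
umlaut_bytes = "\u00fc".encode("utf-8")

# A bytes pattern (no flags) treats only ASCII letters, digits and "_"
# as \w, so the two bytes of ü break the word apart.
byte_words = re.findall(rb"\w+", encoded)

# Decoding first lets the regex see ü as a single character.
decoded_words = re.findall(r"\w+", encoded.decode("utf-8"))

print(umlaut_bytes)   # b'\xc3\xbc'
print(byte_words)     # [b'abc', b'def', b'g', b'i', b'jkl']
print(decoded_words)  # ['abc', 'def', 'güi', 'jkl']
```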
