Python unicode regular expression matching failing with some unicode characters - bug or mistake?

Question

I am attempting to use the re module in Python 2.7.3 with Unicode-encoded Devanagari text. I have added from __future__ import unicode_literals to the top of my code, so all string literals should be unicode objects.

However, I am running into some odd problems with Python's regex matching. For instance, consider this name: "किशोरी". This is a (mis-spelled) name, in Hindi, entered by one of my users. Any Hindi reader would recognise this as a word.

The following returns a match, as it should:

re.search("^[\w\s][\w\s]*","किशोरी",re.UNICODE)

But this does not:

re.search("^[\w\s][\w\s]*$","किशोरी",re.UNICODE)

Some spelunking revealed that only one character in this string, character 0915 (क), is recognised as falling within the \w character class. This is incorrect, as the Unicode Character Database file on "derived core properties" lists other characters (I have not checked all) in this string as alphabetic ones - as indeed they are.
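
One way to check how far the first pattern actually gets is to print the code points it consumed. The following is a minimal sketch written for Python 3, where string literals are already Unicode; `matched_prefix` is a hypothetical helper name, not something from the question.

```python
import re

# The name from the question, written with escapes so that the
# source file encoding does not matter.
word = "\u0915\u093f\u0936\u094b\u0930\u0940"


def matched_prefix(text):
    """Return the code points consumed by the pattern without '$', as hex strings."""
    m = re.search(r"^[\w\s][\w\s]*", text, re.UNICODE)
    return [format(ord(ch), "04X") for ch in m.group()] if m else []


# Code points consumed by the pattern that is not anchored at the end.
print(matched_prefix(word))
# Result of the pattern that must reach the end of the string.
print(re.search(r"^[\w\s][\w\s]*$", word, re.UNICODE))
```

If the first list is shorter than the six code points of the word, the first search succeeded only because it was allowed to stop early, which would also explain why adding `$` makes the second search fail.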

Is this just a bug in Python's implementation? I could get around this by manually defining all the Devanagari alphanumeric characters as a character range, but that would be painful. Or am I doing something wrong?

Recommended answer

It is a bug in the re module and it is fixed in the regex module:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import unicodedata
import re
import regex  # $ pip install regex

word = "किशोरी"


def test(re_):
    assert re_.search("^\\w+$", word, flags=re_.UNICODE)

print([unicodedata.category(cp) for cp in word])
print(" ".join(ch for ch in regex.findall("\\X", word)))
assert all(regex.match("\\w$", c) for c in ["a", "\u093f", "\u0915"])

test(regex)
test(re)  # fails with AssertionError: re does not treat the vowel signs as \w

The output shows that there are 6 codepoints in "किशोरी", but only 3 user-perceived characters (extended grapheme clusters). It would be wrong to break a word inside a character. Unicode Text Segmentation says:

Word boundaries, line boundaries, and sentence boundaries should not occur within a grapheme cluster: in other words, a grapheme cluster should be an atomic unit with respect to the process of determining these other boundaries.

The emphasis here and below is mine.
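
The difference between code points and user-perceived characters can be sketched with the standard library alone. This is only an approximation of extended grapheme clusters: it attaches every code point whose general category starts with M (a mark) to the preceding code point, which is enough for this word but is not the full UAX #29 algorithm that `regex`'s `\X` implements. `approximate_clusters` is a hypothetical helper name, and the code assumes Python 3.

```python
import unicodedata


def approximate_clusters(text):
    """Group each combining mark with the code point that precedes it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            # A mark extends the current cluster instead of starting a new one.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "\u0915\u093f\u0936\u094b\u0930\u0940"
clusters = approximate_clusters(word)
print(len(word), len(clusters))  # prints: 6 3
```

Each of the three clusters is a consonant followed by its vowel sign, so a word boundary placed between the two would fall inside a single user-perceived character.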

A word boundary \b is defined as a transition from \w to \W (or in reverse) in the docs:

Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, ...

Therefore, either all codepoints that form a single character are \w or they are all \W. With the regex module, "किशोरी" matches ^\w{6}$.
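
If installing regex is not an option, the rule above can be approximated by hand. The sketch below assumes Python 3 and lets a combining mark inherit the word or non-word status of the base character before it. `is_word_codepoint` and `is_word_text` are hypothetical helper names, and treating every general category that starts with M as a combining mark is a simplification, not the UAX #29 definition.

```python
import unicodedata


def is_word_codepoint(ch):
    # Approximates what the standard re module accepts as \w for str
    # patterns: alphanumeric characters plus the underscore.
    return ch == "_" or ch.isalnum()


def is_word_text(text):
    """Return True if text is non-empty and consists only of word characters,
    where a combining mark counts as part of the preceding base character."""
    if not text:
        return False
    previous_is_word = False
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            # A mark is acceptable only directly after an accepted character.
            if not previous_is_word:
                return False
        else:
            previous_is_word = is_word_codepoint(ch)
            if not previous_is_word:
                return False
    return True


print(is_word_text("\u0915\u093f\u0936\u094b\u0930\u0940"))
print(is_word_text("abc def"))
```

This plays the role of the `^\w+$` check without enumerating Devanagari ranges by hand, at the cost of not being usable inside a larger regular expression.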

From the docs for \w in Python 2:

If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.

In Python 3:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore.

From the regex docs:

Definition of 'word' character (issue #1693050):

The definition of a 'word' character has been expanded for Unicode. It now conforms to the Unicode specification at http://www.unicode.org/reports/tr29/. This applies to \w, \W, \b and \B.

According to unicode.org U+093F (DEVANAGARI VOWEL SIGN I) is alnum and alphabetic so regex is also correct to consider it \w even if we follow definitions that are not based on word boundaries.
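
The mismatch for U+093F can be inspected from Python itself. The sketch below assumes Python 3; `describe` is a hypothetical helper name. Note that `str.isalpha()` is documented as following the general categories Lm, Lt, Lu, Ll and Lo, which is not the same thing as the Alphabetic property in the Unicode derived core properties file.

```python
import re
import unicodedata


def describe(ch):
    """Collect what the standard library reports about a single code point."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "isalpha": ch.isalpha(),
        # Whether the standard re module accepts the code point as \w.
        "re_word": re.match(r"\w", ch) is not None,
    }


for ch in ["\u0915", "\u093f"]:
    print(format(ord(ch), "04X"), describe(ch))
```

The vowel sign has general category Mc (spacing combining mark) rather than a letter category, so any definition of \w built on `isalpha()`-style checks will treat it differently from the consonant it attaches to.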
