How can I represent this regex to not get a "bad character range" error?

Problem description

Is there a better way to do this?

$ python
Python 2.7.9 (default, Jul 16 2015, 14:54:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-55)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub(u'[\U0001d300-\U0001d356]', "", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fast/services/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/home/fast/services/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

Recommended answer

Python narrow and wide builds (Python versions below 3.3)

The error suggests that you are using a "narrow" (UCS-2) build, which supports only Unicode code points up to 65535 (0xFFFF) as a single "Unicode character" (see note 1). Characters whose code points are above 65535 are represented as surrogate pairs, which means that the Unicode string u'\U0001d300' consists of two "Unicode characters" in a narrow build.

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import sys; sys.maxunicode
65535
>>> len(u'\U0001d300')
2
>>> [hex(ord(i)) for i in u'\U0001d300']
['0xd834', '0xdf00']
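
The two code units shown above follow from the standard UTF-16 surrogate arithmetic. The following is a minimal sketch of that calculation; the function name to_surrogate_pair is made up for illustration and is not part of the Python standard library.

```python
def to_surrogate_pair(code_point):
    """Split an astral code point into its UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point: %#x" % code_point)
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


print([hex(unit) for unit in to_surrogate_pair(0x1D300)])
# ['0xd834', '0xdf00']
```

This reproduces the pair that the narrow build reports for u'\U0001d300'.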

In a "wide" (UCS-4) build, every code point up to 1114111 (0x10FFFF) is recognized as a single Unicode character, so the Unicode string u'\U0001d300' consists of exactly one "Unicode character", that is, one Unicode code point.

Python 2.6.6 (r266:84292, May  1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import sys; sys.maxunicode
1114111
>>> len(u'\U0001d300')
1
>>> [hex(ord(i)) for i in u'\U0001d300']
['0x1d300']

Note 1: I use "Unicode character" (in quotes) to refer to one character in a Python Unicode string, not one Unicode code point. The number of "Unicode characters" in a string is the len() of the string. In a "narrow" build, one "Unicode character" is a 16-bit UTF-16 code unit, so one astral character appears as two "Unicode characters". In a "wide" build, one "Unicode character" always corresponds to one Unicode code point.

The regex in the question compiles correctly in a "wide" build:

Python 2.6.6 (r266:84292, May  1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import re; re.compile(u'[\U0001d300-\U0001d356]', re.DEBUG)
in
  range (119552, 119638)
<_sre.SRE_Pattern object at 0x7f9f110386b8>

Narrow build

However, the same regex does not work in a "narrow" build, because the engine does not recognize surrogate pairs. It treats \ud834 as a single character, then tries to create a character range from \udf00 to \ud834. That range is invalid because its start (0xdf00) is greater than its end (0xd834), so compilation fails with "bad character range".

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> [hex(ord(i)) for i in u'[\U0001d300-\U0001d356]']
['0x5b', '0xd834', '0xdf00', '0x2d', '0xd834', '0xdf56', '0x5d']

The workaround is the same one used in ECMAScript: construct the regex so that it matches the surrogates that represent the code points.

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import re; re.compile(u'\ud834[\udf00-\udf56]', re.DEBUG)
literal 55348
in
  range (57088, 57174)
<_sre.SRE_Pattern object at 0x6ffffe52210>
>>> input =  u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> input
u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> re.sub(u'\ud834[\udf00-\udf56]', '', input)
u'Sample . Another . Leave alone \U00011000'
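
If the same code has to run on both kinds of build, one option is to pick the pattern at import time based on sys.maxunicode. This is only a sketch under that assumption; the names ASTRAL_RANGE and strip_astral_range are invented for this example.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build or Python 3.3+: one character per code point,
    # so the range from the question compiles as written.
    ASTRAL_RANGE = re.compile(u'[\U0001d300-\U0001d356]')
else:
    # Narrow build: match the high surrogate followed by a
    # range of low surrogates instead.
    ASTRAL_RANGE = re.compile(u'\ud834[\udf00-\udf56]')


def strip_astral_range(text):
    """Remove every character in U+1D300..U+1D356 from text."""
    return ASTRAL_RANGE.sub(u'', text)


sample = u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
cleaned = strip_astral_range(sample)
```

Only the branch that matches the running interpreter is compiled, so the pattern that would raise "bad character range" on a narrow build is never passed to re.compile there.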

Using regexpu to derive astral plane regex for Python narrow build

Since the construction that matches astral plane characters in a Python narrow build is the same as in ES5, you can use regexpu, a tool that converts ES6 RegExp literals to ES5, to do the conversion for you.

Just paste the equivalent ES6 regex (note the u flag and the \u{hh...h} syntax):

/[\u{1d300}-\u{1d356}]/u

and you get back a regex that can be used in both Python narrow builds and ES5:

/(?:\uD834[\uDF00-\uDF56])/

Remember to remove the / delimiters of the JavaScript RegExp literal when you use the regex in Python.

The tool is extremely useful when the range spreads across multiple high surrogates (U+D800 to U+DBFF). For example, if we have to match the character range

/[\u{105c0}-\u{1cb40}]/u

the equivalent regex in Python narrow builds and ES5 is

/(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])/

which is rather complex and error-prone to derive by hand.
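
For readers who would rather derive such a pattern in Python, the splitting can be sketched as below. This is not regexpu's implementation, only an illustration of the same idea; all function names are invented, and the function returns the pattern as escaped text (suitable for pasting into a u'...' literal), not a compiled regex.

```python
def _pair(code_point):
    """Return the (high, low) UTF-16 surrogates of an astral code point."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def _esc(unit):
    """Format one 16-bit code unit as a \\uXXXX escape."""
    return '\\u%04X' % unit


def _cls(first, last):
    """Format a code unit range as a single escape or a character class."""
    if first == last:
        return _esc(first)
    return '[%s-%s]' % (_esc(first), _esc(last))


def astral_range_pattern(start, end):
    """Build surrogate-pair regex text matching code points start..end."""
    if not 0x10000 <= start <= end <= 0x10FFFF:
        raise ValueError('expected 0x10000 <= start <= end <= 0x10FFFF')
    hi_s, lo_s = _pair(start)
    hi_e, lo_e = _pair(end)
    if hi_s == hi_e:
        # Whole range shares one high surrogate.
        parts = [_esc(hi_s) + _cls(lo_s, lo_e)]
    else:
        parts = []
        tail = None
        if lo_s != 0xDC00:
            # Partial block under the first high surrogate.
            parts.append(_esc(hi_s) + _cls(lo_s, 0xDFFF))
            hi_s += 1
        if lo_e != 0xDFFF:
            # Partial block under the last high surrogate.
            tail = _esc(hi_e) + _cls(0xDC00, lo_e)
            hi_e -= 1
        if hi_s <= hi_e:
            # High surrogates whose full low range is covered.
            parts.append(_cls(hi_s, hi_e) + _cls(0xDC00, 0xDFFF))
        if tail is not None:
            parts.append(tail)
    return '(?:' + '|'.join(parts) + ')'


print(astral_range_pattern(0x1D300, 0x1D356))
# (?:\uD834[\uDF00-\uDF56])
print(astral_range_pattern(0x105C0, 0x1CB40))
# (?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])
```

Both outputs agree with the regexpu results quoted above. The sketch handles a single contiguous astral range only; ranges that start in the Basic Multilingual Plane would need an extra alternative for the part below U+10000.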

Python 3.3 implements PEP 393, which eliminates the distinction between narrow and wide builds; from that version on, Python behaves like a wide build. This eliminates the problem in the question altogether.
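
As a quick check, assuming Python 3.3 or later, the pattern from the question can be used unchanged (the variable names are arbitrary):

```python
import re
import sys

# From Python 3.3 on, every interpreter reports the full Unicode range.
assert sys.maxunicode == 0x10FFFF

# An astral character is a single element of the string.
astral_length = len('\U0001d300')
print(astral_length)
# 1

text = 'Sample \U0001d340. Another \U0001d305.'
result = re.sub('[\U0001d300-\U0001d356]', '', text)
print(result)
# Sample . Another .
```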

While it is possible to work around the limitation and match astral plane characters in Python narrow builds, going forward it is best to change the execution environment to a Python wide build, or to port the code to Python 3.3 or above.

The regex code for narrow builds is hard to read and maintain for the average programmer, and it has to be completely rewritten when porting to Python 3.
