Unicode character not in range when calling locale.strxfrm


Problem description

I am seeing odd behavior when using the locale module with Unicode input. Below is a minimal working example:

>>> x = '\U0010fefd'
>>> ord(x)
1113853
>>> ord('\U0010fefd') == 0X10fefd
True
>>> ord(x) <= 0X10ffff
True
>>> import locale
>>> locale.strxfrm(x)
'\U0010fefd'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+110000 is not in range [U+0000; U+10ffff]

I have seen this on Python 3.3, 3.4 and 3.5. I do not get an error on Python 2.7.

As far as I can see, my Unicode input is within the valid Unicode range, so it seems that something internal to strxfrm is moving the input out of range when the 'en_US.UTF-8' locale is in use.

I am running Mac OS X, and this behavior may be related to http://bugs.python.org/issue23195... but I was under the impression that this bug would only manifest as incorrect results, not as a raised exception. I cannot replicate it on my SLES 11 machine, and others confirm they cannot replicate it on Ubuntu, CentOS, or Windows. It may be instructive to hear about other operating systems in the comments.

Can someone explain what may be happening here under the hood?

Solution

In Python 3.x, the function locale.strxfrm(s) internally uses the POSIX C function wcsxfrm(), which is based on the current LC_COLLATE setting. The POSIX standard defines the transformation in this way:

The transformation shall be such that if wcscmp() is applied to two transformed wide strings, it shall return a value greater than, equal to, or less than 0, corresponding to the result of wcscoll() applied to the same two original wide-character strings.

This definition can be implemented in multiple ways, and it does not even require the resulting string to be readable.
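The property in the POSIX definition can be checked from Python. The following is a minimal sketch, assuming the process is left in its default "C" locale so that it runs on any platform; the helper name `sorts_consistently` and the word list are made up for illustration.

```python
import functools
import locale


def sorts_consistently(words):
    """Return True if sorting by strxfrm keys matches sorting with strcoll."""
    # Plain comparison of the transformed keys ...
    by_key = sorted(words, key=locale.strxfrm)
    # ... should give the same order as comparing the originals with strcoll.
    by_cmp = sorted(words, key=functools.cmp_to_key(locale.strcoll))
    return by_key == by_cmp


# Hypothetical sample data, pure ASCII so it is safe in any locale.
words = ["pear", "Apple", "banana", "apple", "Zebra"]
print(sorts_consistently(words))
```

Only the relative order of the keys matters here, which is why an implementation is free to produce keys that look nothing like the input.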

I've created a little C code example to demonstrate how it works:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

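int main(void) {
  wchar_t buf[10];
  const wchar_t *in = L"\x10fefd";  /* one wide character, U+10FEFD */
  size_t n;
  int i;

  /* wcsxfrm() follows the collation rules of the LC_COLLATE category */
  setlocale(LC_COLLATE, "en_US.UTF-8");

  printf("in :");
  for (i = 0; i < 10 && in[i]; i++)
    printf(" 0x%x", (unsigned int)in[i]);
  printf("\n");

  /* returns the length of the transformed string, without the terminator */
  n = wcsxfrm(buf, in, 10);
  if (n >= 10) {
    fprintf(stderr, "transformed string does not fit in buf\n");
    return 1;
  }

  printf("out:");
  for (i = 0; i < 10 && buf[i]; i++)
    printf(" 0x%x", (unsigned int)buf[i]);
  printf("\n");
  return 0;
}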

It prints the string before and after the transformation.

Running it on Linux (Debian Jessie) gives this result:

in : 0x10fefd
out: 0x1 0x1 0x1 0x1 0x552

Running it on OSX (10.11.1) gives:

in : 0x10fefd
out: 0x103 0x1 0x110000

You can see that the output of wcsxfrm() on OSX contains the character U+110000, which is not permitted in a Python string, so this is the source of the error.
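The limit quoted in the traceback can be checked directly in Python 3. This is a small sketch using only built-ins; the helper name `is_valid_code_point` is made up for illustration.

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # upper bound quoted in the error message


def is_valid_code_point(value):
    """Return True if chr() accepts the value as a code point."""
    try:
        chr(value)
    except ValueError:
        return False
    return True


print(sys.maxunicode == MAX_CODE_POINT)
print(is_valid_code_point(0x10FEFD))  # the input character from the question
print(is_valid_code_point(0x110000))  # the value found in the OSX output
```

The input character sits just below the limit, while the value seen in the OSX output is exactly one past it.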

On Python 2.7 the error is not raised because its locale.strxfrm() implementation is based on the C function strxfrm(), which works on byte strings rather than wide characters.

UPDATE:

Investigating further, I see that the LC_COLLATE definition for en_US.UTF-8 on OSX is a symbolic link to the la_LN.US-ASCII definition.

$ ls -l /usr/share/locale/en_US.UTF-8/LC_COLLATE
lrwxr-xr-x 1 root wheel 28 Oct  1 14:24 /usr/share/locale/en_US.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE

I found the actual definition in the sources from Apple. The content of the file la_LN.US-ASCII.src is the following:

order \
    \x00;...;\xff

2nd UPDATE:

I've further tested the wcsxfrm() function on OSX. Using the la_LN.US-ASCII collation, given a sequence of wide characters C1..Cn as input, the output is a string of this form:

W1..Wn \x01 U1..Un

where

Wx = 0x103 if Cx > 0xFF else Cx+0x3
Ux = Cx+0x103 if Cx > 0xFF else Cx+0x3

Using this algorithm, \x10fefd becomes 0x103 0x1 0x110000.
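The rules above can be written out as a short Python model. This is only a sketch of the behavior observed in the tests described here, not Apple's implementation; the function name `model_osx_wcsxfrm` is made up, and the result is kept as a list of integers because a Python string cannot hold the value 0x110000.

```python
def model_osx_wcsxfrm(code_points):
    """Model of the observed output W1..Wn 0x1 U1..Un, as a list of integers."""
    # Wx = 0x103 if Cx > 0xFF else Cx + 0x3
    w_part = [0x103 if c > 0xFF else c + 0x3 for c in code_points]
    # Ux = Cx + 0x103 if Cx > 0xFF else Cx + 0x3
    u_part = [c + 0x103 if c > 0xFF else c + 0x3 for c in code_points]
    return w_part + [0x1] + u_part


result = model_osx_wcsxfrm([0x10FEFD])
print([hex(v) for v in result])
print(max(result) > 0x10FFFF)
```

Under this model, any input character above U+10FEFC ends up past the Unicode range, since adding 0x103 pushes it to 0x110000 or higher.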

I've checked, and every UTF-8 locale uses this collation on OSX, so I'm inclined to say that collation support for UTF-8 on Apple systems is broken. The resulting ordering is almost the same as the one obtained with normal byte comparison, with the bonus of being able to produce illegal Unicode characters.
