Unicode conversion issue using Python in Emacs


Problem description



I'm trying to understand the difference in a bit of Python script behavior when run on the command line vs run as part of an Emacs elisp function.

The script looks like this (I'm using Python 2.7.1 BTW):

import json; t = {"Foo":"ザ"}; print json.dumps(t).decode("unicode_escape")

that is, [in general] take a JSON segment containing unicode characters, dump it to its unicode-escaped version, then decode it back to its unicode representation. When run on the command line, the dumps part of this returns:

'{"Foo": "\\u30b6"}'

which when printed looks like:

'{"Foo": "\u30b6"}'

the decode part of this looks like:

u'{"Foo": "\u30b6"}'

which when printed looks like:

{"Foo": "ザ"}

i.e., the original string representation of the structure, at least in a terminal/console that supports unicode (in my testbed, an xterm). In a Windows console, the output is not correct with respect to the unicode character, but the script does not error out.

In Emacs, the dumps conversion is the same as on the command line (at least as far as confirming with a print), but the decode part blows out with the dreaded:

File "", line 1, in UnicodeEncodeError: 'ascii' codec can't encode character u'\u30b6' in position 9: ordinal not in range(128)`

I've a feeling I'm missing something basic here with respect to either the script or Emacs (in my testbed 23.1.1). Is there some auto-magic part of print invoking the correct codec/locale that happens at the command line but not in Emacs? I've tried explicitly setting the locale for an Emacs invocation (here's a stub test without the json logic):

"LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Fooザ\"; print s'"

produces the same exception, while

"LC_ALL=\"en_US.UTF-8\" python -c 'import sys; enc=sys.stdout.encoding; print enc' "

indicates that the encoding is 'None'.

If I attempt to coerce the conversion using:

"LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Fooザ\"; print s.encode(\"utf8\",\"replace\")'"

the error goes away, but the result is the "garbled" version of the string seen in the non-unicode console:

Fooa?¶

Any ideas?

UPDATE: thanks to unutbu -- b/c the locale identification falls down, the command needs to be explicitly decorated with the utf8-encode (see the answer for working directly with a unicode string). In my case, I am getting what is needed from the dumps/decode sequence, so I add the additional required decoration to achieve the desired result:

import json; t = {"Foo":"ザ"}; print json.dumps(t).decode("unicode_escape").encode("utf8","replace")

Note that this is the "raw" Python without the necessary escaping required by Emacs.
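
For readability, here is the same pipeline spelled out step by step as a small standalone script (a sketch equivalent to the one-liner above, assuming Python 2 and a UTF-8 encoded source file):

# -*- coding: utf-8 -*-
# Step-by-step equivalent of the dumps/decode/encode one-liner (Python 2).
import json

t = {"Foo": "ザ"}                           # dict with a UTF-8 byte-string value
dumped = json.dumps(t)                      # '{"Foo": "\\u30b6"}' -- ASCII-safe, escaped
decoded = dumped.decode("unicode_escape")   # u'{"Foo": "\u30b6"}' -- a unicode object
output = decoded.encode("utf8", "replace")  # UTF-8 bytes, printable even when sys.stdout.encoding is None
print output                                # {"Foo": "ザ"}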

As you may have guessed from looking at the original part of this question, I'm using this as part of some JSON formatting logic in Emacs -- see my answer to this question.

Solution

The Python wiki page, "PrintFails" says

When Python does not detect the desired character set of the output, it sets sys.stdout.encoding to None, and print will invoke the "ascii" codec.

It appears that when python is being run from an elisp function, it cannot detect the desired character set, so it defaults to "ascii". So trying to print unicode tacitly causes python to encode the unicode as ascii, which is the reason for the error.
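
The same fallback is easy to reproduce outside Emacs by redirecting Python's stdout to a pipe, where the desired encoding cannot be detected either; here is a small Python 2 sketch (saved as, say, sketch.py, the name being arbitrary):

import sys

# When sys.stdout.encoding is None (stdout is a pipe, or the process is
# started from Emacs), printing a unicode object falls back to the 'ascii'
# codec and raises UnicodeEncodeError.
s = u"Foo\u30b6"

print sys.stdout.encoding          # e.g. 'UTF-8' in a terminal, None when piped

try:
    print s                        # implicitly encoded with 'ascii' if no encoding was detected
except UnicodeEncodeError as e:
    print "caught:", e

print s.encode("utf8", "replace")  # explicit encoding sidesteps the ascii codec

Run directly in a UTF-8 terminal it should print Fooザ twice; run as python sketch.py | cat, the first print hits the except branch while the explicitly encoded line still comes through.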


Replacing u\"Fooザ\" with u\"Foo\\u30b6\" seems to work:

(defun mytest ()
  (interactive)
  (shell-command-on-region (point)
         (point) "LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Foo\\u30b6\"; print s.encode(\"utf8\",\"replace\")'" nil t))

C-x C-e M-x mytest

yields

Fooザ
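
A related knob worth mentioning: the PYTHONIOENCODING environment variable (available since Python 2.6) tells Python which stdio encoding to use when it cannot detect one, so a variant of the stub test that relies on it instead of an explicit .encode() might look like:

"PYTHONIOENCODING=\"utf8\" python -c 's = u\"Foo\\u30b6\"; print s'"

This has not been verified from within Emacs here, but as an ordinary environment variable it should pass through shell-command-on-region just as LC_ALL does.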
