Python 2.7 build on Sublime Text 3 doesn't print the '\uFFFD' character


Problem description

I'm using a Python 2.7 build in Sublime Text 3 and have an issue with printing.
In some cases I get confusing output for '\uFFFD', the 'REPLACEMENT CHARACTER'.

For example:

print u'\ufffd' # should be '�' - the 'REPLACEMENT CHARACTER'
print u'\u0061' # should be 'a'
-----------------------------------------------------
[Finished in 0.1s]

After reversing the order:

print u'\u0061' 
print u'\ufffd'
-----------------------------------------------------
a
�
[Finished in 0.1s]

So Sublime can print the '�' character, but for some reason doesn't do so in the first case.
The dependence of the output on the order of the statements also seems quite strange.

The problem with the replacement character leads to very unpredictable printout behavior in general.
For example, I want to print decoded bytes with error replacement:

cp1251_bytes = '\xe4\xe0' # 'да' in cp1251 
print cp1251_bytes.decode('utf-8', errors='replace')
-----------------------------------------------------
��
[Finished in 0.1s]

Let's replace the bytes:

cp1251_bytes = '\xed\xe5\xf2' # 'нет' in cp1251
print cp1251_bytes.decode('utf-8', errors='replace')
-----------------------------------------------------
[Finished in 0.1s]

And add one more print statement:

cp1251_bytes = '\xed\xe5\xf2' # 'нет' in cp1251 
print cp1251_bytes.decode('cp1251') 
print cp1251_bytes.decode('utf-8', errors='replace')
-----------------------------------------------------
нет
���
[Finished in 0.1s]
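
To separate what Python produces from what the build console renders, the decoded text can be inspected directly instead of being judged by its printed appearance. The following is a minimal sketch, not part of the original question; the helper name `count_replacements` and the variable names are hypothetical. It counts the U+FFFD characters produced by decoding the same cp1251 bytes as UTF-8 with error replacement.

```python
# -*- coding: utf-8 -*-
# Sketch: check the decoded text itself, independent of console rendering.
# The helper and variable names below are illustrative assumptions.
REPLACEMENT = u'\ufffd'


def count_replacements(raw_bytes):
    """Decode bytes as UTF-8 with error replacement and count U+FFFD chars."""
    text = raw_bytes.decode('utf-8', 'replace')
    return text.count(REPLACEMENT)


da_bytes = b'\xe4\xe0'       # 'да' in cp1251
net_bytes = b'\xed\xe5\xf2'  # 'нет' in cp1251

# Printing plain integers avoids any multi-byte output on the console.
print(count_replacements(da_bytes))
print(count_replacements(net_bytes))
```

If the counts match the number of '�' characters expected (two and three for the byte strings above), the decoding step is behaving consistently and the missing output comes from a later stage.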


Below is an illustration of some other test cases: [image not available]

Summarizing, the described printout behavior shows the following patterns:

  • it depends on whether the number of '\ufffd' chars in the print statement is even or odd
  • it depends on the order of the print statements
  • it depends on the specific build run

  • Why does this happen?
  • How can I fix the problem?


My Python 2.7 sublime-build file:

    {   
        "cmd": ["C:\\_Anaconda3\\envs\\python27\\python", "-u", "$file"],
        "file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
        "selector": "source.python",
        "env": {"PYTHONIOENCODING": "utf-8"}
    }
    

With Python 2.7 installed separately from Anaconda, the behavior is exactly the same.

Recommended answer

I've reproduced your problem and found a solution that works on my platform, at least: remove the -u flag from the cmd option in your build config.

I'm not 100% sure why that works, but it seems to be a poor interaction that results from the console interpreting an unbuffered stream of data containing multi-byte characters. Here's what I've found:

  • The -u option switches Python's output to unbuffered.
  • This problem is not at all specific to the replacement character. I've gotten similar behavior with other characters such as "あ" (U+3042).
  • Similar bad results happen with other encodings. Setting "env": {"PYTHONIOENCODING": "utf-16be"} results in print u'\u3042' outputting 0B.

That last example, with the encoding set to UTF-16BE, illustrates what I think is going on. The console receives one byte at a time because the output is unbuffered, so it receives the 0x30 byte first. The console then determines that this is not valid UTF-16BE, falls back to ASCII, and outputs 0. It of course receives the next byte (0x42) right after and follows the same logic to output B.
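
The byte values behind this explanation can be checked directly. The short sketch below is an illustration only (the variable names are hypothetical, and it does not reproduce the console's actual logic): it encodes U+3042 as UTF-16BE and then interprets each byte on its own as ASCII.

```python
# Sketch: what a consumer sees if it treats each UTF-16BE byte separately.
encoded = u'\u3042'.encode('utf-16-be')   # two bytes: 0x30 0x42

# Interpret every byte in isolation as an ASCII character.
as_ascii = [chr(b) for b in bytearray(encoded)]

print(['0x%02x' % b for b in bytearray(encoded)])
print(''.join(as_ascii))
```

Both bytes happen to fall in the printable ASCII range, which is consistent with the 0B output described above.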

With the UTF-8 encoding, the console receives bytes that can't possibly be interpreted as ASCII, so I believe the console does a slightly better job of interpreting the unbuffered stream, but it still runs into the difficulties that your question points out.
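
To make the UTF-8 case concrete, the sketch below shows two things: that every byte of the encoded replacement character is outside the ASCII range, and how a consumer could decode a byte-at-a-time stream safely with an incremental decoder from the standard `codecs` module. This is an assumption-laden illustration of the general technique, not a description of how Sublime Text's output panel is implemented; the variable names are hypothetical.

```python
import codecs

# U+FFFD encoded as UTF-8 is three bytes, all >= 0x80 (none is ASCII).
utf8_bytes = u'\ufffd'.encode('utf-8')
all_non_ascii = all(b >= 0x80 for b in bytearray(utf8_bytes))

# An incremental decoder buffers incomplete sequences instead of guessing.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = []
for b in bytearray(utf8_bytes):
    one_byte = bytes(bytearray([b]))      # a single-byte chunk
    pieces.append(decoder.decode(one_byte))

print(all_non_ascii)
print([len(p) for p in pieces])
```

The decoder returns an empty string for the first two chunks and emits the character only once the third byte arrives. A consumer that decodes each chunk independently has no such buffer, which is one plausible way for characters to go missing when output is unbuffered.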
