Explicitly illegal character sequence in Python code
Question
I have a UTF-8 input file which regularly contains an illegal character sequence. Since it only appears to be that specific sequence, I want to replace it with its proper equivalent in my Python script.
This should be simple enough, I thought:
value = value.replace('\xE2\x80\x3f', u'"'.encode('utf8'))
However, the script doesn't run; instead, it raises an error:
SyntaxError: Non-ASCII character '\xe2' in file script.py on line 10, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Is there an encoding that allows me to encode any character into a string literal, essentially telling Python to shut up and let me use whatever invalid character I want?
(Note: I am using Python 2.7)
Recommended answer
# -*- coding:utf-8 -*-
value = "What an amazing string \xE2\x80\x3f !!"
value = value.replace('\xE2\x80\x3f', u'"'.encode('utf8'))
print value
This works because the Python 2 interpreter reads the script file as ASCII by default and does not decode UTF-8 characters. Because you wrote a literal non-ASCII character into the file (i.e. the quotation mark in the replacement string), you need to tell the interpreter to read the script as a UTF-8 file rather than an ASCII file, which is what the coding declaration on the first line does.
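The coding declaration is only required when the source file itself contains non-ASCII bytes. An alternative is to keep the script pure ASCII by writing the replacement character as an escape. The following is a minimal sketch, assuming the intended character is U+201C LEFT DOUBLE QUOTATION MARK (the question does not say which quotation mark was meant); fix_quotes is a hypothetical helper name, not something from the original post.

```python
# Pure-ASCII source: no encoding declaration is needed for this file.
BROKEN = b'\xe2\x80\x3f'                  # the invalid sequence from the input file
REPLACEMENT = u'\u201c'.encode('utf-8')   # assumed target: U+201C, bytes E2 80 9C


def fix_quotes(data):
    """Replace the broken byte sequence in a byte string (hypothetical helper)."""
    return data.replace(BROKEN, REPLACEMENT)


value = fix_quotes(b'What an amazing string \xe2\x80\x3f !!')
```

Because both the search pattern and the replacement are spelled with escapes, the same lines should behave identically on Python 2.7 and Python 3.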
See also PEP 263 on source code encodings.
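To check whether that one sequence really is the only invalid one in the input, the bytes can be decoded strictly and the position of the first failure reported. This is a rough sketch rather than part of the original answer; first_invalid_offset is a hypothetical name.

```python
def first_invalid_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode('utf-8')   # strict decoding raises on malformed input
    except UnicodeDecodeError as exc:
        return exc.start       # index of the first undecodable byte
    return None


bad = first_invalid_offset(b'abc \xe2\x80\x3f')    # broken sequence at offset 4
good = first_invalid_offset(b'abc \xe2\x80\x9c')   # valid UTF-8 for U+201C
```

Running this on the file contents after the replacement gives None if nothing else is malformed.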