Weird leading characters utf-8/utf-16 encoding in Python
I have written a simplified version to demonstrate the problem. I am encoding special characters in utf-8 and UTF-16 format.
With utf-8 encoding there is no problem, but when I encode with UTF-16 I get some weird leading characters.
I tried to remove all trailing and leading characters, but the error still persists.
Sample of code:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
import chardet

def myEncode(s, pattern):
    try:
        s.strip()
        u = unicode(s, pattern)
        print chardet.detect(u.encode(pattern, 'strict'))
        return u.encode(pattern, 'strict')
    except UnicodeDecodeError as err:
        return "UnicodeDecodeError: ", err
    except Exception as err:
        return "ExceptionError: ", err

print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-8')
print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-16')
Sample of output:
{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§
{'confidence': 1.0, 'language': '', 'encoding': 'UTF-16'}
��Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§
I cannot figure out where I am going wrong. I do not want to convert the UTF-16 back to utf-8; it is important for me to keep the format as UTF-16.
Update: Thanks to @tripleee, the solution to my problem is to specify the encoding as UTF-16le or UTF-16be. Thanks again for your time and effort.
Thanks in advance for everyone's time and effort.
Answer to the problem was given by @tripleee.
Specifying utf-16le or utf-16be instead of utf-16 resolved the problem: the plain utf-16 codec writes a byte order mark (BOM) at the start of the encoded output, whereas the explicit-endianness variants do not.
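To illustrate why the explicit codec names remove the leading characters, here is a minimal sketch. It is written for Python 3 rather than the Python 2 used in the question, and the variable names are only illustrative. It compares the bytes produced by the three codecs for a short ASCII string.

```python
import codecs

text = "Test"

# The generic "utf-16" codec prepends a byte order mark (BOM) in the
# machine's native byte order; the explicit variants write no BOM.
with_bom = text.encode("utf-16")
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

# The first two bytes of the generic output are one of the two BOMs.
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(starts_with_bom)

# The explicit variants contain only the encoded characters.
print(little == b"T\x00e\x00s\x00t\x00")
print(big == b"\x00T\x00e\x00s\x00t")

# The BOM accounts for exactly two extra bytes.
print(len(with_bom) - len(big))
```

When those two BOM bytes are printed to a terminal that expects UTF-8, they are not valid text, which is consistent with the two replacement characters shown in the question's output.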
Sample of solution:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
import chardet

def myEncode(s, pattern):
    try:
        s.strip()
        u = unicode(s, pattern)
        print chardet.detect(u.encode(pattern, 'strict'))
        return u.encode(pattern, 'strict')
    except UnicodeDecodeError as err:
        return "UnicodeDecodeError: ", err
    except Exception as err:
        return "ExceptionError: ", err

print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-8')
print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-16be')
Sample of output:
{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§
{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§
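Note that chardet still reports utf-8 for the second call in the output above. One possible reading, assuming the string literal reaches myEncode as UTF-8 bytes (as the coding declaration suggests), is that decoding and re-encoding with the same codec simply hands back the input bytes. The sketch below, in Python 3, shows that effect. It also shows a hypothetical helper, `transcode`, which is not part of the original post and which decodes with the source encoding before encoding to the target.

```python
def transcode(raw, source_encoding, target_encoding):
    """Decode bytes using the encoding they are actually in, then re-encode.

    Hypothetical helper for illustration; not from the original post.
    """
    return raw.decode(source_encoding).encode(target_encoding)


# Decoding and encoding with the same codec is a round trip: the bytes
# that come out are the bytes that went in, so nothing is converted.
same_codec = b"Test".decode("utf-16-be").encode("utf-16-be")
print(same_codec == b"Test")

# Decoding as UTF-8 first and then encoding as UTF-16BE converts the data.
raw = "Test € ÄÖÜ".encode("utf-8")
converted = transcode(raw, "utf-8", "utf-16-be")
print(converted[:8] == b"\x00T\x00e\x00s\x00t")
print(converted.decode("utf-16-be") == "Test € ÄÖÜ")
```

If real UTF-16 output without a BOM is the goal, passing the source and target encodings separately in this way is one option to consider.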