Weird leading characters utf-8/utf-16 encoding in Python


Problem description


I have written a simplified version to demonstrate the problem. I am encoding special characters in utf-8 and UTF-16 format.

With utf-8 encoding there is no problem, but when I encode with UTF-16 I get some weird leading characters.

I tried to remove all trailing and leading characters, but the error still persists.

Sample of code:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-

import chardet


def myEncode(s, pattern):
    try:
        s.strip()
        u = unicode(s, pattern)
        print chardet.detect(u.encode(pattern, 'strict'))
        return u.encode(pattern, 'strict')
    except UnicodeDecodeError as err:
        return "UnicodeDecodeError: ", err
    except Exception as err:
        return "ExceptionError: ", err

print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-8')
print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-16')

Sample of output:

{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§
{'confidence': 1.0, 'language': '', 'encoding': 'UTF-16'}
��Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§

I cannot figure out where I am going wrong. I do not want to convert the UTF-16 back to utf-8; it is important for me to keep the format as UTF-16.

Update: Thanks to @tripleee, the solution to my problem is to specify the encoding as UTF-16le or UTF-16be. Thanks again for your time and effort.

Thanks in advance for everyone's time and effort.

Solution

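The answer to the problem was given by @tripleee.

Specifying utf-16le or utf-16be instead of utf-16 resolved the problem. Python's plain utf-16 codec prepends a byte order mark (BOM) to the encoded output, which is what shows up as the two odd leading characters; the endian-specific codecs do not write a BOM.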
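
To illustrate why the endian-specific codec names make the leading characters disappear, here is a minimal sketch. It is written for Python 3 (the post itself uses Python 2), and the sample string `text` is only an illustrative value, not taken from the post. It relies on the fact that Python's generic `utf-16` codec writes a byte order mark (BOM) in front of the encoded data, while `utf-16-le` and `utf-16-be` write none.

```python
import codecs

text = u"Test \u20ac"  # illustrative sample string: "Test" plus a euro sign

with_bom = text.encode("utf-16")   # generic codec: BOM, then native byte order
little = text.encode("utf-16-le")  # explicit little-endian: no BOM
big = text.encode("utf-16-be")     # explicit big-endian: no BOM

# The generic codec output begins with the native-order BOM (2 bytes).
print(with_bom.startswith(codecs.BOM_UTF16))

# The endian-specific outputs are exactly 2 bytes shorter: no BOM was written.
print(len(with_bom) - len(little), len(with_bom) - len(big))

# A UTF-16 decoder consumes the BOM, so it only looks "weird" when the raw
# bytes are printed or interpreted as some other encoding.
print(with_bom.decode("utf-16") == text)
```

If a consumer of the data expects a BOM, the generic `utf-16` codec is the right choice; if it expects bare UTF-16 code units in a known byte order, the explicit `-le` or `-be` variant avoids the extra two bytes.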

Sample of solution:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-

import chardet


def myEncode(s, pattern):
    try:
        s.strip()
        u = unicode(s, pattern)
        print chardet.detect(u.encode(pattern, 'strict'))
        return u.encode(pattern, 'strict')
    except UnicodeDecodeError as err:
        return "UnicodeDecodeError: ", err
    except Exception as err:
        return "ExceptionError: ", err

print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-8')
print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-16be')

Sample of output:

{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§
{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§
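
One detail in the sample output above is that chardet still reports utf-8 for the utf-16be call. A plausible explanation, offered here as an assumption rather than something stated in the answer, is that the input `s` is already a UTF-8 byte string, so decoding it with a UTF-16 codec and re-encoding with the same codec simply returns the original bytes. The Python 3 sketch below (the post uses Python 2, and `raw` is an illustrative value) contrasts that round trip with decoding from the encoding the bytes are actually in before encoding to UTF-16.

```python
raw = u"Test".encode("utf-8")  # illustrative bytes, as they would come from a UTF-8 source

# Decoding and re-encoding with the same codec is a round trip:
# the bytes come back unchanged, so they are still UTF-8 data.
round_trip = raw.decode("utf-16-be").encode("utf-16-be")
print(round_trip == raw)

# Decode with the real source encoding first, then encode to UTF-16.
converted = raw.decode("utf-8").encode("utf-16-be")
print(converted)
```

In this sketch only `converted` holds UTF-16 data (two bytes per character for this sample); `round_trip` is byte-for-byte identical to the UTF-8 input.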
