Weird leading characters utf-8/utf-16 encoding in Python

Views: 114  Published: 2020/10/1 0:24:57  Tags: python unicode encoding character-encoding

Problem description

I have written a simplified version to demonstrate the problem. I am encoding a string of special characters as both UTF-8 and UTF-16.

With UTF-8 there is no problem, but when I encode with UTF-16 I get some weird leading characters.

I tried stripping all leading and trailing characters, but the problem persists.

Sample of code:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-

import chardet


def myEncode(s, pattern):
    try:
        s.strip()
        u = unicode(s, pattern)
        print chardet.detect(u.encode(pattern, 'strict'))
        return u.encode(pattern, 'strict')
    except UnicodeDecodeError as err:
        return "UnicodeDecodeError: ", err
    except Exception as err:
        return "ExceptionError: ", err

print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-8')
print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-16')

Sample of output:

{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§
{'confidence': 1.0, 'language': '', 'encoding': 'UTF-16'}
��Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§

I cannot figure out where I am going wrong. I do not want to convert the UTF-16 back to UTF-8; it is important for me to keep the format as UTF-16.

Update: Thanks to @tripleee, the solution to my problem is to specify the encoding as UTF-16le or UTF-16be. Thanks again for your time and effort.

Thanks in advance for everyone's time and effort.

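Solution

The answer to the problem was given by @tripleee.

Specifying utf-16le or utf-16be instead of utf-16 resolved the problem. The generic utf-16 codec prepends a byte order mark (BOM) when encoding, which shows up as the two weird leading characters; the endian-specific codecs do not write one.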
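As a rough illustration of why the endian-specific codecs help, the sketch below compares the bytes produced by the three codecs. It is written for Python 3 (the sample in this post targets Python 2), and the string `text` is an assumption chosen only for illustration. Python's generic `utf-16` codec writes a byte order mark before the data, while `utf-16-le` and `utf-16-be` do not.

```python
import codecs

text = u"Test € ÄÖÜ"  # illustrative sample string, not the one from the question

generic = text.encode("utf-16")    # BOM followed by data in the platform's native byte order
little = text.encode("utf-16-le")  # little-endian, no BOM
big = text.encode("utf-16-be")     # big-endian, no BOM

# codecs.BOM_UTF16 is the BOM for the platform's native byte order.
has_bom = generic.startswith(codecs.BOM_UTF16)
print(has_bom)

# The first two bytes of each result: only the generic codec starts with a BOM.
print(generic[:2], little[:2], big[:2])

# The BOM accounts for the two extra bytes in the generic result.
print(len(generic) - len(little))
```

If the BOM is unwanted, choosing `utf-16-le` or `utf-16-be` explicitly avoids it at the source instead of stripping bytes afterwards.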

Sample of solution:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-

import chardet


def myEncode(s, pattern):
    try:
        s.strip()
        u = unicode(s, pattern)
        print chardet.detect(u.encode(pattern, 'strict'))
        return u.encode(pattern, 'strict')
    except UnicodeDecodeError as err:
        return "UnicodeDecodeError: ", err
    except Exception as err:
        return "ExceptionError: ", err

print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-8')
print myEncode(r"""Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§""",
               'utf-16be')

Sample of output:

{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§
{'confidence': 0.99, 'language': '', 'encoding': 'utf-8'}
Test !"#$%&'()*+-,./:;<=>?@[\]?_{@}~& € ÄÖÜ äöüß £¥§
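One thing worth noting about the output above: `chardet` reports `utf-8` for the second call as well, which suggests the returned bytes may still be the original UTF-8 data, since decoding and re-encoding with the same codec simply round-trips the input. If the goal is to turn UTF-8 input into real UTF-16 bytes, a minimal sketch (Python 3; the variable names and the sample string are assumptions for illustration) is to decode with the encoding the bytes actually have, then encode with the BOM-free target codec:

```python
text = u"Test € ÄÖÜ!"       # illustrative sample string
raw = text.encode("utf-8")  # stand-in for a UTF-8 byte string obtained elsewhere

# Decoding and re-encoding with the same codec hands back the input bytes unchanged
# (this only works here because the sample has an even number of bytes).
round_trip = raw.decode("utf-16-be").encode("utf-16-be")
print(round_trip == raw)

# Decode with the source encoding, then encode with the BOM-free target encoding.
utf16 = raw.decode("utf-8").encode("utf-16-be")
print(utf16.decode("utf-16-be") == text)
print(len(raw), len(utf16))
```

The second conversion changes the byte sequence, so a detector run on `utf16` would no longer be looking at the UTF-8 input.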
