UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 34: invalid continuation byte


Question

I want to open some Persian-language text files in Python with the code below:

 import codecs
 lines = []  # collects the decoded lines
 for line in codecs.open('0001.txt', encoding='UTF-8'):
     lines.append(line)

But it gives me this error:

Traceback (most recent call last):
  File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 1596, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 974, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/usr/lib/pycharm-community/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/nlpuser/Documents/ms/Work/General_Dataset_creator/BijanKhanReader.py", line 24, in <module>
    for lin in codecs.open('corpuses/markaz/0001.txt',encoding='UTF-8'):
  File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 713, in __next__
    return next(self.reader)
  File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 644, in __next__
    line = self.readline()
  File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 557, in readline
    data = self.read(readsize, firstline=True)
  File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 503, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 0: invalid continuation byte

What is wrong with this code?

The output of the file command is:


0001.txt: Non-ISO extended-ASCII text, with CRLF line terminators


Recommended answer

UTF-8 has a very specific format: each character is encoded as a sequence of one to four bytes.

If a character is single-byte, it is represented by a byte in the range 0x00 to 0x7F. If it takes two or more bytes, the lead byte is in the range 0xC2 to 0xF4 and is followed by one to three continuation bytes, each in the range 0x80 to 0xBF.
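The byte ranges above can be written down as a small classifier. This is only an illustrative sketch: the function name utf8_byte_role and its labels are made up here, and it looks at each byte in isolation rather than validating whole sequences.

```python
def utf8_byte_role(b: int) -> str:
    """Classify one byte value by the role it can play in UTF-8 (hypothetical helper)."""
    if 0x00 <= b <= 0x7F:
        return "single"        # a complete one-byte character (ASCII)
    if 0x80 <= b <= 0xBF:
        return "continuation"  # may only appear after a lead byte
    if 0xC2 <= b <= 0xDF:
        return "lead-2"        # starts a two-byte sequence
    if 0xE0 <= b <= 0xEF:
        return "lead-3"        # starts a three-byte sequence
    if 0xF0 <= b <= 0xF4:
        return "lead-4"        # starts a four-byte sequence
    return "invalid"           # 0xC0, 0xC1 and 0xF5-0xFF never occur in UTF-8


# The Arabic/Persian letter meem (U+0645) is two bytes in UTF-8.
roles = [utf8_byte_role(b) for b in "\u0645".encode("utf-8")]
print(roles)

# The byte from the error message is a lead byte, not a continuation byte.
print(utf8_byte_role(0xE3))
```

Run on 0xE3, the classifier reports a three-byte lead, so a UTF-8 decoder expects two continuation bytes right after it.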

In your case, Python hit the byte 0xE3 while decoding. 0xE3 is a legal lead byte for a three-byte sequence, so the decoder expected continuation bytes after it, but what follows is not in the range 0x80 to 0xBF; that is what "invalid continuation byte" means. The problem is likely in your text file, not in your program: the file is either badly encoded or saved in an encoding other than UTF-8.
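A small reproduction may help here. The two sample bytes below are made up, and Windows-1256 (Python codec name cp1256) is only a guess at what a legacy Persian file might use; nothing in the question confirms the real encoding of 0001.txt. CPython reports the byte at which the broken sequence starts, which is why 0xE3 appears in the message.

```python
# Hypothetical sample: two bytes that are valid in cp1256 but not in UTF-8.
raw = b"\xe3\xe4"

try:
    raw.decode("utf-8")
    failure = None
except UnicodeDecodeError as exc:
    # exc.start is the offset of the byte where the invalid sequence begins.
    failure = (exc.start, raw[exc.start], exc.reason)

print(failure)

# Decoding with the assumed legacy encoding succeeds and yields Persian letters.
text = raw.decode("cp1256")
print(text)
```

If a guessed encoding produces readable Persian text, the same name can be passed as the encoding argument when opening the file.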

Use hexdump -C <file> or xxd <file> to check the exact sequence of bytes you have, and file <file> to try to guess the encoding; with that output we might be able to say more.
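If those tools are not at hand, a rough equivalent can be sketched in Python by reading the file in binary mode and printing the bytes around the offset named in the error. The helper hex_window and the sample bytes are illustrative assumptions, not part of the original question. Note that the offset in the message is relative to the chunk being decoded, so it may not equal the offset within the whole file.

```python
def hex_window(data: bytes, pos: int, radius: int = 8) -> str:
    """Return the bytes around offset pos as space-separated hex (hypothetical helper)."""
    start = max(0, pos - radius)
    chunk = data[start:pos + radius + 1]
    return " ".join(f"{b:02x}" for b in chunk)


# Made-up sample; for a real file use: data = open('0001.txt', 'rb').read()
data = b"ab\xe3\xe4cd"
print(hex_window(data, 2, radius=1))
```

Bytes above 0x7F that do not form lead and continuation pairs suggest a single-byte legacy encoding rather than UTF-8.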
