Python 3.5 UnicodeDecodeError for a file in utf-8 (language is 'ang', Old English)

Question

This is my first time using StackOverflow to ask a question, but you've collectively saved so many of my projects over the years that I feel at home already.

I'm using Python3.5 and nltk to parse the Complete Corpus of Old English, which was published to me as 77 text files and an XML doc that designates the file sequence as contiguous segments of a TEI-formatted corpus. Here's the relevant part of the header from the XML doc showing that we are, in fact, working with TEI:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader type="ISBD-ER">
    <fileDesc>

Right, so as a test, I'm just trying to use NLTK's MTECorpusReader to open the corpus and use the words() method to prove that I'm able to open it. I'm doing all of this from the interactive Python shell, just for ease of testing. Here's all I'm really doing:

# import the reader method    
import nltk.corpus.reader as reader

# open the sequence of files and the XML doc with the MTECorpusReader    
oecorpus = reader.mte.MTECorpusReader('/Users/me/Documents/0163','.*')

# print the first few words in the corpus to the interactive shell
oecorpus.words()

When I try that, I get the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/util.py", line 765, in __repr__
    for elt in self:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 397, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/mte.py", line 25, in read_block
    return list(filter(lambda x: x is not None, XMLCorpusView.read_block(self, stream, tagspec, elt_handler)))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 307, in read_block
    xml_fragment = self._read_xml_fragment(stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 252, in _read_xml_fragment
    xml_block = stream.read(self._BLOCK_SIZE)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1097, in read
    chars = self._read(size)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1367, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1398, in _incr_decode
    return self.decode(bytes, 'strict')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 59: invalid start byte

So, as I'm a valiant StackOverflowsketeer, I've determined that either one or more of the files is corrupted, or the file(s) contain some character that Python's utf-8 decoder doesn't know how to handle. I can be fairly certain of this file's integrity (take my word for it), so I'm pursuing the latter explanation.

I tried the following to reformat the 77 text files with no apparent effect:

import os

# list the files first, then open each one (nothing is read or written back)
loglist = [name for name in os.listdir('.') if os.path.isfile(name)]
for file in loglist:
    bufferfile = open(file, encoding='utf-8', errors='replace')
    bufferfile.close()

So my questions are:

1) Does my approach so far make sense, or have I screwed something up in my troubleshooting so far?

2) Is it fair to conclude at this point that the issue must be with the XML doc, based on the fact that the UTF-8 error shows up very early (at hex position 59) and the fact that my utf-8 error replacement script made no difference to the problem? If I'm wrong to assume that, then how can I better isolate the issue?

3) If we can conclude that the issue is with the XML doc, what's the best way to clear it up? Is it feasible for me to try to find that hex byte and the ASCII it corresponds to and change the character?

Thanks in advance for your help!

Recommended answer

Your conversion technique didn't work because you only opened and closed each file; you never read it and wrote it back out again.

0x80 is not a valid start byte in UTF-8 (it can only appear as a continuation byte), and it is not a printable character in any iso-8859-* character set. It is valid in Windows codepages, but only Unicode can support Old English characters, so you have some very broken data.
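To see which files actually contain bytes like this, and where, something along these lines may help. It is only a sketch: the function name `find_bad_bytes` is made up for illustration, and it assumes each file is small enough to read into memory in one go.

```python
def find_bad_bytes(path):
    """Return a list of (offset, byte_value) for bytes that break strict UTF-8 decoding."""
    with open(path, "rb") as handle:
        data = handle.read()

    problems = []
    pos = 0
    while pos < len(data):
        try:
            # strict decoding raises at the first invalid sequence
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice, so add pos back
            problems.append((pos + err.start, data[pos + err.start]))
            pos += err.end
    return problems
```

Running it over every file in the corpus directory (including the XML doc) and printing the non-empty results should show whether the problem is confined to one file or spread across many.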

To convert a UTF-8 file that contains bad bytes, do:

# errors='ignore' silently drops any bytes that are not valid UTF-8
with open('input.txt', 'r', encoding='utf-8', errors='ignore') as infile, \
        open('output.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(infile.read())
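Since `errors='ignore'` drops the bad bytes without a trace, a variant using `errors='replace'` may be easier to audit: each undecodable byte becomes U+FFFD, which can then be counted. This is a sketch rather than a tested fix for this corpus; `rewrite_as_utf8` is a hypothetical helper name, and the count will be off if the source text already contains U+FFFD legitimately.

```python
def rewrite_as_utf8(src_path, dst_path):
    """Copy src to dst as clean UTF-8 and return how many replacement characters were written."""
    # undecodable bytes are turned into U+FFFD instead of being dropped
    with open(src_path, "r", encoding="utf-8", errors="replace") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    return text.count("\ufffd")
```

A non-zero return value tells you how much of that file was damaged, which is worth knowing before trusting any word counts from it.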

If you don't care about losing data, you may get away with using the encoding argument on MTECorpusReader:

oecorpus = reader.mte.MTECorpusReader('/Users/me/Documents/0163','.*', encoding='cp1252')

which will make 0x80 a Euro (€) symbol.
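The trade-off can be checked in the interactive shell using only the built-in codecs. The snippet below is an illustration, not something taken from the corpus: it shows 0x80 decoding cleanly as cp1252, and also what happens to a genuinely UTF-8-encoded Old English letter when it is read as cp1252.

```python
raw = b"\x80"

# cp1252 maps 0x80 to the Euro sign, so decoding succeeds
as_cp1252 = raw.decode("cp1252")

# strict UTF-8 rejects a lone 0x80, as in the traceback
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# a thorn that really was encoded as UTF-8 is two bytes (0xC3 0xBE);
# read as cp1252, each byte becomes its own character
thorn_utf8 = "þ".encode("utf-8")
misread = thorn_utf8.decode("cp1252")

print(repr(as_cp1252), utf8_ok, repr(misread))
```

So the cp1252 route stops the exception, but any characters in the files that are valid multi-byte UTF-8 will come out mangled.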
