UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

Problem description

I'm trying to open a series of HTML files in order to get the text from the body of each file using BeautifulSoup. I have about 435 files to run through, but I keep getting this error.

I've tried converting the HTML files to text and opening the text files, but I get the same error.

import os

path = "./Bitcoin"
for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()

I want to get the source of each HTML file so I can parse it with BeautifulSoup, but I get this error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-133-f32d00599677> in <module>
      3 for file in os.listdir(path):
      4     with open(os.path.join(path, file), "r") as fname:
----> 5         txt = fname.read()

~/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

Recommended answer

There are various approaches to dealing with text data in an unknown encoding. However, in this case, as you intend to pass the data to Beautiful Soup, the solution is simple: don't bother trying to decode the file yourself, let Beautiful Soup do it. Beautiful Soup will automatically decode bytes to Unicode.

In your current code, you read the file in text mode. Unless you pass an encoding argument to the open function, Python decodes the file using the platform's default encoding (the value of locale.getpreferredencoding(False)), which is UTF-8 on your system, as the traceback shows. This causes an error if the file's contents are not valid UTF-8.

for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()
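
The failure can be reproduced in isolation. The following is a minimal sketch, assuming a throwaway temporary file with made-up contents; the sample bytes and the variable names are illustrative and do not come from the question's files.

```python
import os
import tempfile

# Hypothetical sample: an HTML snippet containing byte 0x92,
# which is a right single quotation mark in cp1252.
data = b"<p>Bitcoin\x92s price</p>"

with tempfile.NamedTemporaryFile(delete=False, suffix=".html") as tmp:
    tmp.write(data)
    sample_path = tmp.name

try:
    # Text mode with UTF-8 fails: 0x92 cannot start a UTF-8 sequence.
    try:
        with open(sample_path, "r", encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError as exc:
        utf8_failed = True
        bad_byte = exc.object[exc.start]   # the offending byte value
        bad_position = exc.start           # its offset in the decoded chunk

    # Text mode with an encoding that matches the bytes succeeds.
    with open(sample_path, "r", encoding="cp1252") as f:
        text = f.read()
finally:
    os.remove(sample_path)
```

Passing an explicit encoding only helps when every file is known to share that encoding, which is why reading bytes and leaving detection to the parser is the simpler route for a mixed batch of files.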

Instead, read the HTML files in binary mode and pass the resulting bytes instance to Beautiful Soup.

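from bs4 import BeautifulSoup

for file in os.listdir(path):
    with open(os.path.join(path, file), "rb") as fname:
        bytes_ = fname.read()  # raw bytes; nothing is decoded here
    # Parse inside the loop so each file is handled, not just the last one.
    soup = BeautifulSoup(bytes_, "html.parser")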

FWIW, the file currently causing your problem is probably encoded with cp1252 or a similar Windows 8-bit encoding.

>>> '’'.encode('cp1252')
b'\x92'
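
If the text is needed outside Beautiful Soup, one of the "various approaches" mentioned above is to try a short list of candidate encodings in order. This is only a sketch: decode_with_fallback is a hypothetical helper, and the candidate list (UTF-8 first, then cp1252) is an assumption based on the guess above, not something the files confirm.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: return (text, encoding) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8, replacing undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacements)"


# 0x92 is 0b10010010, a UTF-8 continuation byte, so it cannot start a character.
is_continuation_byte = (0x92 & 0xC0) == 0x80

text, used = decode_with_fallback(b"don\x92t")
plain_text, plain_used = decode_with_fallback("caf\u00e9".encode("utf-8"))
```

The order matters: cp1252 accepts almost any byte sequence, so it should come after stricter encodings such as UTF-8, and a successful decode is not proof that the guess was right.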
