UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte
Question
I'm trying to open a series of HTML files in order to get the text from the body of those files using BeautifulSoup. I have about 435 files that I want to process, but I keep getting this error.
I've tried converting the HTML files to text and opening the text files but I get the same error...
import os

path = "./Bitcoin"
for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()
I want to get the source code of each HTML file so I can parse it using BeautifulSoup, but I get this error:
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-133-f32d00599677> in <module>
      3 for file in os.listdir(path):
      4     with open(os.path.join(path, file), "r") as fname:
----> 5         txt = fname.read()

~/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte
Recommended answer
There are various approaches to dealing with text data in an unknown encoding. In this case, however, since you intend to pass the data to Beautiful Soup, the solution is simple: don't bother trying to decode the file yourself, let Beautiful Soup do it. Beautiful Soup will automatically decode bytes to Unicode.
In your current code, you read the file in text mode, which means that Python decodes it with the platform's default encoding (UTF-8 on your system, as the traceback shows) unless you pass an encoding argument to the open function. This raises an error if the file's contents are not valid UTF-8.
for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()
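To see the failure in isolation, here is a minimal sketch that writes a small file containing the byte 0x92 and then reads it back as UTF-8. The file contents and the variable names (raw, tmp_path, default_encoding) are made up for illustration and are not taken from the original files.

```python
import locale
import os
import tempfile

# Hypothetical sample: bytes as a Windows-1252 editor might save them.
# 0x92 is a curly apostrophe in cp1252 but is not a valid UTF-8 start byte.
raw = b"<p>Bitcoin\x92s price</p>"

with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(raw)
    tmp_path = tmp.name

# The encoding that open() falls back to in text mode when none is given.
# This is platform dependent.
default_encoding = locale.getpreferredencoding(False)

error = None
try:
    # Force UTF-8 so the sketch behaves the same on every platform.
    with open(tmp_path, "r", encoding="utf-8") as fh:
        fh.read()
except UnicodeDecodeError as exc:
    error = exc
finally:
    os.remove(tmp_path)

print("default text-mode encoding:", default_encoding)
print("error:", error)
```

The exception carries the offending bytes and offset (error.object and error.start), which can help when tracking down which character in a file is causing trouble.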
Instead, read the HTML files in binary mode and pass the resulting bytes instance to Beautiful Soup.
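import os
from bs4 import BeautifulSoup

for file in os.listdir(path):
    with open(os.path.join(path, file), "rb") as fname:
        bytes_ = fname.read()
    # Naming the parser explicitly avoids the "no parser specified" warning.
    soup = BeautifulSoup(bytes_, "html.parser")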
FWIW, the file currently causing your problem is probably encoded with cp1252 or a similar Windows 8-bit encoding.
>>> '’'.encode('cp1252')
b'\x92'
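If you ever need to decode such files yourself rather than handing the bytes to Beautiful Soup, one of the "various approaches" is to try a short list of candidate encodings in order. The sketch below is only an illustration: the helper name decode_with_fallback and the choice of candidates (UTF-8, then cp1252, then latin-1) are assumptions, not something the original answer prescribes.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never fails,
    # although the resulting text may not be what the author intended.
    return data.decode("latin-1"), "latin-1"


# Hypothetical sample containing the byte from the error message.
text, used = decode_with_fallback(b"Bitcoin\x92s price")
print(used, text)
```

The order of the candidates matters. UTF-8 is strict, so it is a reasonable first guess, whereas 8-bit encodings accept almost any byte sequence and can silently produce the wrong characters.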