"illegal multibyte sequence" error from BeautifulSoup when using Python 3
Problem description
I have an .html file saved to local disk, and I am using BeautifulSoup (bs4) to parse it.
It all worked fine until I recently switched to Python 3.
I tested the same .html file on another machine running Python 2, where it works and returns the page contents.
soup = BeautifulSoup(open('page.html'), "lxml")
The machine with Python 3 fails with:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x92 in position 298670: illegal multibyte sequence
I searched around and tried the following, but none of it worked (using 'r' or 'rb' makes no real difference):
soup = BeautifulSoup(open('page.html', 'r'), "lxml")
soup = BeautifulSoup(open('page.html', 'r'), 'html.parser')
soup = BeautifulSoup(open('page.html', 'r'), 'html5lib')
soup = BeautifulSoup(open('page.html', 'r'), 'xml')
How can I use Python 3 to parse this HTML page?
Thanks.
Recommended answer
"It all worked fine until I recently switched to Python 3."
In Python 3, strings are Unicode text, so when you open a file in text mode Python decodes its bytes as it reads them. With no encoding argument it uses the platform's default encoding, which is 'gbk' in the error above.
Python 2, on the other hand, uses byte strings and just returns the content of the file as-is.
Try opening page.html in binary mode (open('page.html', 'rb')) and see if that works for you.
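The difference between text mode and binary mode can be reproduced with the standard library alone. The following is a minimal sketch, not the asker's actual file: the sample markup is made up, and cp1252 is only an assumed source encoding (the byte 0x92 is a curly apostrophe there, which would fit the error message, but the real page may use something else).

```python
import os
import tempfile

# Hypothetical sample page. 0x92 is a curly apostrophe in cp1252 (an assumed
# source encoding); followed by a space it is not a valid GBK byte sequence.
raw = b"<html><body><p>the users\x92 guide</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    with open(path, "wb") as f:
        f.write(raw)

    # Text mode decodes while reading. encoding="gbk" is passed explicitly
    # here to mimic a machine whose default encoding is GBK.
    try:
        with open(path, "r", encoding="gbk") as f:
            f.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode performs no decoding, so reading cannot fail this way.
    with open(path, "rb") as f:
        data = f.read()

# Decoding is now an explicit step with an encoding of your choosing.
text = data.decode("cp1252")

print(type(text_mode_error).__name__)
print(text)
```

Either the raw bytes or the decoded text can then be handed to BeautifulSoup, for example BeautifulSoup(data, "lxml"), which accepts bytes and tries to detect the encoding itself. If the detection guesses wrong and you know the real encoding, the from_encoding argument lets you state it.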