"illegal multibyte sequence" error from BeautifulSoup in Python 3


Question

I have an .html file saved to local disk, and I am using BeautifulSoup (bs4) to parse it.

It all worked fine until I recently switched to Python 3.

I tested the same .html file on another machine running Python 2; there it works and returns the page contents.

soup = BeautifulSoup(open('page.html'), "lxml")

The machine with Python 3 doesn't work, and it says:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x92 in position 298670: illegal multibyte sequence

I searched around and tried the following, but none of them worked (using 'r' or 'rb' doesn't make much difference):

soup = BeautifulSoup(open('page.html', 'r'), "lxml")
soup = BeautifulSoup(open('page.html', 'r'), 'html.parser')
soup = BeautifulSoup(open('page.html', 'r'), 'html5lib')
soup = BeautifulSoup(open('page.html', 'r'), 'xml')

How can I parse this HTML page with Python 3?

Thanks.

Recommended answer

"It all worked fine until I recently switched to Python 3."

In Python 3, str is Unicode text, so when you open a file in text mode Python decodes its bytes. With no encoding argument, open() uses the platform's default encoding, which the error message shows is GBK on your machine. Python 2, on the other hand, uses byte strings and just returns the content of the file as-is. Try opening page.html in binary mode (open('page.html', 'rb')) so that BeautifulSoup receives the raw bytes, and see if that works for you.
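As a rough illustration of the difference between text mode and binary mode, here is a minimal sketch. The sample bytes, the temporary file, and the helper names (write_sample, read_as_text, read_as_bytes) are made up for this example and are not from the original page. It assumes only that the file contains a byte 0x92 that is not valid GBK at that position, as in the traceback. The BeautifulSoup call is wrapped in a try block so the sketch still runs where bs4 is not installed.

```python
import os
import tempfile

# Hypothetical sample page: byte 0x92 followed by a space is not a valid GBK sequence.
RAW_HTML = b"<html><body><p>quote\x92 end</p></body></html>"


def write_sample(path):
    """Write the sample bytes to disk exactly as given."""
    with open(path, "wb") as f:
        f.write(RAW_HTML)


def read_as_text(path, encoding):
    """Text mode: Python 3 decodes the file's bytes with the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def read_as_bytes(path):
    """Binary mode: no decoding happens, so no UnicodeDecodeError can occur here."""
    with open(path, "rb") as f:
        return f.read()


tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "page.html")
write_sample(sample_path)

# Reproduce the failure: decoding as GBK hits the stray 0x92 byte.
try:
    read_as_text(sample_path, "gbk")
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True

# Reading raw bytes avoids the decoding step altogether.
raw = read_as_bytes(sample_path)

try:
    from bs4 import BeautifulSoup

    # BeautifulSoup accepts bytes and tries to detect the encoding itself.
    soup = BeautifulSoup(raw, "html.parser")
except ImportError:
    soup = None  # bs4 is not installed; the file-reading part above still applies
```

If the real encoding of the page is known, passing it explicitly is another option, either as open('page.html', encoding=...) or as the from_encoding argument of BeautifulSoup. Which encoding is right for this particular file cannot be determined from the traceback alone.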
