Beautiful Soup default decode charset?

Question

I have a huge set of web pages with different encodings, and I am trying to parse them with Beautiful Soup.

As far as I can tell, BS detects the encoding from meta charset or XML encoding declarations. But some documents have no such tags, or have typos in the charset name, and BS fails on all of them. I suppose its default guess is UTF-8, which is wrong here. Luckily, all such pages (or nearly all of them) share the same encoding. Is there any way to set it as the default?

I've also tried grepping for the charset and converting to UTF-8 with iconv first. That works nicely and gives perfectly readable UTF-8 output, but BeautifulSoup(sys.stdin.read()) sometimes (rarely, in about 0.05% of all files) fails on it with

UnicodeDecodeError: 'utf8' codec can't decode byte *** in position ***: invalid start byte

The underlying reason, I think, is that although the actual encoding is already UTF-8, the meta tags still declare the previous one, so BS gets confused. Its behavior here is really strange: it parses fine once I delete one or another ordinary character (such as '-' or '*', nothing exotic). So I gave up on that approach, and I would really prefer to rely on Beautiful Soup's native decoding, which is also a bit faster.

Recommended answer

BeautifulSoup does make an educated guess, using a character detection library. That process can be wrong; removing just one character can indeed radically change the outcome for certain types of documents.

You can override this guess by specifying an input codec:

soup = BeautifulSoup(source, from_encoding=codec)

You could use exception handling here to only apply the manual codec when decoding failed:

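from bs4 import BeautifulSoup

try:
    # Let Beautiful Soup detect the encoding first.
    soup = BeautifulSoup(source)
except UnicodeDecodeError:
    # Fall back to the known codec only when detection fails.
    soup = BeautifulSoup(source, from_encoding=codec)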
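
A related option, not part of the original answer, is to decode the bytes yourself and hand Beautiful Soup a Unicode string. The sketch below uses only the standard library; the helper name decode_with_default and the sample windows-1251 page are assumptions made for illustration. It tries strict UTF-8 first and falls back to the one codec you know the remaining pages share.

```python
def decode_with_default(raw: bytes, default_codec: str) -> str:
    """Decode raw as strict UTF-8, else fall back to default_codec."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" keeps a single stray byte from aborting the decode.
        return raw.decode(default_codec, errors="replace")


# Hypothetical sample: a page stored as windows-1251 with no charset tag.
page = "<p>Привет</p>".encode("windows-1251")
text = decode_with_default(page, "windows-1251")
```

The resulting string can then be passed to BeautifulSoup(text). When given Unicode input, Beautiful Soup has no encoding left to guess, so a missing or stale meta charset declaration no longer affects decoding.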

Also see the Encodings section of the BeautifulSoup documentation.
