Beautiful Soup 4 find_all doesn't find links that Beautiful Soup 3 finds


Problem description

I noticed a really annoying bug: BeautifulSoup 4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup).

Here's a reproducible instance of the issue:

# Python 2 script: Beautiful Soup 3 does not support Python 3,
# hence the print statements below.
import requests
import bs4            # Beautiful Soup 4
import BeautifulSoup  # Beautiful Soup 3

r = requests.get('http://wordpress.org/download/release-archive/')
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)

print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))

Output:

With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701

The difference is not minor, as you can see.

Here are the exact versions of the modules, in case someone is wondering:

In [20]: bs4.__version__
Out[20]: '4.2.1'

In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'

Answer

You have lxml installed, which means that BeautifulSoup 4 will use that parser over the standard-library html.parser option.

You can upgrade lxml to 3.2.1 (which for me returns 1701 results for your test page); lxml itself uses libxml2 and libxslt, which may be to blame here too. You may have to upgrade those instead or as well. See the lxml requirements page; currently libxml2 2.7.8 or newer is recommended.

Or explicitly specify the other parser when creating the soup:

s4 = bs4.BeautifulSoup(r.text, 'html.parser')
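If you want a parser-independent sanity check of how many anchors a page actually contains, the standard-library HTML parser can count `<a>` start tags directly, with no Beautiful Soup involved. A minimal sketch (written for Python 3, where the module is `html.parser`; on Python 2 it is `HTMLParser`), using a made-up snippet rather than the WordPress page:

```python
# Count <a> start tags with the standard-library HTML parser,
# as an independent cross-check against what a soup reports.
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Increments a counter for every <a> start tag encountered."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.count += 1


def count_anchors(html):
    parser = AnchorCounter()
    parser.feed(html)
    return parser.count


print(count_anchors('<p><a href="/x">x</a> and <a href="/y">y</a></p>'))  # 2
```

If the count reported by `find_all('a')` is far below a cross-check like this, it suggests the underlying parser is choking on the markup rather than the document genuinely having fewer links.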
