Beautiful Soup 4 find_all doesn't find links that Beautiful Soup 3 finds
Question
I noticed a really annoying bug: BeautifulSoup4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup).
Here's a reproducible instance of the issue:
import requests
import bs4
import BeautifulSoup
r = requests.get('http://wordpress.org/download/release-archive/')
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)
print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))
Output:
With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701
As you can see, the difference is not minor.
Here are the exact versions of the modules, in case anyone is wondering:
In [20]: bs4.__version__
Out[20]: '4.2.1'
In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'
Accepted answer
You have lxml installed, which means that BeautifulSoup 4 will use that parser over the standard-library html.parser option.
You can upgrade lxml to 3.2.1 (which for me returns 1701 results for your test page); lxml itself uses libxml2 and libxslt, which may be to blame here too. You may have to upgrade those instead, or as well. See the lxml requirements page; currently libxml2 2.7.8 or newer is recommended.
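To see which lxml and libxml2 versions are actually in play before upgrading, lxml exposes its own version and the libxml2 versions it runs against and was compiled against as tuples in `lxml.etree`. A minimal sketch (assuming lxml is installed; the `ImportError` branch is just a guard for environments without it):

```python
# Inspect the lxml / libxml2 versions BeautifulSoup 4 would be using.
try:
    from lxml import etree

    print("lxml:", etree.LXML_VERSION)
    print("libxml2 (running):", etree.LIBXML_VERSION)
    print("libxml2 (compiled against):", etree.LIBXML_COMPILED_VERSION)
except ImportError:
    print("lxml is not installed; bs4 will fall back to html.parser")
```

If the running libxml2 is older than 2.7.8, upgrading it (or lxml itself) is the first thing to try.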
Or explicitly specify the other parser when parsing the soup:
s4 = bs4.BeautifulSoup(r.text, 'html.parser')
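The underlying point is that counting `<a>` tags is only as reliable as the parser's recovery from sloppy markup (the WordPress page omits closing tags, which tripped up the old libxml2). As a self-contained illustration using only the standard library, here is a small `HTMLParser` subclass (a hypothetical helper, not part of BeautifulSoup) that counts anchor start tags the way `len(soup.find_all('a'))` would:

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Count <a> start tags, mimicking len(soup.find_all('a'))."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


counter = LinkCounter()
# Note the unclosed <li> tags: a lenient parser still sees both links.
counter.feed("<ul><li><a href='/x'>x</a><li><a href='/y'>y</a></ul>")
print(counter.count)  # -> 2
```

Any parser that silently drops such links on malformed input will undercount, which is exactly the 557-vs-1701 discrepancy in the question.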