Beautiful Soup 4 find_all doesn't find links that Beautiful Soup 3 finds
I noticed a really annoying bug: BeautifulSoup4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup).
Here's a reproducible instance of that issue:
import requests
import bs4
import BeautifulSoup
r = requests.get('http://wordpress.org/download/release-archive/')
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)
print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))
Output:
With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701
As you can see, the difference is not minor.
Here are the exact versions of the modules in case someone is wondering:
In [20]: bs4.__version__
Out[20]: '4.2.1'
In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'
You have lxml installed, which means that BeautifulSoup 4 will use that parser over the standard-library html.parser option.
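Which parser Beautiful Soup 4 picks when none is named depends on what is installed. The sketch below guesses that default, assuming the preference order described in the Beautiful Soup 4 documentation (lxml, then html5lib, then html.parser). The helper name `likely_default_parser` is made up for illustration, and the code is written for Python 3 rather than the Python 2 used in the question.

```python
import importlib.util


def likely_default_parser():
    """Guess which parser bs4 would pick when none is named.

    Assumption: bs4 prefers lxml, then html5lib, then the built-in
    html.parser, as its documentation describes.
    """
    for module_name in ("lxml", "html5lib"):
        # find_spec returns None when the module cannot be imported
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    return "html.parser"


if __name__ == "__main__":
    print("bs4 would most likely use:", likely_default_parser())
```

If this reports lxml on a machine where the link count looks too low, the installed lxml or libxml2 is the first thing to check.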
You can upgrade lxml to 3.2.1 (which for me returns 1701 results for your test page); lxml itself uses libxml2 and libxslt, which may also be to blame here. You may have to upgrade those instead, or as well. See the lxml requirements page; currently libxml2 2.7.8 or newer is recommended.
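To see which versions are actually in use, something like the following may help. It is a minimal Python 3 sketch: the function names `lxml_versions` and `libxml2_is_recent` are hypothetical, while `LXML_VERSION`, `LIBXML_VERSION` and `LIBXML_COMPILED_VERSION` are version tuples exposed by `lxml.etree`. The 2.7.8 threshold is the one quoted above.

```python
def lxml_versions():
    """Return lxml and libxml2 version tuples, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,
        # libxml2 loaded at run time, which can differ from the build-time one
        "libxml2_runtime": etree.LIBXML_VERSION,
        "libxml2_compiled": etree.LIBXML_COMPILED_VERSION,
    }


def libxml2_is_recent(versions, minimum=(2, 7, 8)):
    """True if the run-time libxml2 is at least `minimum`."""
    return tuple(versions["libxml2_runtime"][:3]) >= minimum


if __name__ == "__main__":
    found = lxml_versions()
    if found is None:
        print("lxml is not installed")
    else:
        print(found)
        print("libxml2 recent enough:", libxml2_is_recent(found))
```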
Or explicitly specify the other parser when parsing the soup:
s4 = bs4.BeautifulSoup(r.text, 'html.parser')
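For a number to compare against that does not depend on Beautiful Soup or lxml at all, one option is to count anchor start tags directly with the standard library's `html.parser` module. This is only a rough cross-check, written for Python 3; the names `AnchorCounter` and `count_anchors` are made up for this sketch.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts <a> start tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook
        if tag == "a":
            self.count += 1


def count_anchors(html):
    """Return the number of <a> start tags in an HTML string."""
    counter = AnchorCounter()
    counter.feed(html)
    counter.close()
    return counter.count
```

Calling `count_anchors(r.text)` on the page from the question gives a baseline to hold against `len(s4.findAll('a'))` for each parser you try.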