Beautiful Soup 4 find_all doesn't find links that Beautiful Soup 3 finds

Views: 265 Published: 2016/8/5 18:52:42 Tags: python web web-scraping beautifulsoup

Problem description

I noticed a really annoying bug: BeautifulSoup4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup).

Here's a reproducible instance of that issue:

import requests
import bs4
import BeautifulSoup

r = requests.get('http://wordpress.org/download/release-archive/')
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)

print 'With BeautifulSoup 4 : {}'.format(len(s4.findAll('a')))
print 'With BeautifulSoup 3 : {}'.format(len(s3.findAll('a')))

Output:

With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701

As you can see, the difference is not minor.

Here are the exact versions of the modules in case someone is wondering:

In [20]: bs4.__version__
Out[20]: '4.2.1'

In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'

Solution

You have lxml installed, which means that BeautifulSoup 4 will use that parser over the standard-library html.parser option.
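
The standard-library parser mentioned here lives in the html.parser module, and its HTMLParser class can be driven directly to get a link count that depends on neither lxml nor libxml2. The following is a minimal sketch of such a cross-check, assuming Python 3; LinkCounter, count_links and SAMPLE_HTML are illustrative names, and the small sample document stands in for the live page fetched in the question.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> start tags (illustrative helper, not part of bs4)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        if tag == "a":
            self.count += 1


def count_links(html):
    """Return the number of <a> start tags found in an HTML string."""
    counter = LinkCounter()
    counter.feed(html)
    counter.close()
    return counter.count


# Stand-in document (an assumption; the question used a live page).
SAMPLE_HTML = """
<ul>
  <li><a href="/a.zip">zip</a> <A HREF="/a.tar.gz">tar.gz</A></li>
  <li><a href="/b.zip">zip</a></li>
</ul>
"""

print(count_links(SAMPLE_HTML))  # prints 3
```

Feeding the real page text to count_links gives a figure to compare against the two Beautiful Soup counts.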

You can upgrade lxml to 3.2.1 (which for me returns 1701 results for your test page); lxml itself uses libxml2 and libxslt, which may also be to blame here. You may have to upgrade those instead, or as well. See the lxml requirements page; currently libxml2 2.7.8 or newer is recommended.
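
Before upgrading, it can help to see which versions are actually in use. The sketch below reads the version tuples that lxml.etree exposes; it assumes Python 3, and lxml_versions is a hypothetical helper name. If lxml is not installed, it returns None instead of failing.

```python
def lxml_versions():
    """Return lxml, libxml2 and libxslt version tuples, or None without lxml."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2 (runtime)": etree.LIBXML_VERSION,
        "libxml2 (compiled against)": etree.LIBXML_COMPILED_VERSION,
        "libxslt (runtime)": etree.LIBXSLT_VERSION,
    }


versions = lxml_versions()
if versions is None:
    print("lxml is not installed")
else:
    for name, version in versions.items():
        # Each value is a tuple of integers, e.g. (2, 7, 8) for libxml2.
        print("{}: {}".format(name, ".".join(map(str, version))))
```

The runtime libxml2 version is the one to compare against the 2.7.8 recommendation, since it can differ from the version lxml was compiled against.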

Or explicitly specify the other parser when parsing the soup:

s4 = bs4.BeautifulSoup(r.text, 'html.parser')
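
To see how much the parser choice matters for a given page, the same markup can be parsed once per installed parser and the link counts compared. This is a rough sketch assuming Python 3 and bs4; count_links_by_parser, default_parser_name, PARSERS and SAMPLE_HTML are illustrative names, and the import is guarded only so the script degrades gracefully where bs4 is missing.

```python
import warnings

try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # bs4 missing: the helpers below then return nothing
    BeautifulSoup = None

# Parser names accepted by bs4; lxml and html5lib are optional installs.
PARSERS = ("html.parser", "lxml", "html5lib")


def count_links_by_parser(html, parsers=PARSERS):
    """Map each installed parser name to the number of <a> tags it yields."""
    counts = {}
    if BeautifulSoup is None:
        return counts
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # this parser is not installed; skip it
        counts[name] = len(soup.find_all("a"))
    return counts


def default_parser_name(html="<p></p>"):
    """Name of the parser bs4 picks when none is requested, or None."""
    if BeautifulSoup is None:
        return None
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # newer bs4 warns when it guesses
        return BeautifulSoup(html).builder.NAME


# Stand-in document (an assumption; the question used a live page).
SAMPLE_HTML = """
<ul>
  <li><a href="/a.zip">zip</a> <a href="/a.tar.gz">tar.gz</a></li>
  <li><a href="/b.zip">zip</a></li>
</ul>
"""

print("default parser:", default_parser_name())
for parser, count in count_links_by_parser(SAMPLE_HTML).items():
    print("{}: {} links".format(parser, count))
```

On a small, well-formed sample like this one, the installed parsers should agree. Passing r.text from the question instead shows whether one of them reports fewer links on the real page.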
