Beautiful Soup 4 find_all doesn't find links that Beautiful Soup 3 finds

Problem description

I noticed a really annoying bug: BeautifulSoup4 (package: bs4) often finds fewer tags than the previous version (package: BeautifulSoup).

Here's a reproducible instance of that issue:

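# Python 2 only: the BeautifulSoup 3 package does not support Python 3.
import requests
import bs4            # Beautiful Soup 4
import BeautifulSoup  # Beautiful Soup 3

r = requests.get('http://wordpress.org/download/release-archive/')
# No parser is named here, so bs4 picks the "best" parser that is installed.
s4 = bs4.BeautifulSoup(r.text)
s3 = BeautifulSoup.BeautifulSoup(r.text)

# Count the <a> tags each version finds in the same page.
print('With BeautifulSoup 4 : {}'.format(len(s4.findAll('a'))))
print('With BeautifulSoup 3 : {}'.format(len(s3.findAll('a'))))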

Output:

With BeautifulSoup 4 : 557
With BeautifulSoup 3 : 1701

As you can see, the difference is not minor.

Here are the exact versions of the modules in case someone is wondering:

In [20]: bs4.__version__
Out[20]: '4.2.1'

In [21]: BeautifulSoup.__version__
Out[21]: '3.2.1'

Solution

You have lxml installed, which means that BeautifulSoup 4 will use that parser over the standard-library html.parser option.

You can upgrade lxml to 3.2.1 (which for me returns 1701 results for your test page); lxml itself uses libxml2 and libxslt, which may also be to blame here. You may have to upgrade those instead, or as well. See the lxml requirements page; currently libxml2 2.7.8 or newer is recommended.
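Before upgrading, it may help to see which versions are actually in use. The following is only a sketch: the helper names `lxml_stack_versions` and `meets_minimum` are made up for this illustration, and the 2.7.8 threshold is simply the recommendation quoted above. It relies on the version tuples that `lxml.etree` exposes.

```python
def lxml_stack_versions():
    """Return version tuples for lxml, libxml2 and libxslt, or None without lxml."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,        # e.g. (3, 2, 1, 0)
        "libxml2": etree.LIBXML_VERSION,   # version of the libxml2 in use
        "libxslt": etree.LIBXSLT_VERSION,  # version of the libxslt in use
    }


def meets_minimum(version, minimum=(2, 7, 8)):
    """Hypothetical helper: compare the first three parts of a version tuple."""
    return tuple(version[:3]) >= tuple(minimum)


versions = lxml_stack_versions()
if versions is None:
    print("lxml is not installed")
else:
    for name, version in sorted(versions.items()):
        print(name, ".".join(str(part) for part in version))
    print("libxml2 >= 2.7.8:", meets_minimum(versions["libxml2"]))
```

If the libxml2 line reports a version older than 2.7.8, upgrading the underlying library (not just the lxml package) is probably the first thing to try.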

Or explicitly specify the other parser when parsing the soup:

s4 = bs4.BeautifulSoup(r.text, 'html.parser')
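To check whether the parser really is what makes the difference, one option is to parse the same markup with every parser that happens to be installed and compare the counts. This is a rough sketch rather than a definitive diagnostic: `count_links` and the sample markup are invented for the example, and it assumes a bs4 release that exposes `bs4.FeatureNotFound` for parsers that are not installed.

```python
import bs4


def count_links(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <a> tags found, or None if unavailable}."""
    counts = {}
    for name in parsers:
        try:
            soup = bs4.BeautifulSoup(html, name)
        except bs4.FeatureNotFound:
            # This parser is not installed; record that and move on.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("a"))
    return counts


# Tiny stand-in document; in practice pass r.text from the request above.
sample = '<p><a href="/a">a</a> <a href="/b">b</a></p>'
for parser, count in sorted(count_links(sample).items()):
    print(parser, count)
```

If the counts disagree on the real page, the parser reporting the lower number is the likely culprit, and naming a different parser explicitly should work around it.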
