Beautiful Soup findAll doesn't find them all


Problem Description

I'm trying to parse a website and get some info with BeautifulSoup.findAll, but it doesn't find them all. I'm using Python 3.

The code is this:

#!/usr/bin/python3

from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# print(page.read())
soup = BeautifulSoup(page.read())

manga_img = soup.findAll('a', {'class': 'manga_img'}, limit=None)

for manga in manga_img:
    print(manga['href'])

It only prints half of them...

Answer

Different HTML parsers deal differently with broken HTML. That page serves broken HTML, and the lxml parser does not handle it very well:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.text, 'lxml')
>>> len(soup.findAll('a', {'class' : 'manga_img'}))
18

The standard library html.parser has less trouble with this specific page:

>>> soup = BeautifulSoup(r.text, 'html.parser')
>>> len(soup.findAll('a', {'class' : 'manga_img'}))
44

Translating that back to your specific code sample using urllib, you would specify the parser like this:

soup = BeautifulSoup(page.read(), 'html.parser')
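To verify offline that the parser argument is being honored, here is a minimal, self-contained sketch of the same findAll pattern. The HTML snippet and hrefs are made up for illustration; only the class name `manga_img` comes from the original question:

```python
from bs4 import BeautifulSoup

# A small stand-in document with two matching links and one that should be skipped.
html = """
<div>
  <a class="manga_img" href="/manga/one/">One</a>
  <a class="manga_img" href="/manga/two/">Two</a>
  <a class="other" href="/skip/">Skip</a>
</div>
"""

# Explicitly pick the stdlib parser, as the answer recommends.
soup = BeautifulSoup(html, "html.parser")

# findAll (alias of find_all) returns every <a> whose class matches.
links = [a["href"] for a in soup.findAll("a", {"class": "manga_img"})]
print(links)  # ['/manga/one/', '/manga/two/']
```

Running this against the real page would of course still depend on the network; the point is that the second argument to `BeautifulSoup` is where the parser choice is made.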
