Beautiful Soup findAll doesn't find them all


Problem Description

I'm trying to parse a website and get some info with BeautifulSoup.findAll, but it doesn't find them all. I'm using Python 3.

The code is this:

#!/usr/bin/python3

from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# print(page.read())
soup = BeautifulSoup(page.read())

manga_img = soup.findAll('a', {'class': 'manga_img'}, limit=None)

for manga in manga_img:
    print(manga['href'])

It only prints half of them...

Answer

Different HTML parsers handle broken HTML differently. That page serves broken HTML, and the lxml parser does not deal with it very well:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18

The standard library html.parser has less trouble with this specific page:

>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
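
If you want to see the difference side by side, a minimal sketch along these lines (assuming lxml is installed and the site still serves the same markup) prints the match count for each parser:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://mangafox.me/directory/')

# Parse the same bytes with each parser and compare how many links each finds.
# 'lxml' needs the third-party lxml package; 'html.parser' ships with Python.
for parser in ('lxml', 'html.parser'):
    soup = BeautifulSoup(r.content, parser)
    print(parser, len(soup.find_all('a', class_='manga_img')))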

Translating that to your specific code sample using urllib, you would specify the parser like this:

soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can do the reading
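
Putting it together, the original script with only that change applied would look roughly like this (a sketch only; the site may have changed since the question was asked):

#!/usr/bin/python3

from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# Pass the response object straight to BeautifulSoup and name the parser
# explicitly; the standard-library parser copes better with this page's
# broken HTML than lxml does.
soup = BeautifulSoup(page, 'html.parser')

for manga in soup.find_all('a', class_='manga_img'):
    print(manga['href'])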

