Beautiful Soup findAll doesn't find them all


Problem Description

I'm trying to parse a website and get some info with BeautifulSoup.findAll, but it doesn't find them all. I'm using Python 3.

The code is this:

#!/usr/bin/python3

from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# print(page.read())
soup = BeautifulSoup(page.read())

manga_img = soup.findAll('a', {'class': 'manga_img'}, limit=None)

for manga in manga_img:
    print(manga['href'])

It only prints half of them...

Answer

Different HTML parsers deal differently with broken HTML. That page serves broken HTML, and the lxml parser does not handle it very well:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.text, 'lxml')
>>> len(soup.findAll('a', {'class' : 'manga_img'}))
18

The standard library html.parser has less trouble with this specific page:

>>> soup = BeautifulSoup(r.text, 'html.parser')
>>> len(soup.findAll('a', {'class' : 'manga_img'}))
44

Translating that back to your specific code sample using urllib, you would specify the parser like this:

soup = BeautifulSoup(page.read(), 'html.parser')
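To verify offline that the parser argument is being honored, here is a minimal, self-contained sketch of the same findAll pattern. The HTML snippet and hrefs are made up for illustration; only the class name `manga_img` comes from the original question:

```python
from bs4 import BeautifulSoup

# A small stand-in document with two matching links and one that should be skipped.
html = """
<div>
  <a class="manga_img" href="/manga/one/">One</a>
  <a class="manga_img" href="/manga/two/">Two</a>
  <a class="other" href="/skip/">Skip</a>
</div>
"""

# Explicitly pick the stdlib parser, as the answer recommends.
soup = BeautifulSoup(html, "html.parser")

# findAll (alias of find_all) returns every <a> whose class matches.
links = [a["href"] for a in soup.findAll("a", {"class": "manga_img"})]
print(links)  # ['/manga/one/', '/manga/two/']
```

Running this against the real page would of course still depend on the network; the point is that the second argument to `BeautifulSoup` is where the parser choice is made.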
