Beautiful Soup findAll doesn't find them all
Problem description
I'm trying to parse a website and get some info with the find_all() method, but it doesn't find all of the matching elements.
This is the code:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen
page = urlopen ("http://mangafox.me/directory/")
# print (page.read ())
soup = BeautifulSoup (page.read ())
manga_img = soup.findAll ('a', {'class' : 'manga_img'}, limit=None)
for manga in manga_img:
    print (manga['href'])
It only prints half of them...
Solution
Different HTML parsers deal differently with broken HTML. That page serves broken HTML, and the lxml parser is not dealing very well with it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18
The standard library html.parser has less trouble with this specific page:
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
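To check whether the parser is the cause on some other page, the same comparison can be run across whichever parsers are installed. This is only a minimal sketch: the helper name count_matches_by_parser and the SAMPLE_HTML string are made up for illustration, and the sample is well-formed, so every parser should report the same count here. On broken markup the counts may differ, as they did above.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_matches_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of <a class="manga_img"> tags found}.

    Hypothetical helper: parsers that are not installed are skipped.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            continue
        counts[name] = len(soup.find_all("a", class_="manga_img"))
    return counts


# Illustrative, well-formed stand-in for a downloaded page (an assumption,
# not the real mangafox.me markup).
SAMPLE_HTML = """
<html><body>
  <a class="manga_img" href="/manga/one/">One</a>
  <a class="title" href="/manga/one/">One (title link)</a>
  <a class="manga_img" href="/manga/two/">Two</a>
</body></html>
"""

counts = count_matches_by_parser(SAMPLE_HTML)
for parser_name, found in counts.items():
    print(parser_name, found)
```

If the numbers disagree for a real page, the markup is probably malformed, and the parser that reports the most matches is usually the one to pass to BeautifulSoup.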
Translating that to your specific code sample using urllib, you would specify the parser thus:
soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can do the reading
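Putting it together, the whole script could look roughly like the sketch below. The function name manga_links is hypothetical. To keep the example runnable offline, an io.BytesIO object stands in for the response that urlopen would return. Both are file-like objects with a read() method, which BeautifulSoup accepts directly.

```python
import io

from bs4 import BeautifulSoup


def manga_links(page):
    """Return the href of every <a class="manga_img"> element.

    `page` may be a str, bytes, or a file-like object with read(),
    such as the response returned by urllib.request.urlopen().
    """
    # Name the parser explicitly so the result does not depend on
    # which third-party parsers happen to be installed.
    soup = BeautifulSoup(page, "html.parser")
    return [
        a["href"]
        for a in soup.find_all("a", class_="manga_img")
        if a.has_attr("href")
    ]


# Stand-in for urlopen("http://mangafox.me/directory/"); the markup is
# invented for illustration.
fake_response = io.BytesIO(
    b'<div>'
    b'<a class="manga_img" href="/manga/one/">One</a>'
    b'<a class="other" href="/x/">X</a>'
    b'<a class="manga_img" href="/manga/two/">Two</a>'
    b'</div>'
)

links = manga_links(fake_response)
for href in links:
    print(href)
```

With the real site, fake_response would be replaced by the object returned from urlopen, and the rest stays the same.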