BeautifulSoup returns incomplete HTML

Views: 593 | Posted: 2020/5/25 1:18:43 | Tags: python parsing beautifulsoup flickr

Question

I am reading a book about Python. One of its homework projects reads: "Write a program that goes to a photo-sharing site like Flickr or Imgur, searches for a category of photos, and then downloads all the resulting images." The book suggests using only the webbrowser, requests, and bs4 libraries.

I cannot get this to work for Flickr. The parser does not seem to see inside the element <div class="interaction-view">. Using "Inspect element" in Chrome, I can see that it contains several div elements and an a element, but when I parse the page with bs4 they are missing.

My code is as follows:

#!/usr/bin/env python3
# To download photos from Flickr

import requests, bs4

search_name = "spam"
website_name = requests.get('https://www.flickr.com/search/?text='
                       + search_name)
website_name.raise_for_status()
parse_obj = bs4.BeautifulSoup(website_name.text, "html.parser")
elements = parse_obj.select('body #content main .main.search-photos-results \
                .view.photo-list-view.requiredToShowOnServer \
                .view.photo-list-photo-view.requiredToShowOnServer.awake \
                .interaction-view')
print(elements)

This prints only:

[<div class="interaction-view"></div>, <div class="interaction-view"></div>...]

The divs contain no nested elements, and I do not understand why. Thank you!

Recommended answer

The issue is that the content of <div class="interaction-view"></div> on Flickr is loaded only via JavaScript. You can confirm this by viewing the page source: you will find <div class="interaction-view"></div> with nothing inside the div tag. requests downloads only that raw source and BeautifulSoup parses it as-is, whereas Chrome's inspector shows the DOM after the scripts have run.
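To make the difference concrete, here is a minimal sketch that counts the tags nested inside div.interaction-view for two HTML strings. Both strings are made-up stand-ins (not real Flickr markup): one for the raw source that requests receives, one for the DOM after JavaScript has run. The class NestedTagCounter and the function count_nested are hypothetical names. The sketch uses only Python's standard html.parser module, which is the parser backend selected by the "html.parser" argument in the question's code.

```python
from html.parser import HTMLParser

# Made-up stand-ins: what requests would receive (static) versus
# what the browser holds after JavaScript has run (rendered).
STATIC_HTML = '<main><div class="interaction-view"></div></main>'
RENDERED_HTML = (
    '<main><div class="interaction-view">'
    '<div><a href="#"></a></div>'
    '</div></main>'
)


class NestedTagCounter(HTMLParser):
    """Count start tags found inside <div class="interaction-view">.

    Simplification: assumes every tag is explicitly closed.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside the target div (0 = outside)
        self.nested = 0  # start tags seen inside the target div

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.nested += 1
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "interaction-view" in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1


def count_nested(html):
    """Return the number of tags nested inside the target div(s)."""
    parser = NestedTagCounter()
    parser.feed(html)
    parser.close()
    return parser.nested


print(count_nested(STATIC_HTML))    # static source: nothing inside the div
print(count_nested(RENDERED_HTML))  # rendered DOM: nested div and a
```

The parser is not at fault in the static case; the nested elements are simply absent from the text it was given.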

You need to execute the JavaScript somehow. BeautifulSoup does not do this, so one solution is to use Selenium: run pip install selenium and install geckodriver for Firefox (on OS X: brew install geckodriver). Then change your code to load the page with Selenium:

#!/usr/bin/env python3

import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

search_name = "spam"
url = 'https://www.flickr.com/search/?text=%s' % search_name

# Load the page in a real browser so that its JavaScript runs
browser = webdriver.Firefox()
browser.get(url)
delay = 3  # maximum number of seconds to wait
# The expected condition takes a (By, value) locator tuple, not an element
WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, '...')))

soup = bs4.BeautifulSoup(browser.page_source, "html.parser")


elements = soup.select('body #content main .main.search-photos-results \
                .view.photo-list-view.requiredToShowOnServer \
                .view.photo-list-photo-view.requiredToShowOnServer.awake \
                .interaction-view')
print(elements)

The WebDriverWait part is needed so that Selenium waits until a certain element has loaded before the page source is parsed. You need to change ... to an id you know will be present. See this answer for how the same can be done with classes.
