Python + Selenium Firefox webdriver - pulling images out of a website


Problem description

I am trying to pull images out of a web page using Python 2.7 + Selenium (with Firefox) + Beautiful Soup.

The page loads dynamically, so I'm using Selenium for screen scraping. Everything looks fine on the front end, but once all the images have loaded and I look at the HTML, I can't see the links to the images. Any ideas what could be going on here?

The site is https://flipp.com/flyers?postal_code=97035 , and from there, navigate to https://flipp.com/weekly_ad/1550082-big-5-sporting-goods-weekly-ad to see the first weekly ad (my working code is below).

To make things even weirder, I can see in the inspector window that the images ARE loading, but I still can't see them in the HTML. Any idea what's going on here, and how to grab the updated HTML (after the images load)?

Here is the set of images I am able to pull from the HTML (by appending jpg). These are just for the popup windows that appear when you hover over the canvas.

What I am trying to get are the images that make up the actual pages/canvas. I can see them come through (using the traffic option in Firefox), but for some reason they do not appear in the HTML. Any idea what's going on here?

Working code:

#import packages
from time import gmtime, strftime,sleep, time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
#scraping packages
from bs4 import BeautifulSoup


USAPROXY = "177.84.23.122:3128"
def launch_webdriver(proxy):
    # Split "host:port" into its two parts
    proxy_host, _, proxy_port = proxy.rpartition(':')
    fp = webdriver.FirefoxProfile()
    # Direct = 0, Manual = 1, PAC = 2, AUTODETECT = 4, SYSTEM = 5
    fp.set_preference("network.proxy.type", 1)
    fp.set_preference("network.proxy.http", proxy_host)
    fp.set_preference("network.proxy.http_port", int(proxy_port))
    fp.set_preference("network.proxy.ssl", proxy_host)
    fp.set_preference("network.proxy.ssl_port", int(proxy_port))
    fp.set_preference("general.useragent.override", "whater_useragent")
    fp.update_preferences()
    return webdriver.Firefox(firefox_profile=fp)




def test():
    driver = launch_webdriver(USAPROXY)
    driver.set_page_load_timeout(11)
    driver.get("https://flipp.com/flyers?postal_code=97035")
    sleep(15)
    driver.get("https://flipp.com/weekly_ad/1550082-big-5-sporting-goods-weekly-ad")
    sleep(5)
    my_html = driver.page_source
    soup = BeautifulSoup(my_html, 'lxml')
    tags = soup.find_all('img')  # finds only 3 imgs, there should be 100s
    for tag in tags:
        print(tag)
    print(soup.prettify())


# execute script
test()

Solution

You don't see the updated HTML in my_html = driver.page_source because page_source can return the HTML as it was before the page finished loading dynamically. Try this instead to get the HTML after the page has loaded:

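# Option 1: read the live DOM through JavaScript
my_html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
# Option 2: read it through the element API
# (Selenium 4.3+ removed find_element_by_tag_name; use driver.find_element(By.TAG_NAME, 'html') there)
my_html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')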
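
The fixed sleep() calls in the question are a guess at how long the page needs. As a rough sketch of waiting on the DOM instead, the helper below polls until a minimum number of <img> elements exist and then serializes the current DOM. The names wait_for_images and get_rendered_html are hypothetical, the code is written for Python 3, and it assumes driver is a Selenium WebDriver (it only relies on execute_script). Selenium's own WebDriverWait can express the same wait. Note that if the site draws its pages onto a canvas, those images may never show up as <img> elements, however long you wait.

```python
import time


def wait_for_images(driver, minimum=1, timeout=30.0, poll=0.5):
    """Poll until the live DOM holds at least `minimum` <img> elements.

    `driver` is assumed to be a Selenium WebDriver (anything exposing
    execute_script works). Returns the last count seen; the caller decides
    what to do if it is still below `minimum` once `timeout` seconds pass.
    """
    deadline = time.time() + timeout
    while True:
        # document.images is the live collection of <img> elements
        count = driver.execute_script("return document.images.length;")
        if count >= minimum or time.time() >= deadline:
            return count
        time.sleep(poll)


def get_rendered_html(driver):
    """Serialize the DOM as it currently stands, after scripts have run."""
    return driver.execute_script("return document.documentElement.outerHTML;")
```

A possible call sequence would be count = wait_for_images(driver, minimum=10) followed by my_html = get_rendered_html(driver); the threshold of 10 is only an illustrative value.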


EDIT:

Okay, I think I came up with what you are looking for. I found a way to access the network resources and get the performance data that the browser is logging. Call this function and pass the driver once it has loaded the page you want, and it should return the images in the format you're looking for:

def getNetworkImages(driver):
    # Ask the browser for every resource entry recorded by the Performance API
    resources = driver.execute_script("return window.performance.getEntriesByType('resource');")
    # Keep only the URLs of resources that were requested as images
    image_list = [r['name'] for r in resources if r['initiatorType'] == 'img']
    for image in image_list:
        print(image)
    return image_list

Note: This was tested with Chrome 64 and Chromedriver 2.35.
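
Depending on how a page fetches its tiles, some image requests may be reported with an initiatorType other than 'img' (for example when they are fetched by script), and the same URL can appear more than once. The sketch below is one possible post-processing step for the list returned by getEntriesByType('resource'): it keeps entries that either have initiatorType 'img' or whose URL path ends in a common image extension, and removes duplicates. The function name filter_image_urls and the extension list are assumptions made for illustration, and the code is written for Python 3.

```python
from urllib.parse import urlparse

# Assumed set of extensions; adjust to whatever the target site serves
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


def filter_image_urls(entries, extensions=IMAGE_EXTENSIONS):
    """Return unique image URLs from resource entries, in first-seen order."""
    seen = set()
    urls = []
    for entry in entries:
        url = entry.get("name", "")
        # Compare against the path only, so query strings do not interfere
        path = urlparse(url).path.lower()
        is_image = entry.get("initiatorType") == "img" or path.endswith(extensions)
        if is_image and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

With a live driver, the input would be the raw list from driver.execute_script("return window.performance.getEntriesByType('resource');"). Browsers also cap how many resource entries they keep, so on a page with hundreds of images the list may be incomplete.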
