Python: Unable to download with selenium in webpage

Problem description

My goal is to download a zip file from https://www.shareinvestor.com/prices/price_download_zip_file.zip?type=history_all&market=bursa. It is a link on this webpage: https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa. I then want to save it into the directory "/home/vinvin/shKLSE/" (I am using PythonAnywhere), unzip it, and extract the CSV file into that directory.

The code runs to the end with no error, but nothing is downloaded. The zip file downloads automatically when I click https://www.shareinvestor.com/prices/price_download_zip_file.zip?type=history_all&market=bursa manually.

My code is below. It uses a real, working username and password so that the problem is easier to reproduce.

    #!/usr/bin/python
    print "hello from python 2"

    import urllib2
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time
    from pyvirtualdisplay import Display
    import requests, zipfile, os    

    display = Display(visible=0, size=(800, 600))
    display.start()

    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', "/home/vinvin/shKLSE/")
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk', '/zip')

    for retry in range(5):
        try:
            browser = webdriver.Firefox(profile)
            print "firefox"
            break
        except:
            time.sleep(3)
    time.sleep(1)

    browser.get("https://www.shareinvestor.com/my")
    time.sleep(10)
    login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
    print browser.current_url
    username = browser.find_element_by_id("sic_login_header_username")
    password = browser.find_element_by_id("sic_login_header_password")
    print "find id done"
    username.send_keys("bkcollection")
    password.send_keys("123456")
    print "log in done"
    login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
    login_attempt.submit()
    browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
    print browser.current_url
    time.sleep(20)
    dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
    time.sleep(30)

    browser.close()
    browser.quit()
    display.stop()

    zip_ref = zipfile.ZipFile(/home/vinvin/sh/KLSE, 'r')
    zip_ref.extractall(/home/vinvin/sh/KLSE)
    zip_ref.close()
    os.remove(zip_ref)

HTML snippet:

<li><a href="/prices/price_download_zip_file.zip?type=history_all&amp;market=bursa">All Historical Data</a> <span>About 220 MB</span></li>

Note that &amp; appears when I copy the snippet. The snippet was hidden from view source, so I guess it is generated by JavaScript.

Observations:

  1. The directory /home/vinvin/shKLSE is not created, even though the code runs with no error.

  2. I tried downloading a much smaller zip file, which should complete within a second, but it still does not download after a 30 s wait: dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_daily&date=20170519&market=bursa']").click()

Solution

I rewrote your script, with comments explaining why I made each change. I think your main problem was probably a bad MIME type; however, your script also had a lot of systemic issues that would have made it unreliable at best. This rewrite uses explicit waits, which completely removes the need for time.sleep(), so it runs as fast as possible while also eliminating errors caused by network congestion.
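
To illustrate the difference between a fixed sleep and an explicit wait, here is a minimal, library-free sketch of the polling idea (written for Python 3). The `wait_until` helper and its parameters are hypothetical names made up for this illustration; they are not part of Selenium or the explicit module, although Selenium's `WebDriverWait` works along broadly similar lines.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    Hypothetical helper, for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Simulate something that becomes ready after roughly 0.3 seconds.
ready_at = time.monotonic() + 0.3
started = time.monotonic()
wait_until(lambda: time.monotonic() >= ready_at, timeout=5.0, poll=0.05)
elapsed = time.monotonic() - started
print("ready after %.2f s" % elapsed)
```

The wait returns as soon as the condition holds, so the 5 second timeout is only an upper bound. A fixed `time.sleep(5)` would always cost the full 5 seconds, and would still fail whenever the page needed 6.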

You will need to do the following to make sure all the modules are installed:

pip install requests explicit selenium retry pyvirtualdisplay

The script:

#!/usr/bin/python

from __future__ import print_function  # Makes your code portable

import os
import glob
import zipfile
from contextlib import contextmanager

import requests
from retry import retry
from explicit import waiter, XPATH, ID
from selenium import webdriver
from pyvirtualdisplay import Display
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait

DOWNLOAD_DIR = "/tmp/shKLSE/"


def build_profile():
    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', DOWNLOAD_DIR)
    # I think your `/zip` mime type was incorrect. This works for me
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                           'application/vnd.ms-excel,application/zip')

    return profile


# Retry is an elegant way to retry the browser creation, though you should
# narrow the scope to whatever exception you are actually retrying on.
# The decorator has to wrap a plain function: stacked on top of
# @contextmanager it would never fire, because the generator body only runs
# once the `with` statement is entered, after the decorated call has returned.
@retry(Exception, tries=5, delay=3)
def create_browser(profile):
    return webdriver.Firefox(profile)


@contextmanager  # This turns get_browser into a context manager
def get_browser():
    # Use a context manager with Display, so it will be closed even if an
    # exception is thrown
    profile = build_profile()
    with Display(visible=0, size=(800, 600)):
        browser = create_browser(profile)
        print("firefox")
        try:
            yield browser
        finally:
            # Let a try/finally block manage closing the browser, even if an
            # exception is raised
            browser.quit()


def main():
    print("hello from python 2")
    with get_browser() as browser:
        browser.get("https://www.shareinvestor.com/my")

        # Click the login button.
        # waiter is a helper module that makes it easy to use explicit waits;
        # with it you don't need to use time.sleep() calls at all
        login_xpath = '//*/div[@class="sic_logIn-bg"]/a'
        waiter.find_element(browser, login_xpath, XPATH).click()
        print(browser.current_url)

        # Log in
        username = "bkcollection"
        username_id = "sic_login_header_username"
        password = "123456"
        password_id = "sic_login_header_password"
        waiter.find_write(browser, username_id, username, by=ID)
        waiter.find_write(browser, password_id, password, by=ID, send_enter=True)

        # Wait for login process to finish by locating an element only found
        # after logging in, like the Logged In Nav
        nav_id = 'sic_loggedInNav'
        waiter.find_element(browser, nav_id, ID)

        print("log in done")

        # Load the target page
        target_url = ("https://www.shareinvestor.com/prices/price_download.html#/?"
                      "type=price_download_all_stocks_bursa")
        browser.get(target_url)
        print(browser.current_url)

        # Click the download button
        all_data_xpath = ("//*[@href='/prices/price_download_zip_file.zip?"
                          "type=history_all&market=bursa']")
        waiter.find_element(browser, all_data_xpath, XPATH).click()

        # This is a bit challenging: you need to wait until the download is
        # complete. This file is 220 MB, so it takes a while. This method waits
        # until there is at least one zip file in the dir, then waits until
        # there are no filenames that end in `.part`.
        # Note that this is problematic if there is already a file in the target
        # dir. I suggest looking into the tempfile module to create a unique,
        # temporary directory for downloading every time you run your script.
        print("Waiting for download to complete")
        at_least_1 = lambda x: len(x("{0}/*.zip*".format(DOWNLOAD_DIR))) > 0
        WebDriverWait(glob.glob, 300).until(at_least_1)

        no_parts = lambda x: len(x("{0}/*.part".format(DOWNLOAD_DIR))) == 0
        WebDriverWait(glob.glob, 300).until(no_parts)

        print("Download Done")

        # Now do whatever it is you need to do with the zip file, for example:
        # zip_path = glob.glob("{0}/*.zip".format(DOWNLOAD_DIR))[0]
        # zip_ref = zipfile.ZipFile(zip_path, 'r')
        # zip_ref.extractall(DOWNLOAD_DIR)
        # zip_ref.close()
        # os.remove(zip_path)

        print("Done!")


if __name__ == "__main__":
    main()
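
The download wait and the tempfile suggestion from the comments above can be sketched without a browser. This is only an illustration (Python 3): `wait_for_download` and `fake_download` are hypothetical names, and the fake download thread merely stands in for Firefox writing a `.part` file and renaming it when it finishes. In the real script you would pass the temporary directory to the `browser.download.dir` preference instead.

```python
import glob
import os
import tempfile
import threading
import time
import zipfile


def wait_for_download(directory, timeout=300.0, poll=0.5):
    """Return the path of the finished .zip file in `directory`.

    Waits until at least one .zip exists and no `.part` files remain.
    Hypothetical helper, for illustration only.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        zips = glob.glob(os.path.join(directory, "*.zip"))
        parts = glob.glob(os.path.join(directory, "*.part"))
        if zips and not parts:
            return zips[0]
        time.sleep(poll)
    raise TimeoutError("download did not finish within %.0f s" % timeout)


def fake_download(directory):
    # Stand-in for the browser: write a .part file, then rename it.
    part = os.path.join(directory, "data.zip.part")
    with zipfile.ZipFile(part, "w") as zf:
        zf.writestr("prices.csv", "date,close\n2017-05-19,1.23\n")
    time.sleep(0.3)
    os.rename(part, os.path.join(directory, "data.zip"))


# A fresh directory per run, so leftover files cannot confuse the wait.
download_dir = tempfile.mkdtemp(prefix="shKLSE_")
threading.Thread(target=fake_download, args=(download_dir,)).start()

zip_path = wait_for_download(download_dir, timeout=10.0, poll=0.05)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(download_dir)
os.remove(zip_path)  # remove the archive by its path, not the ZipFile object
extracted = sorted(os.listdir(download_dir))
print(extracted)
```

Note that the archive is opened and removed by its file path. Passing the download directory itself to `zipfile.ZipFile` would fail, since a directory is not a zip file.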

Full disclosure: I maintain the explicit module. It is designed to make explicit waits much easier to use in exactly this kind of situation, where a website slowly loads dynamic content in response to user interactions. You could replace all of the waiter.XXX calls above with direct explicit waits.
