Python: Unable to download with selenium in webpage
Problem description
My purpose is to download a zip file from https://www.shareinvestor.com/prices/price_download_zip_file.zip?type=history_all&market=bursa
It is a link in this webpage https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa. Then save it into the directory "/home/vinvin/shKLSE/"
(I am using PythonAnywhere). Then unzip it and extract the csv file into that directory.
The code runs to the end with no error, but nothing is downloaded. The zip file downloads automatically when https://www.shareinvestor.com/prices/price_download_zip_file.zip?type=history_all&market=bursa is clicked manually.
My code below uses a working username and password; the real credentials are included so that it is easier to understand the problem.
#!/usr/bin/python
print "hello from python 2"
import urllib2
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from pyvirtualdisplay import Display
import requests, zipfile, os
display = Display(visible=0, size=(800, 600))
display.start()
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2)
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', "/home/vinvin/shKLSE/")
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', '/zip')
for retry in range(5):
    try:
        browser = webdriver.Firefox(profile)
        print "firefox"
        break
    except:
        time.sleep(3)
time.sleep(1)
browser.get("https://www.shareinvestor.com/my")
time.sleep(10)
login_main = browser.find_element_by_xpath("//*[@href='/user/login.html']").click()
print browser.current_url
username = browser.find_element_by_id("sic_login_header_username")
password = browser.find_element_by_id("sic_login_header_password")
print "find id done"
username.send_keys("bkcollection")
password.send_keys("123456")
print "log in done"
login_attempt = browser.find_element_by_xpath("//*[@type='submit']")
login_attempt.submit()
browser.get("https://www.shareinvestor.com/prices/price_download.html#/?type=price_download_all_stocks_bursa")
print browser.current_url
time.sleep(20)
dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_all&market=bursa']").click()
time.sleep(30)
browser.close()
browser.quit()
display.stop()
zip_ref = zipfile.ZipFile("/home/vinvin/sh/KLSE", 'r')
zip_ref.extractall("/home/vinvin/sh/KLSE")
zip_ref.close()
os.remove(zip_ref)
HTML snippet:
<li><a href="/prices/price_download_zip_file.zip?type=history_all&market=bursa">All Historical Data</a> <span>About 220 MB</span></li>
Note that &amp; is shown when I copy the snippet. It was hidden from view-source, so I guess it is written in JavaScript.
Observations I found:
The directory /home/vinvin/shKLSE/ is not created even though I run the code with no error.
I tried to download a much smaller zip file, which can be completed in a second, but it still does not download after a 30 s wait:
dl = browser.find_element_by_xpath("//*[@href='/prices/price_download_zip_file.zip?type=history_daily&date=20170519&market=bursa']").click()
I rewrote your script, with comments explaining why I made the changes I made. I think your main problem might have been a bad mimetype; however, your script had a lot of systemic issues that would have made it unreliable at best. This rewrite uses explicit waits, which completely removes the need for time.sleep(), allowing it to run as fast as possible while also eliminating errors that arise from network congestion.
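Conceptually, an explicit wait is just a poll loop: re-check a condition until it becomes truthy or a timeout expires. A minimal, browser-free sketch of the idea (the helper name is mine, not selenium's actual implementation):

```python
import time

def wait_until(condition, timeout=10, poll=0.5):
    # Re-evaluate `condition` until it returns a truthy value, raising if
    # `timeout` seconds pass first. This is the core idea behind
    # selenium's WebDriverWait(...).until(...).
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() > deadline:
            raise RuntimeError("condition not met within {0}s".format(timeout))
        time.sleep(poll)

# Example: a condition that only becomes true on the third check
state = {"calls": 0}
def ready():
    state["calls"] += 1
    return state["calls"] >= 3

print(wait_until(ready, timeout=5, poll=0.01))  # -> True
```

Unlike a fixed time.sleep(), the loop returns the moment the condition holds, so the script never waits longer than necessary.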
You will need to do the following to make sure all modules are installed:
pip install requests explicit selenium retry pyvirtualdisplay
The script:
#!/usr/bin/python
from __future__ import print_function  # Makes your code portable
import os
import glob
import zipfile
from contextlib import contextmanager

import requests
from retry import retry
from explicit import waiter, XPATH, ID
from selenium import webdriver
from pyvirtualdisplay import Display
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait

DOWNLOAD_DIR = "/tmp/shKLSE/"


def build_profile():
    profile = webdriver.FirefoxProfile()
    profile.set_preference('browser.download.folderList', 2)
    profile.set_preference('browser.download.manager.showWhenStarting', False)
    profile.set_preference('browser.download.dir', DOWNLOAD_DIR)
    # I think your `/zip` mime type was incorrect. This works for me
    profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
                           'application/vnd.ms-excel,application/zip')
    return profile


# Retry is an elegant way to retry the browser creation
# Though you should narrow the scope to whatever the actual exception is you are
# retrying on
@retry(Exception, tries=5, delay=3)
@contextmanager  # This turns get_browser into a context manager
def get_browser():
    # Use a context manager with Display, so it will be closed even if an
    # exception is thrown
    profile = build_profile()
    with Display(visible=0, size=(800, 600)):
        browser = webdriver.Firefox(profile)
        print("firefox")
        try:
            yield browser
        finally:
            # Let a try/finally block manage closing the browser, even if an
            # exception is raised
            browser.quit()


def main():
    print("hello from python 2")
    with get_browser() as browser:
        browser.get("https://www.shareinvestor.com/my")

        # Click the login button
        # waiter is a helper function that makes it easy to use explicit waits
        # with it you don't need to use time.sleep() calls at all
        login_xpath = '//*/div[@class="sic_logIn-bg"]/a'
        waiter.find_element(browser, login_xpath, XPATH).click()
        print(browser.current_url)

        # Log in
        username = "bkcollection"
        username_id = "sic_login_header_username"
        password = "123456"
        password_id = "sic_login_header_password"
        waiter.find_write(browser, username_id, username, by=ID)
        waiter.find_write(browser, password_id, password, by=ID, send_enter=True)

        # Wait for the login process to finish by locating an element only found
        # after logging in, like the Logged In Nav
        nav_id = 'sic_loggedInNav'
        waiter.find_element(browser, nav_id, ID)
        print("log in done")

        # Load the target page
        target_url = ("https://www.shareinvestor.com/prices/price_download.html#/?"
                      "type=price_download_all_stocks_bursa")
        browser.get(target_url)
        print(browser.current_url)

        # Click the download button
        all_data_xpath = ("//*[@href='/prices/price_download_zip_file.zip?"
                          "type=history_all&market=bursa']")
        waiter.find_element(browser, all_data_xpath, XPATH).click()

        # This is a bit challenging: you need to wait until the download is
        # complete. This file is 220 MB, so it takes a while. This method waits
        # until there is at least one file in the dir, then waits until there
        # are no filenames that end in `.part`.
        # Note that this is problematic if there is already a file in the target
        # dir. I suggest looking into using the tempfile module to create a
        # unique, temporary directory for downloading every time you run your script
        print("Waiting for download to complete")
        at_least_1 = lambda x: len(x("{0}/*.zip*".format(DOWNLOAD_DIR))) > 0
        WebDriverWait(glob.glob, 300).until(at_least_1)

        no_parts = lambda x: len(x("{0}/*.part".format(DOWNLOAD_DIR))) == 0
        WebDriverWait(glob.glob, 300).until(no_parts)
        print("Download Done")

        # Now do whatever it is you need to do with the zip file
        # zip_ref = zipfile.ZipFile(DOWNLOAD_DIR, 'r')
        # zip_ref.extractall(DOWNLOAD_DIR)
        # zip_ref.close()
        # os.remove(zip_ref)

        print("Done!")


if __name__ == "__main__":
    main()
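The download-completion check above assumes DOWNLOAD_DIR starts out empty. As the comment in the script suggests, a unique per-run directory avoids that; a sketch using the standard tempfile module (the function names here are mine, not from the script):

```python
import glob
import os
import tempfile
import time

def make_download_dir():
    # A fresh, uniquely named directory for each run of the script
    return tempfile.mkdtemp(prefix="shKLSE_")

def download_finished(download_dir):
    # True once at least one .zip exists and no Firefox .part files remain,
    # mirroring the two glob checks in the script above
    zips = glob.glob(os.path.join(download_dir, "*.zip*"))
    parts = glob.glob(os.path.join(download_dir, "*.part"))
    return len(zips) > 0 and len(parts) == 0

def wait_for_download(download_dir, timeout=300, poll=1.0):
    # Poll until the download looks complete or the timeout expires
    deadline = time.time() + timeout
    while time.time() < deadline:
        if download_finished(download_dir):
            return True
        time.sleep(poll)
    raise RuntimeError("download did not finish within {0}s".format(timeout))
```

The browser profile would then point browser.download.dir at make_download_dir()'s result, so leftover files from earlier runs can never satisfy the completion check.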
Full disclosure: I maintain the explicit module. It is designed to make using explicit waits much easier, for exactly this kind of situation, where a website slowly loads dynamic content based on user interactions. You could replace all of the waiter.XXX calls above with direct explicit waits.