How to download a file with Python, Selenium and PhantomJS

Question

Here is my situation: I have to log in to a website and download a CSV from there, headless, from a Linux server. The page uses JS and does not work without it.

After some research I went with Selenium and PhantomJS. Logging in, setting the parameters for the CSV and finding the download button with Selenium/PhantomJS/Py3 was no problem, and actually surprisingly enjoyable.

But clicking the download button did not do anything. After some research I found out that PhantomJS does not seem to support download dialogs or downloads, although the feature is on its list of upcoming features.

So I tried a workaround with urllib, after finding out that the download button just calls a REST API URL. The problem is that it only works if you are logged in to the site. So the first attempt failed, returning b'{"success":false,"session":"expired"}', which makes sense as I expect Selenium and urllib to use different sessions. So I thought I would use the headers from Selenium's driver in urllib, trying this:

...
url = 'http://www.foo.com/api/index'
data = urllib.parse.urlencode({
        'foopara': 'cadbrabar',
    }).encode('utf-8')
headers = {}
for cookie in driver.get_cookies():
    headers[cookie['name']] = cookie['value']
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    page = response.read()
driver.close()

Unfortunately this yielded the same result of an expired session. Am I doing something wrong, is there a way around this, are there other suggestions, or am I at a dead end? Thanks in advance.
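
A likely cause: the loop above sends each cookie as a separate HTTP header named after the cookie, whereas a server reads cookies from a single Cookie header. Below is a minimal sketch of that idea, assuming the session is tracked purely by cookies. The helper name build_request and the cookie names and values are hypothetical stand-ins for what driver.get_cookies() would return.

```python
import urllib.parse
import urllib.request


def build_request(url, params, cookies):
    """Build a POST request that carries all cookies in one Cookie header.

    `cookies` is a list of dicts with 'name' and 'value' keys, the shape
    returned by Selenium's driver.get_cookies().
    """
    cookie_header = '; '.join(
        '{}={}'.format(c['name'], c['value']) for c in cookies
    )
    data = urllib.parse.urlencode(params).encode('utf-8')
    return urllib.request.Request(url, data, {'Cookie': cookie_header})


# Hypothetical cookies; in the real script use driver.get_cookies().
example_cookies = [
    {'name': 'sessionid', 'value': 'abc123'},
    {'name': 'csrftoken', 'value': 'xyz'},
]
req = build_request('http://www.foo.com/api/index',
                    {'foopara': 'cadbrabar'}, example_cookies)
print(req.get_header('Cookie'))
```

Even with the cookies in place, a site may also check the User-Agent or a CSRF token, so this may not be sufficient on its own.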

Recommended answer

I found a solution and wanted to share it. One requirement changed: I am no longer using PhantomJS but chromedriver, which runs headlessly with a virtual framebuffer. Same result, and it gets the job done.

What you need:

pip install selenium pyvirtualdisplay

apt-get install xvfb

ChromeDriver

I use Py3.5 and a test file from ovh.net with an <a> tag instead of a button. The script waits for the <a> to be present on the page, then clicks it. If you don't wait for the element and are on an async site, the element you try to click might not be there yet. The download location is a folder relative to the script's location. The script checks that directory once per second to see whether the file has been downloaded. If I am not wrong, the file has a temporary extension during download (.part; Chrome itself normally uses .crdownload), and as soon as it becomes the .dat specified in filename the script finishes. If you close the virtual framebuffer and the driver before that, the download will not complete. The complete script looks like this:

#!/usr/bin/env python3
# coding: utf-8

import os
import sys
import time
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import glob


def main(argv):
    url = 'http://ovh.net/files'
    dl_dir = 'downloads'
    filename = '1Mio.dat'

    display = Display(visible=0, size=(800, 600))
    display.start()

    chrome_options = webdriver.ChromeOptions()
    dl_location = os.path.join(os.getcwd(), dl_dir)

    prefs = {"download.default_directory": dl_location}
    chrome_options.add_experimental_option("prefs", prefs)
    chromedriver = "./chromedriver"
    # Selenium 2/3 style arguments; Selenium 4 expects service=Service(...) and options=... instead.
    driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chrome_options)

    driver.set_window_size(800, 600)
    driver.get(url)
    # Wait until the link exists; until() returns the located element.
    xpath = '//a[@href="' + filename + '"]'
    hyperlink = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, xpath)))

    hyperlink.click()

    while not(glob.glob(os.path.join(dl_location, filename))):
        time.sleep(1)

    # quit() closes all windows and also shuts down the chromedriver process.
    driver.quit()
    display.stop()

if __name__ == '__main__':
    main(sys.argv)
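
The polling loop in the script above never gives up if the download fails, and it only looks for the final file name. Below is a minimal sketch of a bounded wait, assuming partial downloads show up with a .crdownload or .part suffix. The helper name wait_for_download and its parameters are illustrative and not part of the original script.

```python
import os
import time


def wait_for_download(directory, filename, timeout=60.0, poll=1.0):
    """Return True once `filename` exists in `directory` and no partial
    download files remain, or False if `timeout` seconds pass first."""
    target = os.path.join(directory, filename)
    # Assumed temporary suffixes used by browsers while downloading.
    partial_suffixes = ('.crdownload', '.part')
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory) if os.path.isdir(directory) else []
        still_downloading = any(n.endswith(partial_suffixes) for n in names)
        if os.path.isfile(target) and not still_downloading:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

In the script it could replace the while loop, for example by calling wait_for_download(dl_location, filename, timeout=120) and raising an error if it returns False, before the driver and display are shut down.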



I hope this helps someone in the future.
