Python Selenium PhantomJS - Extract download link of file that is being downloaded

Question

So, as the title suggests, I am trying to get the direct link of a file that is being downloaded, using PhantomJS through Selenium in Python 3.7.

The site I am working with is emuparadise.me. I am downloading a ROM file by requesting the get-download.php link shown in the code below, after adding a cookie to avoid an "Invalid Referer" error. When the request is made, browser.current_url shows about:blank, and I know the file has started downloading because I can see PhantomJS's network usage. After browsing the internet for over 3 hours, I still haven't found any way of retrieving the URL of the file being downloaded.

One idea I had was to create a thread that tracks changes to browser.current_url, but it seems that browser locks up while it is making the request.

Here is my current code:

from selenium import webdriver


browser = webdriver.PhantomJS()
browser.add_cookie({'name': 'refexception', 'value': 1, 'domain': '.emuparadise.me', 'path': '/'})
browser.get("https://www.emuparadise.me/roms/get-download.php?gid=154652&test=true")

Note that I don't care at all about downloading the file, nor do I know or need to know where it is being saved. I found the actual link for that specific example file using Firefox, in case you need it for testing. I would also much prefer PhantomJS over the Firefox or Chrome web drivers for such a simple-looking task. Any help would be highly appreciated.

Recommended answer

So I finally came up with a solution. Since I knew the download URL had to be somewhere in the headers of my request, I looked for a way to view them in PhantomJS. It turned out to be pretty easy: all I did was change the log level from INFO (the default) to DEBUG, and the headers appeared in the log file under the events page.onResourceRequested and page.onResourceReceived. After making the request, I parse the log file looking for the latter event and pull out the URL. Here's the complete code:

from selenium import webdriver
from json import loads


def get_direct_url_for_game(url):
    # DEBUG logging makes the resource events show up in ghostdriver.log
    browser = webdriver.PhantomJS(service_args=["--webdriver-loglevel=DEBUG"])
    browser.add_cookie({'name': 'refexception', 'value': 1, 'domain': '.emuparadise.me', 'path': '/'})
    browser.get(url)

    direct_download_url = None
    with open('ghostdriver.log') as logs:
        for line in logs:
            # Event lines split into four " - " separated fields;
            # the last two are the event name and its JSON payload
            parts = line.split(" - ", 3)
            if len(parts) != 4:
                continue  # not an event line
            _, _, event, event_data = parts
            if event == "page.onResourceReceived":
                event_data = loads(event_data)
                if event_data.get('contentType') == "application/octet-stream":
                    direct_download_url = event_data['url']
                    break  # stop at the first matching resource
    browser.quit()
    return direct_download_url


print(get_direct_url_for_game("https://www.emuparadise.me/roms/get-download.php?gid=154652&test=true"))
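
To try the log-parsing step on its own, without PhantomJS or the site, something like the sketch below may help. The sample log text is made up: it only imitates the four " - " separated fields that the code above assumes, and the function name find_download_url, the example.com URLs and the statusText field are all hypothetical.

```python
import json

# Hypothetical log excerpt; real ghostdriver.log content may differ in detail.
SAMPLE_LOG = """\
PhantomJS is launching GhostDriver...
[DEBUG - 2019-01-01T00:00:00.000Z] Session [abc] - page.onResourceRequested - {"url":"https://example.com/get-download.php"}
[DEBUG - 2019-01-01T00:00:01.000Z] Session [abc] - page.onResourceReceived - {"contentType":"text/html","url":"https://example.com/get-download.php"}
[DEBUG - 2019-01-01T00:00:02.000Z] Session [abc] - page.onResourceReceived - {"contentType":"application/octet-stream","statusText":"200 - OK","url":"https://example.com/files/game.zip"}
"""


def find_download_url(log_lines, content_type="application/octet-stream"):
    """Return the URL of the first received resource with the given content type."""
    for line in log_lines:
        # maxsplit=3 keeps any " - " inside the JSON payload in one piece
        parts = line.rstrip("\n").split(" - ", 3)
        if len(parts) != 4 or parts[2] != "page.onResourceReceived":
            continue
        try:
            data = json.loads(parts[3])
        except ValueError:
            continue  # payload was not valid JSON
        if isinstance(data, dict) and data.get("contentType") == content_type:
            return data.get("url")
    return None


print(find_download_url(SAMPLE_LOG.splitlines()))
```

Limiting the split to three separators and skipping lines that do not have four fields keeps the parser from crashing on startup messages or on payloads that happen to contain " - " themselves.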


I actually found a much simpler and more elegant way of achieving exactly the same thing, using requests' head function. It makes a request for only the HTTP headers of the URL, hence the name, and we still pass in the same cookie. We allow redirects, since following them is exactly what we want, and the final URL is available as the url attribute of the response.

Here is what it looks like:

from requests import head


def get_direct_url_for_game(url):
    # HEAD fetches only the headers; following the redirects ends at the direct file URL
    response = head(url, allow_redirects=True, cookies={'refexception': '1'})
    return response.url


print(get_direct_url_for_game("https://www.emuparadise.me/roms/get-download.php?gid=154652&test=true"))
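
The same idea can be checked offline. The sketch below is not the requests code above: it swaps in the standard library's urllib.request so that it runs with no extra packages, and it talks to a throwaway local server instead of the real site. The paths /get-download and /files/game.zip, and the names DemoHandler, resolve_final_url and demo, are invented for the demo.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: /get-download redirects to /files/game.zip (both made up)."""

    def _respond(self):
        if self.path.startswith("/get-download"):
            self.send_response(302)
            self.send_header("Location", "/files/game.zip")
        else:
            self.send_response(200)
            self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", "0")
        self.end_headers()

    # Depending on the Python version, urllib may repeat the redirected
    # request as HEAD or as GET, so both verbs get the same header-only reply.
    do_HEAD = _respond
    do_GET = _respond

    def log_message(self, *args):
        pass  # keep the demo output quiet


def resolve_final_url(url):
    """Send a HEAD request, follow redirects, and return the final URL."""
    # An empty ProxyHandler ignores any proxy settings in the environment
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, method="HEAD")
    with opener.open(request, timeout=5) as response:
        return response.geturl()


def demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        base = "http://127.0.0.1:%d" % server.server_port
        return base, resolve_final_url(base + "/get-download?gid=1")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo()[1])
```

Only headers travel over the wire in this demo, which is the reason a HEAD request is attractive when the file itself is not needed.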
