Python script to download file from button on website

Problem description

I want to download an xls file by clicking the "Export to excel" button at the following URL: https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD.

More specifically, the button is: name="ctl00$MainContent$btndata". I've already been able to do this using selenium, but I plan on building a docker image with this script and running it as a docker container, because this xls is regularly updated and I need the most current data on my local machine, and it doesn't make sense to have a browser open that often just to fetch this data. I understand there are headless versions of chrome and firefox, although I don't believe they support downloads. I also understand that a plain web GET (such as wget) will not work in this situation, because the button is not a static link to the resource. Maybe there's a completely different approach for downloading and updating this data on my computer?

import urllib
import requests
from bs4 import BeautifulSoup

headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=.08',
    'Origin': 'https://www.tampagov.net',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD',
    'Accept-Encoding': 'gzip,deflate,br',
    'Accept-Language': 'en-US,en;q=0.5',
}

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f, "html.parser")
# parse and retrieve two vital form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']

formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Encoding', 'gzip, deflate, br'),
    ('Accept-Language', 'en-US,en;q=0.5'),
    ('Host', 'apps,tampagov.net'),
    ('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'))



payload = urllib.urlencode(formData)
# second HTTP request with form data
r = requests.post("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", params=payload)
print(r.status_code, r.reason)

Recommended answer

First: I removed import urllib, because requests alone is enough.

You have a few issues:

  1. You don't need to build a nested tuple and then apply urllib.urlencode; pass a dictionary instead. requests encodes it for you, which is one reason the library is so popular.

  2. You should populate all of the form parameters in the HTTP POST request, as I did below; otherwise the backend may reject the request.

  3. I added a few lines of code to save the content to a local file.
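To make the first point concrete, here is a minimal sketch of how requests treats a dictionary passed as data= versus params=. The question's code used params=, which appends the pairs to the URL instead of sending them as the form body. The URL and the field values below are placeholders for illustration, and nothing is sent over the network because the requests are only prepared.

```python
import requests

# Placeholder URL and values (assumptions for illustration); no request is sent.
URL = "https://example.invalid/Default.aspx?type=TPD"
form = {
    "__VIEWSTATE": "abc",
    "ctl00$MainContent$btndata": "Export to Excel",
}

# data= form-encodes the dictionary into the request body.
as_body = requests.Request("POST", URL, data=form).prepare()

# params= appends the same pairs to the query string and leaves the body empty.
as_query = requests.Request("POST", URL, params=form).prepare()

print(as_body.headers["Content-Type"])
print(as_body.body)
print(as_query.body)
print(as_query.url)
```

An ASP.NET postback expects these fields in the body, so data= is the argument that matches what the browser does when the button is clicked.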

PS: You can get the values of these form parameters by analyzing the HTML returned by the HTTP GET. You can also customize the parameters as needed, such as the page size.
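As a rough illustration of reading the parameters out of the GET response, the sketch below collects the name and value of every input element in a page. It uses only the standard library's html.parser so it runs without extra packages; the same idea works with BeautifulSoup. The class and function names are hypothetical, and the sample HTML and its values are made up rather than taken from the real page.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of every <input> element in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            # Inputs without a value attribute are recorded as empty strings.
            self.fields[name] = attr.get("value") or ""


def collect_form_fields(html):
    """Return a dict of input names to values found in the given HTML text."""
    parser = FormFieldCollector()
    parser.feed(html)
    return parser.fields


# Made-up sample standing in for the body of the GET response.
SAMPLE_HTML = """
<form method="post" action="./Default.aspx?type=TPD">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAgK" />
  <input type="submit" name="ctl00$MainContent$btndata" value="Export to Excel" />
</form>
"""

fields = collect_form_fields(SAMPLE_HTML)
print(fields)
```

On a real page this would also pick up every submit button, so in practice you would keep only the button you intend to "click" before posting the dictionary.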

Below is a working sample:

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

def downloadExcel():
    headers = {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Origin': 'https://www.tampagov.net',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Referer': 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD',
        'Accept-Encoding': 'gzip,deflate,br',
        'Accept-Language': 'en-US,en;q=0.5',
    }

    # first HTTP request without form data
    r = requests.get("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", headers=headers)
    if r.status_code != 200:
        print('Error')
        return
    # parse and retrieve two vital form values
    soup = BeautifulSoup(r.content, "html.parser")
    viewstate = soup.select("#__VIEWSTATE")[0]['value']
    eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
    print ('__VIEWSTATE:', viewstate)
    print ('__EVENTVALIDATION:', eventvalidation)
    formData = {
        '__EVENTVALIDATION': eventvalidation,
        '__VIEWSTATE': viewstate,
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__VIEWSTATEGENERATOR': '49DF2C80',
        'MainContent_RadScriptManager1_TSM':""";;System.Web.Extensions, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35:en-US:59e0a739-153b-40bd-883f-4e212fc43305:ea597d4b:b25378d2;Telerik.Web.UI, Version=2015.2.826.40, Culture=neutral, PublicKeyToken=121fae78165ba3d4:en-US:c2ba43dc-851e-4009-beab-3032480b6a4b:16e4e7cd:f7645509:24ee1bba:c128760b:874f8ea2:19620875:4877f69a:f46195d3:92fe8ea0:fa31b949:490a9d4e:bd8f85e4:58366029:ed16cbdc:2003d0b8:88144a7a:1e771326:aa288e2d:b092aa46:7c926187:8674cba1:ef347303:2e42e72a:b7778d6c:c08e9f8a:e330518b:c8618e41:e4f8f289:1a73651d:16d8629e:59462f1:a51ee93e""",
        'search_block_form':'',
        'ctl00$MainContent$btndata':'Export to Excel',
        'ctl00_MainContent_RadWindow1_C_RadGridVehicles_ClientState':'',
        'ctl00_MainContent_RadWindow1_ClientState':'',
        'ctl00_MainContent_RadWindowManager1_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl00$PageSizeComboBox':'20',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl00_PageSizeComboBox_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RDIPFdispatch_time':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RDIPFdispatch_time$dateInput':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RDIPFdispatch_time_dateInput_ClientState':'{"enabled":true,"emptyMessage":"","validationText":"","valueAsString":"","minDateStr":"1900-01-01-00-00-00","maxDateStr":"2099-12-31-00-00-00","lastSetTextBoxValue":""}',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RDIPFdispatch_time_ClientState':'{"minDateStr":"1900-01-01-00-00-00","maxDateStr":"2099-12-31-00-00-00"}',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1address':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1address_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1case_description':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1case_description_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_grid':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1report_number':'',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1report_number_ClientState':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_out_max_date':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_out_rowcount':'',
        'ctl00$MainContent$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox':'20',
        'ctl00_MainContent_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState':'',
        'ctl00_MainContent_RadGrid1_rfltMenu_ClientState':'',
        'ctl00_MainContent_RadGrid1_gdtcSharedTimeView_ClientState':'',
        'ctl00_MainContent_RadGrid1_gdtcSharedCalendar_SD':'[]',
        'ctl00_MainContent_RadGrid1_gdtcSharedCalendar_AD':'[[1900,1,1],[2099,12,31],[2018,3,29]]',
        'ctl00_MainContent_RadGrid1_ClientState':'',
        }

    # second HTTP request with form data
    r = requests.post("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", data=formData, headers=headers)
    print('received:', r.status_code, len(r.content))
    with open(r"C:\Users\xxx\Desktop\test\test\apps.xls", "wb") as handle:
        # write in 8 KiB chunks; iter_content() defaults to one byte per chunk
        for data in tqdm(r.iter_content(chunk_size=8192)):
            handle.write(data)

downloadExcel()
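Since the question mentions running the script repeatedly inside a Docker container, a hard-coded Windows path may not be the best fit there. The following is a minimal sketch, not part of the original answer, of writing the downloaded bytes to a configurable location by way of a temporary file, so that a reader of the file never sees a half-written xls. The function name save_atomically and the file name are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def save_atomically(content: bytes, target: Path) -> Path:
    """Write content to target via a temporary file in the same directory."""
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(content)
        # os.replace swaps the file in one step on the same filesystem.
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the partial file if anything went wrong.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return target


# Demonstration with placeholder bytes in a throwaway directory.
demo_dir = Path(tempfile.mkdtemp())
saved = save_atomically(b"example bytes", demo_dir / "calls_for_service.xls")
print(saved, saved.stat().st_size)
```

In the sample above, this would take the place of the open(...) block, called as save_atomically(r.content, some_path), with some_path pointing at a directory mounted into the container.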
