Python - Download a file from aspx form
Question
I'm trying to automatically get some data from this site: http://www.casablanca-bourse.com/bourseweb/en/Negociation-History.aspx?Cat=24&IdLink=225
Using urllib2 in Python, I successfully got an HTML file, as if I had clicked the "submit" button on this website.
But when I simulate clicking the "download data" link, I get nothing as output.
My code is:
import urllib
import urllib2
uri = 'http://www.casablanca-bourse.com/bourseweb/en/Negociation-History.aspx?Cat=24&IdLink=225'
headers = {
'HTTP_USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
'HTTP_ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}
formFields = (
(r'TopControl1$ScriptManager1', r'HistoriqueNegociation1$UpdatePanel1|HistoriqueNegociation1$HistValeur1$LinkButton1'),
(r'__EVENTTARGET', r'HistoriqueNegociation1$HistValeur1$LinkButton1'),
(r'__EVENTARGUMENT', r''),
(r'__VIEWSTATE', r'/wEPDwUKMTcy/ ... +ZHYQBq1hB/BZ2BJyHdLM='), #just a small part because it's so long !
(r'TopControl1$TxtRecherche', r''),
(r'TopControl1$txtValeur', r''),
(r'HistoriqueNegociation1$HistValeur1$DDValeur', r'9000 '),
(r'HistoriqueNegociation1$HistValeur1$historique', r'RBSearchDate'),
(r'HistoriqueNegociation1$HistValeur1$DateTimeControl1$TBCalendar', r'22/12/2014'),
(r'HistoriqueNegociation1$HistValeur1$DateTimeControl2$TBCalendar', r'28/12/2014'),
(r'HistoriqueNegociation1$HistValeur1$DDuree', r'6'),
(r'hiddenInputToUpdateATBuffer_CommonToolkitScripts', r'1')
)
encodedFields = urllib.urlencode(formFields)
req = urllib2.Request(uri, encodedFields, headers)
f = urllib2.urlopen(req)
What should I do to get the same file as if I had clicked the "download data" link on the site?
Thanks
Recommended answer
First of all, I would suggest using the requests library instead of urllib. We also need BeautifulSoup for working with HTML tags:
pip install requests
pip install beautifulsoup4
Then the code will look like this:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
payload = {
r'TopControl1$ScriptManager1': r'HistoriqueNegociation1$UpdatePanel1|HistoriqueNegociation1$HistValeur1$LinkButton1',
r'__EVENTTARGET': r'HistoriqueNegociation1$HistValeur1$LinkButton1',
r'__EVENTARGUMENT': r'',
r'TopControl1$TxtRecherche': r'',
r'TopControl1$txtValeur': r'',
r'HistoriqueNegociation1$HistValeur1$DDValeur': r'9000 ',
r'HistoriqueNegociation1$HistValeur1$historique': r'RBSearchDate',
r'HistoriqueNegociation1$HistValeur1$DateTimeControl1$TBCalendar': r'22/12/2014',
r'HistoriqueNegociation1$HistValeur1$DateTimeControl2$TBCalendar': r'28/12/2014',
r'HistoriqueNegociation1$HistValeur1$DDuree': r'6',
r'hiddenInputToUpdateATBuffer_CommonToolkitScripts': r'1'
}
uri = 'http://www.casablanca-bourse.com/bourseweb/en/Negociation-History.aspx?Cat=24&IdLink=225'
r = session.get(uri)
#Find __VIEWSTATE value, there is only one input tag with type="hidden"
soup = BeautifulSoup(r.text, 'html.parser')
viewstate_tag = soup.find('input', attrs={"type" : "hidden"})
payload[viewstate_tag['name']] = viewstate_tag['value']
r = session.post(uri, payload)
print(r.text)  # contains an HTML table with the data
First, we get the original page, extract the __VIEWSTATE value, and use that value in the second request.
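The two-step flow described above (fetch the page, then copy its hidden state into the POST payload) can be tried offline. The following is only a minimal Python 3 sketch. It uses the standard-library html.parser module instead of BeautifulSoup so that it runs without extra installs. HiddenInputCollector, extract_hidden_fields, and SAMPLE_HTML are hypothetical names and data invented for this illustration and are not taken from the real page. It collects every hidden input rather than only the first one, on the assumption that an ASP.NET page may carry more than one (for example __EVENTVALIDATION); whether this particular page does is not stated in the answer.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects the name/value pair of every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenInputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


# Made-up stand-in for the text of the first GET response (assumption).
SAMPLE_HTML = """
<form method="post" action="Negociation-History.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKsample=" />
  <input type="hidden" name="__EVENTVALIDATION" value="/wEWsample" />
  <input type="text" name="TopControl1$TxtRecherche" value="" />
</form>
"""

# Start from the page's hidden state, then add the fields we control.
payload = extract_hidden_fields(SAMPLE_HTML)
payload.update({
    "__EVENTTARGET": "HistoriqueNegociation1$HistValeur1$LinkButton1",
    "__EVENTARGUMENT": "",
})

print(sorted(payload))
```

In the real script, the text of the first response would take the place of SAMPLE_HTML, and the resulting dictionary would be merged with the remaining form fields before the POST request.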