Python unable to retrieve form with urllib or mechanize
Problem description
I'm trying to fill out and submit a form using Python, but I'm not able to retrieve the resulting page. I've tried both mechanize and urllib/urllib2 methods to post the form, but both run into problems.
The form I'm trying to retrieve is here: http://zrs.leidenuniv.nl/ul/start.php. The page is in Dutch, but this is irrelevant to my problem. It may be noteworthy that the form action redirects to http://zrs.leidenuniv.nl/ul/query.php.
First of all, this is the urllib/urllib2 method I've tried:
import urllib, urllib2
import socket, cookielib
url = 'http://zrs.leidenuniv.nl/ul/start.php'
params = {'day': 1, 'month': 5, 'year': 2012, 'quickselect' : "unchecked",
'res_instantie': '_ALL_', 'selgebouw': '_ALL_', 'zrssort': "locatie",
'submit' : "Uitvoeren"}
http_header = { "User-Agent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11",
"Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language" : "nl-NL,nl;q=0.8,en-US;q=0.6,en;q=0.4" }
timeout = 15
socket.setdefaulttimeout(timeout)
request = urllib2.Request(url, urllib.urlencode(params), http_header)
response = urllib2.urlopen(request)
cookies = cookielib.CookieJar()
cookies.extract_cookies(response, request)
cookie_handler = urllib2.HTTPCookieProcessor(cookies)
redirect_handler = urllib2.HTTPRedirectHandler()
opener = urllib2.build_opener(redirect_handler, cookie_handler)
response = opener.open(request)
html = response.read()
However, when I try to print the retrieved HTML, I get the original page, not the one the form action refers to. Any hints as to why this doesn't submit the form would be greatly appreciated.
Because the above didn't work, I also tried to use mechanize to submit the form. However, this results in a ParseError with the following code:
import mechanize
url = 'http://zrs.leidenuniv.nl/ul/start.php'
br = mechanize.Browser()
response = br.open(url)
br.select_form(nr = 0)
where the last line exits with the following: "ParseError: unexpected '-' char in declaration". Now I realize that this error may indicate an error in the DOCTYPE declaration, but since I can't edit the form page I'm not able to try different declarations. Any help on this error is also greatly appreciated.
Thanks in advance for your help.
Solution
It's because the DOCTYPE part of the page is malformed. It also contains some strange tags, such as:
<!Co Dreef / Eelco de Graaff Faculteit der Rechtsgeleerdheid Universiteit Leiden><!e-mail j.dreef@law.leidenuniv.nl >
Try validating the page yourself...
Nonetheless, you can just strip off the junk to make mechanize's HTML parser happy:
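import mechanize
url = 'http://zrs.leidenuniv.nl/ul/start.php'
br = mechanize.Browser()
response = br.open(url)
# Cut off the first 177 characters of the body (the malformed declarations at the top of this page)
response.set_data(response.get_data()[177:])
# Hand the cleaned response back to the browser so the form parser sees the stripped HTML
br.set_response(response)
br.select_form(nr=0)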
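The offset 177 is tied to the exact bytes of that page, so it stops working if the header ever changes. A slightly more robust variant is sketched here, under the assumption that everything before the opening <html> tag can safely be discarded. It searches for that tag instead of counting characters. The helper name strip_leading_junk and the sample markup are made up for illustration and are not part of mechanize; the sketch also assumes get_data() returns a byte string.

```python
import re

# Matches the start of the opening <html ...> tag, case-insensitively, on bytes.
_HTML_START = re.compile(br"<html\b", re.IGNORECASE)


def strip_leading_junk(data):
    """Return data from the first <html tag onward.

    If no <html tag is found, the input is returned unchanged.
    """
    match = _HTML_START.search(data)
    if match is None:
        return data
    return data[match.start():]


# Illustrative sample only: a DOCTYPE followed by bogus "<!...>" declarations
# similar in shape to the ones quoted above.
sample = (
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">\n'
    b'<!Some Author / Some Faculty>\n'
    b'<!e-mail someone@example.org>\n'
    b'<HTML><body><form action="query.php"></form></body></html>'
)
cleaned = strip_leading_junk(sample)
```

With mechanize, the slicing call would then become response.set_data(strip_leading_junk(response.get_data())). Note that this also drops the DOCTYPE line itself, which should not matter for form parsing.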