Python unable to retrieve form with urllib or mechanize


Problem description

I'm trying to fill out and submit a form using Python, but I'm not able to retrieve the resulting page. I've tried both mechanize and urllib/urllib2 methods to post the form, but both run into problems.

The form I'm trying to retrieve is here: http://zrs.leidenuniv.nl/ul/start.php. The page is in Dutch, but this is irrelevant to my problem. It may be noteworthy that the form action redirects to http://zrs.leidenuniv.nl/ul/query.php.

First of all, this is the urllib/urllib2 method I've tried:

import urllib, urllib2
import socket, cookielib

url = 'http://zrs.leidenuniv.nl/ul/start.php'
params = {'day': 1, 'month': 5, 'year': 2012, 'quickselect' : "unchecked",
          'res_instantie': '_ALL_', 'selgebouw': '_ALL_', 'zrssort': "locatie",
          'submit' : "Uitvoeren"}
http_header = {  "User-Agent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11",
                 "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                 "Accept-Language" : "nl-NL,nl;q=0.8,en-US;q=0.6,en;q=0.4" }

timeout = 15
socket.setdefaulttimeout(timeout)

request = urllib2.Request(url, urllib.urlencode(params), http_header)
response = urllib2.urlopen(request)

cookies = cookielib.CookieJar()
cookies.extract_cookies(response, request)
cookie_handler = urllib2.HTTPCookieProcessor(cookies)
redirect_handler = urllib2.HTTPRedirectHandler()

opener = urllib2.build_opener(redirect_handler, cookie_handler)

response = opener.open(request)
html = response.read()

However, when I print the retrieved HTML I get the original page, not the one the form action refers to. Any hints as to why this doesn't submit the form would be greatly appreciated.
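
One possible cause, offered as a guess rather than a confirmed diagnosis: the code above posts to start.php, while the form's action (as stated earlier) is query.php. The sketch below builds the POST against the action URL and creates the cookie-aware opener before any request is sent. It uses the Python 3 module names (urllib.request, http.cookiejar) instead of urllib2 and cookielib; the helper names build_form_request and build_cookie_opener are hypothetical, and the shortened User-Agent is only a placeholder. The network calls are commented out so the sketch runs offline.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Assumption: the form's action URL, as stated in the question.
ACTION_URL = "http://zrs.leidenuniv.nl/ul/query.php"


def build_form_request(action_url, params, headers=None):
    """Build a POST request carrying URL-encoded form fields (hypothetical helper)."""
    body = urllib.parse.urlencode(params).encode("ascii")
    # Passing data makes urllib send the request as a POST.
    return urllib.request.Request(action_url, data=body, headers=headers or {})


def build_cookie_opener():
    """Create an opener that stores cookies and resends them (hypothetical helper)."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


# Same form fields as in the question.
params = {"day": 1, "month": 5, "year": 2012, "quickselect": "unchecked",
          "res_instantie": "_ALL_", "selgebouw": "_ALL_", "zrssort": "locatie",
          "submit": "Uitvoeren"}
request = build_form_request(ACTION_URL, params, {"User-Agent": "Mozilla/5.0"})
opener, jar = build_cookie_opener()

# Network calls, left commented out so the sketch runs without a connection:
# opener.open("http://zrs.leidenuniv.nl/ul/start.php", timeout=15)  # collect cookies
# html = opener.open(request, timeout=15).read()                    # submit the form
```

Whether the site also needs the cookies from start.php is not known from the question; visiting that page first with the same opener is a cheap way to cover that case.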

Because the above didn't work, I also tried to use mechanize to submit the form. However, this results in a ParseError with the following code:

import mechanize

url = 'http://zrs.leidenuniv.nl/ul/start.php'
br = mechanize.Browser()
response = br.open(url)
br.select_form(nr = 0)

where the last line exits with the following: "ParseError: unexpected '-' char in declaration". Now I realize that this error may indicate an error in the DOCTYPE declaration, but since I can't edit the form page I'm not able to try different declarations. Any help on this error is also greatly appreciated.

Thanks in advance for your help.

Solution

It's because the DOCTYPE part is malformed.

It also contains some strange tags, like:

<!Co Dreef / Eelco de Graaff Faculteit der Rechtsgeleerdheid Universiteit Leiden><!e-mail j.dreef@law.leidenuniv.nl >

Try validating the page yourself...


Nonetheless, you can just strip off the junk to make mechanize's HTML parser happy:

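import mechanize

url = 'http://zrs.leidenuniv.nl/ul/start.php'

br = mechanize.Browser()
response = br.open(url)
# Drop the first 177 characters, i.e. the junk at the top of this particular page
response.set_data(response.get_data()[177:])
# Hand the cleaned response back to the browser so the form parser sees it
br.set_response(response)

br.select_form(nr=0)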
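
The fixed offset of 177 only works while the page's header stays exactly the same. As a rough alternative, the stray declarations could be removed by pattern. The sketch below is an assumption, not part of the original answer: the function name strip_bogus_declarations, the regular expressions, and the sample markup are all invented for illustration. It removes every `<!...>` declaration that is not a comment or a `<![...` section, and keeps the DOCTYPE unless told otherwise.

```python
import re

# Matches "<!...>" declarations, but not comments ("<!--") and not
# CDATA/conditional sections ("<![").
_ANY_DECLARATION = re.compile(r"<!(?!--|\[)[^>]*>")
_DOCTYPE = re.compile(r"<!DOCTYPE\b", re.IGNORECASE)


def strip_bogus_declarations(html, keep_doctype=True):
    """Remove stray <!...> declarations; optionally keep the DOCTYPE."""
    def _replace(match):
        text = match.group(0)
        if keep_doctype and _DOCTYPE.match(text):
            return text  # leave the DOCTYPE untouched
        return ""        # drop anything else that looks like a declaration
    return _ANY_DECLARATION.sub(_replace, html)


# Invented sample input, modelled on the stray tags quoted above.
sample = (
    "<!DOCTYPE html>"
    "<!Co Dreef / Eelco de Graaff>"
    "<!e-mail j.dreef@law.leidenuniv.nl >"
    "<html><!-- kept --><body><form></form></body></html>"
)
cleaned = strip_bogus_declarations(sample)
without_doctype = strip_bogus_declarations(sample, keep_doctype=False)
```

If the DOCTYPE itself is what the parser rejects, passing keep_doctype=False drops it as well. With mechanize, the cleaned text would go where the slice is used above, in the response.set_data(...) call; depending on the mechanize and Python versions, get_data() may return bytes, in which case the data would need decoding before the string patterns apply.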
