How to submit a query to an .aspx page in Python


Question

I need to scrape query results from an .aspx web page.

<一个href=\"http://legistar.council.nyc.gov/Legislation.aspx\">http://legistar.council.nyc.gov/Legislation.aspx

The url is static, so how do I submit a query to this page and get the results? Assume we need to select "all years" and "all types" from the respective dropdown menus.

Somebody out there must know how to do this.

Answer

As an overview, you will need to perform four main tasks (a skeleton sketch of how they fit together follows the list):


  • to submit request(s) to the web site

  • to retrieve the web site's response(s)

  • to parse these responses

  • to have some logic to iterate over the tasks above, with the parameters associated with navigation (to the "next" page in the results list)
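As a hedged sketch (build_form_fields, parse_results and find_next_page are placeholder names introduced here for illustration, not real functions), the four tasks typically fit together in a loop of this shape:

import urllib
import urllib2

# Placeholder helpers -- stand-ins for logic shown or discussed below.
def build_form_fields():
    # Form fields for the first search request (the full list appears below).
    return [(r'ctl00$ContentPlaceHolder1$txtSearch', 'york')]

def parse_results(html):
    # Task 3: extract the rows of interest from one page of results.
    return []

def find_next_page(html):
    # Task 4: form fields for the "next" results page, or None when done.
    return None

uri = 'http://legistar.council.nyc.gov/Legislation.aspx'
headers = {'Content-Type': 'application/x-www-form-urlencoded'}

formFields = build_form_fields()
while formFields is not None:
    data = urllib.urlencode(formFields)        # task 1: submit the request
    f = urllib2.urlopen(urllib2.Request(uri, data, headers))
    html = f.read()                            # task 2: retrieve the response
    rows = parse_results(html)                 # task 3: parse this response
    formFields = find_next_page(html)          # task 4: navigate to the next page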

The HTTP request and response handling is done with methods and classes from Python's standard library modules urllib and urllib2. The parsing of the HTML pages can be done with the standard library's HTMLParser or with third-party modules such as Beautiful Soup.
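As a minimal illustration of the standard-library route (the LinkCollector class and the choice of gathering links are my own example, not taken from the site), an HTMLParser subclass looks like this:

from HTMLParser import HTMLParser   # Python 2 standard library

class LinkCollector(HTMLParser):
    # Collects the href attribute of every <a> tag fed to the parser.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkCollector()
parser.feed(open('tmp.htm').read())   # tmp.htm is written by the code below
print(parser.links)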

The following snippet demonstrates requesting and receiving the results of a search at the site indicated in the question. This site is ASP-driven, and as a result we need to ensure that we send several form fields, some of them with 'horrible' values, as these are used by the ASP logic to maintain state and to authenticate the request to some extent. Indeed, the requests have to be sent with the HTTP POST method, as this is what this ASP application expects. The main difficulty is identifying the form fields and associated values that ASP expects (getting pages with Python is the easy part).

This code is functional, or more precisely, was functional, until I removed most of the VSTATE value and possibly introduced a typo or two by adding comments.

import urllib
import urllib2

uri = 'http://legistar.council.nyc.gov/Legislation.aspx'

# The HTTP headers are useful to simulate a particular browser (some sites
# deny access to non-browsers: bots, etc.). We also need to pass the content
# type for the POST body.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
    'Accept': 'text/html,application/xhtml+xml,application/xml; q=0.9,*/*; q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded'
}

# We group the form fields and their values in a list (any
# iterable, actually) of name-value tuples. This helps with
# clarity and also makes it easier to encode them later.

formFields = (
   # the viewstate is actually 800+ characters in length! I truncated it
   # for this sample code.  It can be lifted from the first page
   # obtained from the site.  It may be ok to hardcode this value, or
   # it may have to be refreshed each time / each day, by essentially
   # running an extra page request and parse, for this specific value.
   (r'__VSTATE', r'7TzretNIlrZiKb7EOB3AQE ... ...2qd6g5xD8CGXm5EftXtNPt+H8B'),

   # following are more of these ASP form fields
   (r'__VIEWSTATE', r''),
   (r'__EVENTVALIDATION', r'/wEWDwL+raDpAgKnpt8nAs3q+pQOAs3q/pQOAs3qgpUOAs3qhpUOAoPE36ANAve684YCAoOs79EIAoOs89EIAoOs99EIAoOs39EIAoOs49EIAoOs09EIAoSs99EI6IQ74SEV9n4XbtWm1rEbB6Ic3/M='),
   (r'ctl00_RadScriptManager1_HiddenField', ''), 
   (r'ctl00_tabTop_ClientState', ''), 
   (r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
   (r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),

   #but then we come to fields of interest: the search
   #criteria the collections to search from etc.
                                                       # Check boxes  
   (r'ctl00$ContentPlaceHolder1$chkOptions$0', 'on'),  # file number
   (r'ctl00$ContentPlaceHolder1$chkOptions$1', 'on'),  # Legislative text
   (r'ctl00$ContentPlaceHolder1$chkOptions$2', 'on'),  # attachment
                                                       # etc. (not all listed)
   (r'ctl00$ContentPlaceHolder1$txtSearch', 'york'),   # Search text
   (r'ctl00$ContentPlaceHolder1$lstYears', 'All Years'),  # Years to include
   (r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'),  #types to include
   (r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation')  # Search button itself
)

# these have to be encoded    
encodedFields = urllib.urlencode(formFields)

req = urllib2.Request(uri, encodedFields, headers)
f = urllib2.urlopen(req)    # that's the actual call to the HTTP site

# *** here would normally be the in-memory parsing of f 
#     contents, but instead I store this to file
#     this is useful during design, allowing to have a
#     sample of what is to be parsed in a text editor, for analysis.

try:
  fout = open('tmp.htm', 'w')
  fout.writelines(f.readlines())
  fout.close()
except IOError:
  print('Could not open output file\n')
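As the comment in the snippet notes, the VSTATE value may have to be refreshed by an extra page request and parse. A hedged sketch of that step follows; the regular expression assumes the hidden <input> tags list their type, name and value attributes in that order, which should be verified against the real markup.

import re
import urllib2

# Fetch the search page once and pull the ASP hidden fields out with a
# regular expression. The attribute order in the pattern is an assumption.
page = urllib2.urlopen('http://legistar.council.nyc.gov/Legislation.aspx').read()
hidden = dict(re.findall(
    r'<input[^>]*type="hidden"[^>]*name="([^"]*)"[^>]*value="([^"]*)"', page))

vstate = hidden.get('__VSTATE', '')
eventValidation = hidden.get('__EVENTVALIDATION', '')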

That's about it for getting the initial page. As said above, one would then need to parse the page, i.e. find the parts of interest, gather them as appropriate, and store them to file/database/wherever. This job can be done in very many ways: using HTML parsers, XSLT-type technologies (indeed, after parsing the HTML into XML), or, for crude jobs, simple regular expressions. Also, one of the items one typically extracts is the "next" info, i.e. a link of sorts that can be used in a new request to the server to get subsequent pages.
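For example, a crude regular-expression pass for that "next" link could look like the sketch below. The link text "Next" and the markup pattern are guesses on my part; on ASP.NET grids the href is often a javascript:__doPostBack(...) call, so the saved tmp.htm should be inspected first.

import re

html = open('tmp.htm').read()

# Crude sketch: find an anchor whose visible text contains "Next".
# The pattern is an assumption about the markup, not taken from the site.
match = re.search(r'<a[^>]*href="([^"]*)"[^>]*>[^<]*Next', html, re.IGNORECASE)
if match:
    print('next page link: ' + match.group(1))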

This should give you a rough flavor of what "long hand" HTML scraping is about. There are many other approaches to this, such as dedicated utilities, scripts for Mozilla's (Firefox) GreaseMonkey plug-in, XSLT...

