How to submit a query to an .aspx page in Python

Question

I need to scrape query results from an .aspx web page.

http://legistar.council.nyc.gov/Legislation.aspx

The URL is static, so how do I submit a query to this page and get the results? Assume we need to select "all years" and "all types" from the respective dropdown menus.

Somebody out there must know how to do this.

Answer

As an overview, you will need to perform four main tasks:

  • Submit requests to the web site,
  • retrieve the responses from the site,
  • parse these responses,
  • have some logic to iterate over the tasks above, with parameters associated with the navigation (to the "next" page in the results list); see the sketch after this list.
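
To show how these four tasks fit together, here is a minimal sketch of the overall loop. The fetch_page, parse_rows and next_page_params callables are hypothetical placeholders for the request, parsing and navigation logic developed below; they are passed in as arguments so the skeleton stays self-contained.

# A minimal sketch of how the four tasks combine into one loop.
# fetch_page, parse_rows and next_page_params are hypothetical
# placeholders, passed in as callables, for the request, parsing
# and navigation logic shown further down.
def scrape_all(fetch_page, parse_rows, next_page_params, first_params):
    results = []
    params = first_params
    while params is not None:
        html = fetch_page(params)         # tasks 1+2: request and response
        results.extend(parse_rows(html))  # task 3: parse the response
        params = next_page_params(html)   # task 4: find the "next" page
    return results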

The HTTP request and response handling is done with methods and classes from Python's standard library modules urllib and urllib2. The parsing of the HTML pages can be done with the standard library's HTMLParser or with third-party modules such as Beautiful Soup.
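
As a warm-up, a minimal standard-library round trip might look like the sketch below; the link-collecting parser is only an illustration of how HTMLParser is subclassed, not part of the final solution.

import urllib2
from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):
    # collects the href attribute of every <a> tag encountered
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

html = urllib2.urlopen('http://legistar.council.nyc.gov/Legislation.aspx').read()
collector = LinkCollector()
collector.feed(html)
print('\n'.join(collector.links[:10]))  # the first few links on the page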

The following snippet demonstrates requesting and receiving a search at the site indicated in the question. This site is ASP-driven, and as a result we need to ensure that we send several form fields, some of them with 'horrible' values, as these are used by the ASP logic to maintain state and to authenticate the request to some extent. The requests have to be sent with the HTTP POST method, as this is what this ASP application expects. The main difficulty lies in identifying the form fields and associated values which ASP expects (getting pages with Python is the easy part).

This code is functional, or more precisely, was functional, until I removed most of the VSTATE value, and possibly introduced a typo or two by adding comments.

import urllib
import urllib2

uri = 'http://legistar.council.nyc.gov/Legislation.aspx'

#the http headers are useful to simulate a particular browser (some sites deny
#access to non-browsers: bots, etc.); we also need to pass the content type
#for the POST data.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded'
}

# we group the form fields and their values in a list (any
# iterable, actually) of name-value tuples.  This helps
# with clarity and also makes it easy to encode them later.

formFields = (
   # the viewstate is actually 800+ characters in length! I truncated it
   # for this sample code.  It can be lifted from the first page
   # obtained from the site.  It may be ok to hardcode this value, or
   # it may have to be refreshed each time / each day, by essentially
   # running an extra page request and parse, for this specific value.
   (r'__VSTATE', r'7TzretNIlrZiKb7EOB3AQE ... ...2qd6g5xD8CGXm5EftXtNPt+H8B'),

   # following are more of these ASP form fields
   (r'__VIEWSTATE', r''),
   (r'__EVENTVALIDATION', r'/wEWDwL+raDpAgKnpt8nAs3q+pQOAs3q/pQOAs3qgpUOAs3qhpUOAoPE36ANAve684YCAoOs79EIAoOs89EIAoOs99EIAoOs39EIAoOs49EIAoOs09EIAoSs99EI6IQ74SEV9n4XbtWm1rEbB6Ic3/M='),
   (r'ctl00_RadScriptManager1_HiddenField', ''), 
   (r'ctl00_tabTop_ClientState', ''), 
   (r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
   (r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),

   #but then we come to the fields of interest: the search
   #criteria, the collections to search from, etc.
                                                       # Check boxes
   (r'ctl00$ContentPlaceHolder1$chkOptions$0', 'on'),  # file number
   (r'ctl00$ContentPlaceHolder1$chkOptions$1', 'on'),  # Legislative text
   (r'ctl00$ContentPlaceHolder1$chkOptions$2', 'on'),  # attachment
                                                       # etc. (not all listed)
   (r'ctl00$ContentPlaceHolder1$txtSearch', 'york'),   # Search text
   (r'ctl00$ContentPlaceHolder1$lstYears', 'All Years'),  # Years to include
   (r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'),  #types to include
   (r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation')  # Search button itself
)

# these have to be encoded    
encodedFields = urllib.urlencode(formFields)

req = urllib2.Request(uri, encodedFields, headers)
f = urllib2.urlopen(req)  # that's the actual call to the http site.

# *** here would normally be the in-memory parsing of f 
#     contents, but instead I store this to file
#     this is useful during design, allowing to have a
#     sample of what is to be parsed in a text editor, for analysis.

try:
  fout = open('tmp.htm', 'w')
  fout.writelines(f.readlines())
  fout.close()
except IOError:
  print('Could not open output file')
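
As the comments above note, the __VSTATE (and __EVENTVALIDATION) values may have to be refreshed by first requesting the page and lifting them out of the response. A rough sketch of that extra request-and-parse step follows, reusing the uri defined above; the regular expression assumes the values sit in ordinary hidden <input> tags, which should be verified against the actual page source.

import re
import urllib2

def get_hidden_field(html, field_name):
    # assumes markup of the form:
    # <input type="hidden" name="__VSTATE" id="__VSTATE" value="..." />
    pattern = r'name="%s"[^>]*value="([^"]*)"' % re.escape(field_name)
    match = re.search(pattern, html)
    return match.group(1) if match else ''

initial = urllib2.urlopen(uri).read()  # plain GET of the search page
vstate = get_hidden_field(initial, '__VSTATE')
event_validation = get_hidden_field(initial, '__EVENTVALIDATION')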

That's about it for getting the initial page. As said above, one then needs to parse the page, i.e. find the parts of interest, gather them as appropriate, and store them to file/database/wherever. This job can be done in very many ways: using HTML parsers, XSLT-type technologies (indeed, after parsing the HTML to XML), or, even for crude jobs, simple regular expressions. Also, one of the items one typically extracts is the "next" info, i.e. a link of sorts that can be used in a new request to the server to get subsequent pages.
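
For instance, a minimal parsing sketch with Beautiful Soup over the tmp.htm saved above; the table id used here is an assumption about the page's markup (guessed from the form field names), and the real id should be read off the saved HTML.

from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 style import

soup = BeautifulSoup(open('tmp.htm').read())

# ctl00_ContentPlaceHolder1_gridMain is a guess at the results grid's
# id, based on the form field names above; check tmp.htm for the real one.
grid = soup.find('table', {'id': 'ctl00_ContentPlaceHolder1_gridMain'})
if grid is not None:
    for row in grid.findAll('tr'):
        cells = [''.join(td.findAll(text=True)).strip()
                 for td in row.findAll('td')]
        if cells:
            print(' | '.join(cells))  # one result row per line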

This should give you a rough flavor of what "long hand" HTML scraping is about. There are many other approaches, such as dedicated utilities, scripts in Mozilla's (Firefox) GreaseMonkey plug-in, XSLT...
