How to scrape aspx pages with python


Problem description


I am trying to scrape a site, https://www.searchiqs.com/nybro/ (you have to click "Log In as Guest" to get to the search form). If I search for a Party 1 term, say "Andrew", the results are paginated; also, the request type is POST so the URL does not change, and the session times out very quickly. So quickly that if I wait ten minutes and refresh the search URL page, it gives me a timeout error.


I got started with scraping recently, so I have mostly been doing GET requests where I can decipher the URL. So far I have realized that I will have to look at the DOM. Using Chrome's developer tools, I have found the headers. From the Network tab, I have also found the following form data that is passed from the search page to the results page:

__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:/wEPaA8FDzhkM2IyZjUwNzg...(i have truncated this for length)
__VIEWSTATEGENERATOR:F92D01D0
__EVENTVALIDATION:/wEdAJ8BsTLFDUkTVU3pxZz92BxwMddqUSAXqb... (i have truncated this for length)
BrowserWidth:1243
BrowserHeight:705
ctl00$ContentPlaceHolder1$scrollPos:0
ctl00$ContentPlaceHolder1$txtName:david
ctl00$ContentPlaceHolder1$chkIgnorePartyType:on
ctl00$ContentPlaceHolder1$txtFromDate:
ctl00$ContentPlaceHolder1$txtThruDate:
ctl00$ContentPlaceHolder1$cboDocGroup:(ALL)
ctl00$ContentPlaceHolder1$cboDocType:(ALL)
ctl00$ContentPlaceHolder1$cboTown:(ALL)
ctl00$ContentPlaceHolder1$txtPinNum:
ctl00$ContentPlaceHolder1$txtBook:
ctl00$ContentPlaceHolder1$txtPage:
ctl00$ContentPlaceHolder1$txtUDFNum:
ctl00$ContentPlaceHolder1$txtCaseNum:
ctl00$ContentPlaceHolder1$cmdSearch:Search


All the ones in caps are hidden. I have also managed to figure out the results structure.
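The hidden fields above (`__VIEWSTATE`, `__EVENTVALIDATION`, etc.) change on every page load, so they would have to be scraped from the form before each POST. A minimal sketch of collecting them and merging in the visible search fields listed above — whether a plain POST actually works on this site after the Javascript login is untested, so treat this only as an illustration of the payload shape:

```python
from bs4 import BeautifulSoup

def collect_hidden_fields(html):
    """Gather every hidden <input> (e.g. __VIEWSTATE, __EVENTVALIDATION)
    from the search form's HTML into a name -> value dict."""
    soup = BeautifulSoup(html, "html.parser")
    return {tag["name"]: tag.get("value", "")
            for tag in soup.find_all("input", type="hidden")
            if tag.has_attr("name")}

def build_search_payload(hidden_fields, name):
    """Merge the scraped hidden fields with the visible search fields,
    using the field names captured from the Network tab."""
    payload = dict(hidden_fields)
    payload.update({
        "ctl00$ContentPlaceHolder1$txtName": name,
        "ctl00$ContentPlaceHolder1$chkIgnorePartyType": "on",
        "ctl00$ContentPlaceHolder1$cboDocGroup": "(ALL)",
        "ctl00$ContentPlaceHolder1$cboDocType": "(ALL)",
        "ctl00$ContentPlaceHolder1$cboTown": "(ALL)",
        "ctl00$ContentPlaceHolder1$cmdSearch": "Search",
    })
    return payload
```

The resulting dict could then be POSTed to the results URL with something like `requests.Session().post(url, data=payload)`, reusing the session so the short-lived cookies survive between requests.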


My script thus far is really pathetic, as I am completely blank on what to do next. I still have to submit the form, analyze the pagination, and scrape the results, but I have absolutely no idea how to proceed.

import re
import urlparse
import mechanize

from bs4 import BeautifulSoup

class DocumentFinderScraper(object):
    def __init__(self):
        self.url = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"
        self.br = mechanize.Browser()
        self.br.addheaders = [('User-agent', 
                               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]

    def scrape(self):
        ## TO DO
        # submit form
        # get return URL
        # scrape results
        # analyze pagination
        pass


if __name__ == '__main__':
    scraper = DocumentFinderScraper()
    scraper.scrape()

Any help will be much appreciated.

Solution


I disabled Javascript and visited https://www.searchiqs.com/nybro/ and the form looks like this:


As you can see, the Log In and Log In as Guest buttons are disabled. This makes it impossible for Mechanize to work, because it cannot process Javascript, so you won't be able to submit the form.


For this kind of problem you can use Selenium, which will simulate a full browser, with the disadvantage of being slower than Mechanize.


This code should log you in using Selenium:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

usr = ""
pwd = ""

driver = webdriver.Firefox() 
driver.get("https://www.searchiqs.com/nybro/")
assert "IQS" in driver.title
elem = driver.find_element_by_id("txtUserID")
elem.send_keys(usr)
elem = driver.find_element_by_id("txtPassword")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)
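From there the search form can be driven in the same `driver` session using the field names captured in the question, e.g. `driver.find_element_by_name("ctl00$ContentPlaceHolder1$txtName")`, and the rendered results page handed to BeautifulSoup via `driver.page_source`. A sketch of that parsing step — note the grid's `id` here is a guess, not taken from the real page, so check the actual page source and adjust it:

```python
from bs4 import BeautifulSoup

def parse_result_rows(html, table_id="ContentPlaceHolder1_grdResults"):
    """Pull the cell text out of each row of the results grid.

    table_id is an assumed id for the results table -- inspect
    driver.page_source on the real site and replace it.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip the header row, which uses <th> cells
            rows.append(cells)
    return rows
```

For pagination you would call this on each page, clicking the pager links in Selenium between calls; since ASP.NET pagers post back rather than link to new URLs, the browser handles the `__doPostBack` machinery for you.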

