Why does this ScraperWiki for an ASPX site return only the same page of search results?


Problem description


I'm trying to scrape an ASP-powered site using ScraperWiki's tools.

I want to grab a list of BBSes in a particular area code from the BBSmates.com website. The site displays 20 BBS search results at a time, so I will have to do form submits in order to move from one page of results to the next.

This blog post helped me get started. I thought the following code would grab the final page of BBS listings for the 314 area code (page 79).

However, the response I get is the FIRST page.

import mechanize

url = 'http://bbsmates.com/browsebbs.aspx?BBSName=&AreaCode=314'
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
response = br.open(url)
html = response.read()

br.select_form(name='aspnetForm')
br.form.set_all_readonly(False)  # hidden fields are read-only until this is called
br['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$GridView1'
br['__EVENTARGUMENT'] = 'Page$79'
print(br.form)
response2 = br.submit()

html2 = response2.read()
print(html2)

The blog post I cited above mentions that in their case there was a problem with a SubmitControl, so I tried disabling the two SubmitControls on this form.

br.find_control("ctl00$cmdLogin").disabled = True

Disabling cmdLogin generated HTTP Error 500.

br.find_control("ctl00$ContentPlaceHolder1$Button1").disabled = True

Disabling ContentPlaceHolder1$Button1 didn't make any difference. The submit went through, but the page it returned was still page 1 of the search results.

It's worth noting that this site does NOT use "Page$Next."

Can anyone help me figure out what I need to do to get the ASPX form submit to work?

Solution

You need to post the values the page gives (EVENTVALIDATION, VIEWSTATE, etc.).

This code will work (note that it uses the awesome Requests library and not Mechanize):

import lxml.html
import requests

starturl = 'http://bbsmates.com/browsebbs.aspx?BBSName=&AreaCode=314'
s = requests.Session()  # create a session object
r1 = s.get(starturl)  # get page 1
root = lxml.html.fromstring(r1.text)

# pick up the hidden form fields that the page's JavaScript postback submits
EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]')[0].attrib['value']
VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]')[0].attrib['value']

# build a dictionary to post to the site with the values we have collected.
# __EVENTARGUMENT can be changed to fetch another result page (Page$3, Page$4, Page$5 etc.)
payload = {
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
    '__EVENTARGUMENT': 'Page$25',
    '__EVENTVALIDATION': EVENTVALIDATION,
    '__VIEWSTATE': VIEWSTATE,
    '__VIEWSTATEENCRYPTED': '',
    'ctl00$txtUsername': '', 'ctl00$txtPassword': '',
    'ctl00$ContentPlaceHolder1$txtBBSName': '',
    'ctl00$ContentPlaceHolder1$txtSysop': '',
    'ctl00$ContentPlaceHolder1$txtSoftware': '',
    'ctl00$ContentPlaceHolder1$txtCity': '', 'ctl00$ContentPlaceHolder1$txtState': '',
    'ctl00$ContentPlaceHolder1$txtCountry': '',
    'ctl00$ContentPlaceHolder1$txtZipCode': '',
    'ctl00$ContentPlaceHolder1$txtAreaCode': '314',
    'ctl00$ContentPlaceHolder1$txtPrefix': '',
    'ctl00$ContentPlaceHolder1$txtDescription': '',
    'ctl00$ContentPlaceHolder1$Activity': 'rdoBoth',
    'ctl00$ContentPlaceHolder1$drpRPP': '20',
}
# post it
r2 = s.post(starturl, data=payload)
# the response is now the result page named in __EVENTARGUMENT
print(r2.text)

When you reach result page 21, you have to pick up the VIEWSTATE and EVENTVALIDATION values again from the page you just received, and do the same every 20 pages after that.
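
One way to avoid counting to 20 is to re-read the hidden fields from every response before posting the next request. What follows is a minimal sketch of that idea, not the answerer's code. It swaps lxml for the standard library's html.parser so that it runs without extra dependencies, and the names `hidden_fields`, `fetch_pages` and the `post` callable are hypothetical, introduced only for illustration. That the site accepts fresh values on every request is an assumption.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in `html`."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def fetch_pages(post, first_html, form_fields, last_page,
                target="ctl00$ContentPlaceHolder1$GridView1"):
    """Yield (page_number, html) for result pages 2..last_page.

    `post` is a hypothetical callable: it takes the payload dict and
    returns the HTML text of the server's response.
    """
    html = first_html
    for page in range(2, last_page + 1):
        payload = dict(form_fields)
        # Fresh __VIEWSTATE / __EVENTVALIDATION from the latest response.
        payload.update(hidden_fields(html))
        payload["__EVENTTARGET"] = target
        payload["__EVENTARGUMENT"] = "Page$%d" % page
        html = post(payload)
        yield page, html
```

With the session from the code above, `post` could be `lambda payload: s.post(starturl, data=payload).text`, and `form_fields` would be the dictionary of search fields the form expects.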

Note that there are a few values that you post that are empty, and a few that include values. The full list is like this:

'ctl00$txtUsername':'',
'ctl00$txtPassword':'',
'ctl00$ContentPlaceHolder1$txtBBSName':'',
'ctl00$ContentPlaceHolder1$txtSysop':'',
'ctl00$ContentPlaceHolder1$txtSoftware':'',
'ctl00$ContentPlaceHolder1$txtCity':'',
'ctl00$ContentPlaceHolder1$txtState':'',
'ctl00$ContentPlaceHolder1$txtCountry':'',
'ctl00$ContentPlaceHolder1$txtZipCode':'',
'ctl00$ContentPlaceHolder1$txtAreaCode':'314',
'ctl00$ContentPlaceHolder1$txtPrefix':'',
'ctl00$ContentPlaceHolder1$txtDescription':'',
'ctl00$ContentPlaceHolder1$Activity':'rdoBoth',
'ctl00$ContentPlaceHolder1$drpRPP':'20'
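
Because most of these keys share a prefix and most values are empty, the list can be generated instead of typed out. The sketch below is only an illustration of that: the function name `search_form_fields` and its parameters are hypothetical, and the field names are copied from the list above and not checked against the live site.

```python
PREFIX = "ctl00$ContentPlaceHolder1$"

# Text boxes of the search form, as named in the list above (without "txt").
TEXT_FIELDS = ["BBSName", "Sysop", "Software", "City", "State",
               "Country", "ZipCode", "AreaCode", "Prefix", "Description"]


def search_form_fields(area_code="", results_per_page="20"):
    """Build the non-hidden form fields; every text box defaults to empty."""
    fields = {"ctl00$txtUsername": "", "ctl00$txtPassword": ""}
    for name in TEXT_FIELDS:
        fields[PREFIX + "txt" + name] = ""
    fields[PREFIX + "txtAreaCode"] = area_code
    fields[PREFIX + "Activity"] = "rdoBoth"
    fields[PREFIX + "drpRPP"] = results_per_page
    return fields
```

Calling `search_form_fields(area_code="314")` gives the same 14 entries as the list above. The `__EVENTTARGET`, `__EVENTARGUMENT`, `__VIEWSTATE` and `__EVENTVALIDATION` entries still have to be added before posting.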

Here is a discussion on the Scraperwiki mailing list on a similar problem: https://groups.google.com/forum/#!topic/scraperwiki/W0Xi7AxfZp0
