Crawling through pages with PostBack data javascript Python Scrapy
Problem Description
I'm crawling through some directories with ASP.NET programming via Scrapy.
The pages to crawl through are encoded as such:
javascript:__doPostBack('ctl00$MainContent$List','Page$X')
where X is an int between 1 and 180. The MainContent argument is always the same. I have no idea how to crawl into these. I would love to add something to the SLE rules as simple as allow=('Page$') or attrs='__doPostBack', but my guess is that I have to be trickier in order to pull the info from the javascript "link."
If it's easier to "unmask" each of the absolute links from the javascript code and save those to a csv, then use that csv to load requests into a new scraper, that's okay, too.
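For that "unmasking" route, a small regex over the href values can recover the target control and the Page$X argument from each link. This is only a sketch: the __doPostBack href format is copied from the question above, while the hrefs list and variable names are illustrative.

```python
import re

# Hypothetical pagination hrefs in the shape the question describes;
# the control name and Page$X arguments are taken from the question
hrefs = [
    "javascript:__doPostBack('ctl00$MainContent$List','Page$2')",
    "javascript:__doPostBack('ctl00$MainContent$List','Page$180')",
]

# Capture the target control and the numeric Page$X argument
pattern = re.compile(r"__doPostBack\('([^']+)','Page\$(\d+)'\)")

pages = []
for href in hrefs:
    match = pattern.search(href)
    if match:
        pages.append((match.group(1), int(match.group(2))))
# pages -> [('ctl00$MainContent$List', 2), ('ctl00$MainContent$List', 180)]
```

In a spider, the same pattern could be run over response.xpath('//a/@href').extract() to enumerate all page numbers before issuing the POST requests.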
Answer
This kind of pagination is not as trivial as it may seem. It was an interesting challenge to solve. There are several important notes about the solution provided below:
- the idea here is to follow the pagination page by page, passing around the current page in the Request.meta dictionary
- a regular BaseSpider is used, since there is some logic involved in the pagination
- it is important to provide headers pretending to be a real browser
- it is important to yield FormRequests with dont_filter=True, since we are basically making a POST request to the same URL but with different parameters
Code:
import re

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider

HEADERS = {
    'X-MicrosoftAjax': 'Delta=true',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
}
URL = 'http://exitrealty.com/agent_list.aspx?firstName=&lastName=&country=USA&state=NY'


class ExitRealtySpider(BaseSpider):
    name = "exit_realty"

    allowed_domains = ["exitrealty.com"]
    start_urls = [URL]

    def parse(self, response):
        # submit a form (first page): collect every input of the
        # ASP.NET form, keeping empty values for inputs without one
        self.data = {}
        for form_input in response.css('form#aspnetForm input'):
            name = form_input.xpath('@name').extract()[0]
            try:
                value = form_input.xpath('@value').extract()[0]
            except IndexError:
                value = ""
            self.data[name] = value

        self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList'
        self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
        self.data['__EVENTARGUMENT'] = 'Page$1'

        return FormRequest(url=URL,
                           method='POST',
                           callback=self.parse_page,
                           formdata=self.data,
                           meta={'page': 1},
                           dont_filter=True,
                           headers=HEADERS)

    def parse_page(self, response):
        current_page = response.meta['page'] + 1

        # parse agents (TODO: yield items instead of printing)
        for agent in response.xpath('//a[@class="regtext"]/text()'):
            print agent.extract()
        print "------"

        # request the next page: pull the fresh __EVENTVALIDATION and
        # __VIEWSTATE out of the pipe-delimited AJAX response body
        data = {
            '__EVENTARGUMENT': 'Page$%d' % current_page,
            '__EVENTVALIDATION': re.search(r"__EVENTVALIDATION\|(.*?)\|", response.body, re.MULTILINE).group(1),
            '__VIEWSTATE': re.search(r"__VIEWSTATE\|(.*?)\|", response.body, re.MULTILINE).group(1),
            '__ASYNCPOST': 'true',
            '__EVENTTARGET': 'ctl00$MainContent$agentList',
            'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
            '': ''
        }

        return FormRequest(url=URL,
                           method='POST',
                           formdata=data,
                           callback=self.parse_page,
                           meta={'page': current_page},
                           dont_filter=True,
                           headers=HEADERS)
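The re.search calls for __VIEWSTATE and __EVENTVALIDATION rely on the shape of an ASP.NET AJAX partial-postback response, a pipe-delimited stream in which each hidden field's value immediately follows its name. A minimal sketch, using a trimmed, hypothetical response body (real responses carry much longer base64 values between the same delimiters):

```python
import re

# Trimmed, hypothetical ASP.NET AJAX delta response; the field names
# are real ASP.NET hidden fields, the values are made up
body = ("1|#||4|123|updatePanel|ctl00_MainContent_UpdatePanel1|<div></div>|"
        "hiddenField|__VIEWSTATE|/wEPDwUKfake|"
        "hiddenField|__EVENTVALIDATION|/wEWAgKfake|")

# Non-greedy capture between the field name and the next pipe
viewstate = re.search(r"__VIEWSTATE\|(.*?)\|", body).group(1)
validation = re.search(r"__EVENTVALIDATION\|(.*?)\|", body).group(1)
# viewstate -> '/wEPDwUKfake', validation -> '/wEWAgKfake'
```

The server answers in this delta format rather than with a full HTML page because the request carries the X-MicrosoftAjax: Delta=true header, which is why the spider extracts the hidden fields with regexes instead of selectors.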