Crawling through pages with PostBack data javascript Python Scrapy


Problem description

I'm crawling through some directories with ASP.NET programming via Scrapy.

The pages to crawl through are encoded as such:

javascript:__doPostBack('ctl00$MainContent$List','Page$X')

where X is an int between 1 and 180. The MainContent argument is always the same. I have no idea how to crawl into these. I would love to add something to the SLE rules as simple as allow=('Page$') or attrs='__doPostBack', but my guess is that I have to be trickier in order to pull the info from the javascript "link."
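
(For reference, the Page$X arguments are easy to pull out of such hrefs with a regex. A minimal sketch, assuming the pager links carry the __doPostBack call in their href attribute; the XPath is an assumption about the listing markup:)

import re

def extract_postback_pages(response):
    # collect the X values from javascript:__doPostBack('...','Page$X') links
    pages = []
    for href in response.xpath('//a[contains(@href, "__doPostBack")]/@href').extract():
        match = re.search(r"'Page\$(\d+)'", href)
        if match:
            pages.append(int(match.group(1)))
    return pages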

If it's easier to "unmask" each of the absolute links from the javascript code and save those to a csv, then use that csv to load requests into a new scraper, that's okay, too.
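
(For reference, the loading side of that CSV approach is simple, assuming the links really can be unmasked into plain URLs. A minimal sketch with a hypothetical pages.csv holding one URL per line:)

import csv

from scrapy.http import Request
from scrapy.spider import BaseSpider


class CsvSeededSpider(BaseSpider):
    name = "csv_seeded"

    def start_requests(self):
        # pages.csv is an assumed filename holding one unmasked URL per line
        with open('pages.csv') as f:
            for row in csv.reader(f):
                yield Request(row[0], callback=self.parse)

    def parse(self, response):
        pass  # TODO: extract items from each page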

Recommended answer

This kind of pagination is not as trivial as it may seem. It was an interesting challenge to solve. There are several important notes about the solution provided below:

  • the idea here is to follow the pagination page by page, passing the current page number around in the Request.meta dictionary
  • using a regular BaseSpider since there is some logic involved in the pagination
  • it is important to provide headers pretending to be a real browser
  • it is important to yield FormRequests with dont_filter=True since we are basically making a POST request to the same URL but with different parameters

Code:

import re

from scrapy.http import FormRequest
from scrapy.spider import BaseSpider


HEADERS = {
    'X-MicrosoftAjax': 'Delta=true',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
}
URL = 'http://exitrealty.com/agent_list.aspx?firstName=&lastName=&country=USA&state=NY'


class ExitRealtySpider(BaseSpider):
    name = "exit_realty"

    allowed_domains = ["exitrealty.com"]
    start_urls = [URL]

    def parse(self, response):
        # submit a form (first page)
        self.data = {}
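        # copy every input field of the ASP.NET form (__VIEWSTATE,
        # __EVENTVALIDATION and friends) so the postback is accepted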
        for form_input in response.css('form#aspnetForm input'):
            name = form_input.xpath('@name').extract()[0]
            try:
                value = form_input.xpath('@value').extract()[0]
            except IndexError:
                value = ""
            self.data[name] = value

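        # then override the postback fields that request the first results page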
        self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList'
        self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
        self.data['__EVENTARGUMENT'] = 'Page$1'

        return FormRequest(url=URL,
                           method='POST',
                           callback=self.parse_page,
                           formdata=self.data,
                           meta={'page': 1},
                           dont_filter=True,
                           headers=HEADERS)

    def parse_page(self, response):
        current_page = response.meta['page'] + 1

        # parse agents (TODO: yield items instead of printing)
        for agent in response.xpath('//a[@class="regtext"]/text()'):
            print(agent.extract())
        print("------")

        # request the next page
        data = {
            '__EVENTARGUMENT': 'Page$%d' % current_page,
            '__EVENTVALIDATION': re.search(r"__EVENTVALIDATION\|(.*?)\|", response.body, re.MULTILINE).group(1),
            '__VIEWSTATE': re.search(r"__VIEWSTATE\|(.*?)\|", response.body, re.MULTILINE).group(1),
            '__ASYNCPOST': 'true',
            '__EVENTTARGET': 'ctl00$MainContent$agentList',
            'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
            '': ''
        }

        return FormRequest(url=URL,
                           method='POST',
                           formdata=data,
                           callback=self.parse_page,
                           meta={'page': current_page},
                           dont_filter=True,
                           headers=HEADERS)
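
A note on the two re.search calls: an ASP.NET AJAX ("delta") response is not regular HTML but a pipe-delimited payload, which is why the hidden field values are picked out from between escaped | characters. A quick illustration on a made-up, heavily shortened response body:

import re

# hypothetical excerpt; real delta responses are much longer but keep
# the same pipe-delimited length|type|id|content| layout
body = "1|#||4|...|hiddenField|__VIEWSTATE|/wEPDwUKLTk2N...|hiddenField|__EVENTVALIDATION|/wEWAgKYxYz...|"

print(re.search(r"__VIEWSTATE\|(.*?)\|", body).group(1))  # /wEPDwUKLTk2N...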
