Scrapy CrawlSpider for AJAX content

Problem description

I am attempting to crawl a site for news articles. My start_url contains:

(1) links to each article: http://example.com/symbol/TSLA

(2) a "More" button that makes an AJAX call that dynamically loads more articles within the same start_url: http://example.com/account/ajax_headlines_content?type=in_focus_articles&page=0&slugs=tsla&is_symbol_page=true

A parameter to the AJAX call is "page", which is incremented each time the "More" button is clicked. For example, clicking "More" once will load an additional n articles and update the page parameter in the "More" button onClick event, so that next time "More" is clicked, "page" two of articles will be loaded (assuming "page" 0 was loaded initially, and "page" 1 was loaded on the first click).
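
Since the only thing that changes between successive AJAX requests is the "page" query parameter, the URLs can be generated from one template. A minimal sketch (the template is the URL quoted above; the ajax_url helper is just an illustrative name):

# Sketch: build the AJAX URL for a given zero-based page number.
AJAX_TEMPLATE = ('http://example.com/account/ajax_headlines_content'
                 '?type=in_focus_articles&page={page}&slugs=tsla&is_symbol_page=true')

def ajax_url(page):
    return AJAX_TEMPLATE.format(page=page)

print(ajax_url(0))  # what the symbol page loads initially
print(ajax_url(1))  # what the first click on "More" requests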

For each "page" I would like to scrape the contents of each article using Rules, but I do not know how many "pages" there are and I do not want to choose some arbitrary m (e.g., 10k). I can't seem to figure out how to set this up.

From this question, Scrapy Crawl URLs in Order, I have tried to create a URL list of potential URLs, but I can't determine how and where to send a new URL from the pool after parsing the previous URL and ensuring it contains news links for a CrawlSpider. My Rules send responses to a parse_items callback, where the article contents are parsed.

Is there a way to observe the contents of the links page (similar to the BaseSpider example) before applying Rules and calling parse_items so that I may know when to stop crawling?
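
For context, CrawlSpider does offer such a hook: parse_start_url() is called on every response that reaches the spider's default parse() callback (the start_urls responses by default), in addition to the Rules being applied. A rough sketch of how it could drive the pagination, assuming the ajax_urls list and current_page counter from the code below, and assuming the empty AJAX page contains the "no Focus articles" text referenced in the answer:

    def parse_start_url(self, response):
        # Stop paginating once the AJAX endpoint reports no more articles
        # (the exact text is an assumption about the site's empty response).
        if "There are no Focus articles on your stocks." in response.body:
            return []
        next_url = self.ajax_urls[self.current_page]  # ajax_urls[0] is page=1
        self.current_page += 1
        # No explicit callback: CrawlSpider.parse() applies the Rules to the next
        # page and routes it back through parse_start_url() again.
        return [Request(next_url)]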

Simplified code (I removed several of the fields I'm parsing for clarity):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy import log
# NewsItem is the project's item class (import it from your items module)

class ExampleSite(CrawlSpider):

    name = "so"
    download_delay = 2

    more_pages = True
    current_page = 0

    allowed_domains = ['example.com']

    start_urls = ['http://example.com/account/ajax_headlines_content?type=in_focus_articles&page=0'+
                      '&slugs=tsla&is_symbol_page=true']

    ##could also use
    ##start_urls = ['http://example.com/symbol/tsla']

    ajax_urls = []
    for i in range(1,1000):
        ajax_urls.append('http://example.com/account/ajax_headlines_content?type=in_focus_articles&page='+str(i)+
                      '&slugs=tsla&is_symbol_page=true')

    rules = (
             Rule(SgmlLinkExtractor(allow=('/symbol/tsla', ))),
             Rule(SgmlLinkExtractor(allow=('/news-article.*tesla.*', '/article.*tesla.*', )), callback='parse_item')
            )

    ##need something like this??
    ##override parse?
    ## if response.body == 'no results':
        ## self.more_pages = False
        ## ##stop crawler??
    ## else:
        ## self.current_page = self.current_page + 1
        ## yield Request(self.ajax_urls[self.current_page], callback=self.parse_start_url)


    def parse_item(self, response):

        self.log("Scraping: %s" % response.url, level=log.INFO)

        hxs = Selector(response)

        item = NewsItem()

        item['url'] = response.url
        item['source'] = 'example'
        item['title'] = hxs.xpath('//title/text()')
        item['date'] = hxs.xpath('//div[@class="article_info_pos"]/span/text()')

        yield item

Recommended answer

CrawlSpider may be too limited for your purposes here. If you need a lot of custom logic, you are usually better off inheriting from Spider.

Scrapy provides a CloseSpider exception that can be raised when you need to stop parsing under certain conditions. The page you are crawling returns the message "There are no Focus articles on your stocks" once you go past the last page, so you can check for that message and stop iterating when it appears.

In your case you can go with something like this:

from urlparse import urljoin  # Python 2 stdlib, matching the Scrapy version used here

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.exceptions import CloseSpider
from scrapy.selector import Selector
from scrapy import log
# NewsItem is the project's item class (import it from your items module)

class ExampleSite(Spider):
    name = "so"
    download_delay = 0.1

    more_pages = True
    next_page = 1

    start_urls = ['http://example.com/account/ajax_headlines_content?type=in_focus_articles&page=0'+
                      '&slugs=tsla&is_symbol_page=true']

    allowed_domains = ['example.com']

    def create_ajax_request(self, page_number):
        """
        Helper function to create ajax request for next page.
        """
        ajax_template = 'http://example.com/account/ajax_headlines_content?type=in_focus_articles&page={pagenum}&slugs=tsla&is_symbol_page=true'

        url = ajax_template.format(pagenum=page_number)
        return Request(url, callback=self.parse)

    def parse(self, response):
        """
        Parsing of each page.
        """
        if "There are no Focus articles on your stocks." in response.body:
            self.log("About to close spider", log.WARNING)
            raise CloseSpider(reason="no more pages to parse")


        # there is some content; extract links to the articles
        sel = Selector(response)
        links_xpath = "//div[@class='symbol_article']/a/@href"
        links = sel.xpath(links_xpath).extract()
        for link in links:
            url = urljoin(response.url, link)
            # follow link to article
            # commented out to see how pagination works
            #yield Request(url, callback=self.parse_item)

        # generate request for next page (yield before incrementing so page 1 is not skipped)
        yield self.create_ajax_request(self.next_page)
        self.next_page += 1

    def parse_item(self, response):
        """
        Parsing of each article page.
        """
        self.log("Scraping: %s" % response.url, level=log.INFO)

        hxs = Selector(response)

        item = NewsItem()

        item['url'] = response.url
        item['source'] = 'example'
        item['title'] = hxs.xpath('//title/text()')
        item['date'] = hxs.xpath('//div[@class="article_info_pos"]/span/text()')

        yield item
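
Both code samples assume a NewsItem item class defined elsewhere in the project (its definition was omitted from the question). A minimal definition matching the fields populated in parse_item() would look something like this:

from scrapy.item import Item, Field

class NewsItem(Item):
    # fields used by parse_item() above; extend with the omitted fields as needed
    url = Field()
    source = Field()
    title = Field()
    date = Field()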
