Force Scrapy to crawl links in the order they appear


Problem description

I'm writing a spider with Scrapy to crawl a website. The index page is a list of links like www.link1.com, www.link2.com, www.link3.com, and the site is updated very often, so my crawler is part of a process that runs every few hours. I would like to crawl only the new links that I haven't crawled yet. My problem is that Scrapy randomizes the order in which it treats each link when going deep. Is it possible to force Scrapy to crawl in order, like 1, then 2, then 3, so that I can save the last link I crawled and, when the process starts again, just compare link 1 with the former link 1?

Hope this is understandable; sorry for my poor English.

Please reply,

Thanks

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class SymantecSpider(CrawlSpider):

    name = 'symantecSpider'
    allowed_domains = ['symantec.com']
    start_urls = [
        'http://www.symantec.com/security_response/landing/vulnerabilities.jsp'
        ]
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="mrgnMD"]/following-sibling::table')), callback='parse_item')]

    def parse_item(self, response):
        # Append each crawled URL to a file; the order of these calls is not guaranteed
        open("test.t", "ab").write(response.url + "\n")

Recommended answer

Try this example.
Construct a list and append all the links to it.
Then pop them one by one to get your requests in order.

I recommend doing something like @Hassan mentions and piping your contents to a database (a sketch of such a pipeline follows the spider code below).

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log


class SymantecSpider(BaseSpider):
    name = 'symantecSpider'
    allowed_domains = ['symantec.com']
    allLinks = []
    base_url = "http://www.symantec.com"

    def start_requests(self):
        return [Request('http://www.symantec.com/security_response/landing/vulnerabilities.jsp', callback=self.parseMgr)]

    def parseMgr(self, response):
        # Grab all the links from the index table, in page order, and store them in allLinks
        self.allLinks.extend(HtmlXPathSelector(response).select("//table[@class='defaultTableStyle tableFontMD tableNoBorder']/tbody/tr/td[2]/a/@href").extract())
        if self.allLinks:
            return Request(self.base_url + self.allLinks.pop(0), callback=self.pageParser)

    # Cycle through allLinks in order, requesting one link at a time
    def pageParser(self, response):
        log.msg('response: %s' % response.url, level=log.INFO)
        if self.allLinks:
            return Request(self.base_url + self.allLinks.pop(0), callback=self.pageParser)
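
Regarding the database suggestion above: a minimal sketch of what such an item pipeline could look like, using sqlite3 from the standard library. The database file name, table name, and the item's url field are assumptions, and the pipeline class would still have to be enabled through the ITEM_PIPELINES setting:

import sqlite3


class SQLitePipeline(object):
    """Hypothetical pipeline that stores each crawled URL in a SQLite database."""

    def open_spider(self, spider):
        # crawled.db is an illustrative file name, not from the original answer
        self.conn = sqlite3.connect("crawled.db")
        self.conn.execute("CREATE TABLE IF NOT EXISTS links (url TEXT PRIMARY KEY)")

    def process_item(self, item, spider):
        # INSERT OR IGNORE keeps re-crawled URLs from violating the PRIMARY KEY
        self.conn.execute("INSERT OR IGNORE INTO links (url) VALUES (?)", (item["url"],))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

For this to do anything, the spider would have to yield items carrying a url field instead of writing to test.t directly.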
