Scrapy: Dynamically generate rules for each start_url


Problem Description

I have created a spider that is supposed to crawl multiple websites, and I need to define different rules for each URL in the start_urls list.

start_urls = [
    "http://URL1.com/foo",
    "http://URL2.com/bar"
]

rules = [
    Rule(LinkExtractor(restrict_xpaths=("//" + xpathString + "/a")), callback="parse_object", follow=True)
]

The only thing that needs to change per site is the xpath string passed to restrict_xpaths. I've already written a function that can dynamically derive the xpath I want from any website. I figured I could take the current URL the spider is about to scrape, pass it through that function, and then pass the resulting xpath to the rule.

Unfortunately, from what I've found, this doesn't seem possible, since Scrapy uses a scheduler and compiles all the start_urls and rules right at the start. Is there any workaround to achieve what I'm trying to do?

Recommended Answer

I assume you are using CrawlSpider. By default, CrawlSpider rules are applied to every page your spider crawls, whatever the domain.

If you are crawling multiple domains from your start URLs and want different rules for each domain, you won't be able to tell Scrapy which rule(s) to apply to which domain; that isn't available out of the box.

One workaround is to run your spider with one start URL at a time (with domain-specific rules built dynamically at init time), and to run multiple spiders in parallel. A sketch of this approach follows.
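Here is a minimal sketch of that idea, assuming Scrapy's CrawlerProcess to schedule several crawls in one process. The spider name, the per-site xpath strings, and the parse_object callback are hypothetical placeholders, not anything from the original question beyond the URLs:

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SingleSiteSpider(CrawlSpider):
    name = "single_site"

    def __init__(self, start_url=None, xpath_string=None, *args, **kwargs):
        # Build start_urls and rules per instance, *before* calling
        # super().__init__(), because CrawlSpider.__init__ compiles self.rules.
        self.start_urls = [start_url]
        self.rules = (
            Rule(
                LinkExtractor(restrict_xpaths="//" + xpath_string + "/a"),
                callback="parse_object",
                follow=True,
            ),
        )
        super().__init__(*args, **kwargs)

    def parse_object(self, response):
        # Placeholder callback: just record the URL that was followed.
        yield {"url": response.url}


if __name__ == "__main__":
    # One crawl per start URL, all scheduled concurrently in one process.
    # The xpath strings here are made-up examples.
    process = CrawlerProcess()
    for url, xpath in [
        ("http://URL1.com/foo", "div[@id='content']"),
        ("http://URL2.com/bar", "ul[@class='links']"),
    ]:
        process.crawl(SingleSiteSpider, start_url=url, xpath_string=xpath)
    process.start()

The ordering matters: CrawlSpider.__init__ compiles self.rules, so the dynamically built rules must already be assigned when super().__init__() runs.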

Another option is to subclass CrawlSpider and customize it for your needs:

  • Build rules as a dict keyed by domain, with each value being the list of rules to apply for that domain. See the _compile_rules method.
  • Apply different rules depending on the domain of the response. See _requests_to_follow (a sketch of both overrides follows this list).
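Below is a rough sketch of such a subclass. It leans on Scrapy 2.x private internals (Rule._compile, _build_request, and the _requests_to_follow contract), which can change between versions, so treat it as a starting point rather than a finished implementation; the domains, xpaths, and parse_object callback are again made-up examples:

import copy
from urllib.parse import urlparse

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PerDomainCrawlSpider(CrawlSpider):
    name = "per_domain"
    start_urls = ["http://URL1.com/foo", "http://URL2.com/bar"]

    # Rules as a dict: domain -> list of rules for that domain.
    rules_by_domain = {
        "URL1.com": [
            Rule(LinkExtractor(restrict_xpaths="//div[@id='content']/a"),
                 callback="parse_object", follow=True),
        ],
        "URL2.com": [
            Rule(LinkExtractor(restrict_xpaths="//ul[@class='links']/a"),
                 callback="parse_object", follow=True),
        ],
    }

    def _compile_rules(self):
        # Mirror CrawlSpider._compile_rules, but keep the per-domain mapping.
        self._rules_by_domain = {}
        for domain, rules in self.rules_by_domain.items():
            compiled = []
            for rule in rules:
                rule = copy.copy(rule)
                rule._compile(self)  # resolves callback names to methods
                compiled.append(rule)
            self._rules_by_domain[domain] = compiled
        # CrawlSpider stores a rule *index* in request.meta and looks the
        # rule up in self._rules later, so keep a flat list as well.
        self._rules = [r for rules in self._rules_by_domain.values() for r in rules]

    def _requests_to_follow(self, response):
        # Apply only the rules registered for this response's domain.
        if not isinstance(response, HtmlResponse):
            return
        domain = urlparse(response.url).netloc
        seen = set()
        for rule in self._rules_by_domain.get(domain, []):
            rule_index = self._rules.index(rule)
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def parse_object(self, response):
        yield {"url": response.url}

Note that urlparse(...).netloc includes any www. prefix or port, so the dict keys must match the response's netloc exactly; a real implementation would likely normalize the domain before the lookup.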

