Scrapy: Dynamically generate rules for each start_url


Problem Description

I have created a spider that is supposed to crawl multiple websites, and I need to define different rules for each URL in the start_urls list.

start_urls = [
    "http://URL1.com/foo",
    "http://URL2.com/bar"
]

rules = [
    Rule(LinkExtractor(restrict_xpaths=("//" + xpathString + "/a")), callback="parse_object", follow=True)
]

The only thing that needs to change per site is the xpath string passed to restrict_xpaths. I've already written a function that can dynamically derive the xpath I want from any website. I figured I could take the current URL the spider is about to scrape, pass it through that function, and then pass the resulting xpath to the rule.

Unfortunately, from what I've found, this doesn't seem possible, since Scrapy uses a scheduler and compiles all the start_urls and rules right at the start. Is there any workaround to achieve what I'm trying to do?

Recommended Answer

I assume you are using CrawlSpider. By default, CrawlSpider rules are applied to every page your spider crawls, whatever the domain.

If you are crawling multiple domains from your start URLs and want different rules for each domain, you won't be able to tell Scrapy which rule(s) to apply to which domain; that isn't available out of the box.

One workaround is to run your spider with one start URL at a time (with domain-specific rules built dynamically at init time), and to run multiple spiders in parallel. A sketch of this approach follows.
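Here is a minimal sketch of that idea, assuming Scrapy's CrawlerProcess to schedule several crawls in one process. The spider name, the per-site xpath strings, and the parse_object callback are hypothetical placeholders, not anything from the original question beyond the URLs:

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SingleSiteSpider(CrawlSpider):
    name = "single_site"

    def __init__(self, start_url=None, xpath_string=None, *args, **kwargs):
        # Build start_urls and rules per instance, *before* calling
        # super().__init__(), because CrawlSpider.__init__ compiles self.rules.
        self.start_urls = [start_url]
        self.rules = (
            Rule(
                LinkExtractor(restrict_xpaths="//" + xpath_string + "/a"),
                callback="parse_object",
                follow=True,
            ),
        )
        super().__init__(*args, **kwargs)

    def parse_object(self, response):
        # Placeholder callback: just record the URL that was followed.
        yield {"url": response.url}


if __name__ == "__main__":
    # One crawl per start URL, all scheduled concurrently in one process.
    # The xpath strings here are made-up examples.
    process = CrawlerProcess()
    for url, xpath in [
        ("http://URL1.com/foo", "div[@id='content']"),
        ("http://URL2.com/bar", "ul[@class='links']"),
    ]:
        process.crawl(SingleSiteSpider, start_url=url, xpath_string=xpath)
    process.start()

The ordering matters: CrawlSpider.__init__ compiles self.rules, so the dynamically built rules must already be assigned when super().__init__() runs.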

Another option is to subclass CrawlSpider and customize it for your needs:

  • Build rules as a dict keyed by domain, with each value being the list of rules to apply for that domain. See the _compile_rules method.
  • Apply different rules depending on the domain of the response. See _requests_to_follow (a sketch of both overrides follows this list).
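Below is a rough sketch of such a subclass. It leans on Scrapy 2.x private internals (Rule._compile, _build_request, and the _requests_to_follow contract), which can change between versions, so treat it as a starting point rather than a finished implementation; the domains, xpaths, and parse_object callback are again made-up examples:

import copy
from urllib.parse import urlparse

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PerDomainCrawlSpider(CrawlSpider):
    name = "per_domain"
    start_urls = ["http://URL1.com/foo", "http://URL2.com/bar"]

    # Rules as a dict: domain -> list of rules for that domain.
    rules_by_domain = {
        "URL1.com": [
            Rule(LinkExtractor(restrict_xpaths="//div[@id='content']/a"),
                 callback="parse_object", follow=True),
        ],
        "URL2.com": [
            Rule(LinkExtractor(restrict_xpaths="//ul[@class='links']/a"),
                 callback="parse_object", follow=True),
        ],
    }

    def _compile_rules(self):
        # Mirror CrawlSpider._compile_rules, but keep the per-domain mapping.
        self._rules_by_domain = {}
        for domain, rules in self.rules_by_domain.items():
            compiled = []
            for rule in rules:
                rule = copy.copy(rule)
                rule._compile(self)  # resolves callback names to methods
                compiled.append(rule)
            self._rules_by_domain[domain] = compiled
        # CrawlSpider stores a rule *index* in request.meta and looks the
        # rule up in self._rules later, so keep a flat list as well.
        self._rules = [r for rules in self._rules_by_domain.values() for r in rules]

    def _requests_to_follow(self, response):
        # Apply only the rules registered for this response's domain.
        if not isinstance(response, HtmlResponse):
            return
        domain = urlparse(response.url).netloc
        seen = set()
        for rule in self._rules_by_domain.get(domain, []):
            rule_index = self._rules.index(rule)
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def parse_object(self, response):
        yield {"url": response.url}

Note that urlparse(...).netloc includes any www. prefix or port, so the dict keys must match the response's netloc exactly; a real implementation would likely normalize the domain before the lookup.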

