scrapy allow all subdomains

Problem Description

I want to use Scrapy to crawl a website whose pages are spread across a lot of subdomains. I know I need a CrawlSpider with a Rule, but I need the Rule to simply allow all subdomains and let the parsers handle the responses according to the data (meaning, in the example below, the item_links point to different subdomains).

Code example:

from scrapy.http import Request
from scrapy.selector import Selector

# These callbacks sit inside the CrawlSpider subclass.
def parse_page(self, response):
    sel = Selector(response)
    # "XXXXXXXXX" is the placeholder XPath from the original question
    item_links = sel.xpath("XXXXXXXXX").extract()
    for item_link in item_links:
        item_request = Request(url=item_link,
                               callback=self.parse_item)
        yield item_request

def parse_item(self, response):
    sel = Selector(response)

** EDIT ** Just to make the question clear: I want the ability to crawl all of *.example.com, meaning I do not want to get Filtered offsite request to 'foo.example.com'.

** ANOTHER EDIT ** Following @agstudy's answer, make sure you don't forget to delete allowed_domains = ["www.example.com"], otherwise the offsite filter keeps rejecting requests to the other subdomains.
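
To illustrate that point, here is a minimal sketch of the spider header with that line removed. The class name and start URL are placeholders, not taken from the original question, and the import path matches the pre-1.0 Scrapy API that SgmlLinkExtractor belongs to:

from scrapy.contrib.spiders import CrawlSpider

class ExampleSpider(CrawlSpider):
    name = "example"
    # allowed_domains = ["www.example.com"]  # deleted: with it in place, requests
    #                                        # to foo.example.com are filtered as offsite
    start_urls = ["http://www.example.com/"]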

Recommended Answer

You can set an allow_domains list for the rule:

rules = (
    Rule(SgmlLinkExtractor(allow_domains=('domain1', 'domain2'))),
)

For example:

rules = (
    Rule(SgmlLinkExtractor(allow_domains=('example.com', 'example1.com'))),
)

This will allow URLs such as:

www.example.com/blaa/bla/
www.example1.com/blaa/bla/
www.something.example.com/blaa/bla/
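
Putting the question and the answer together, a complete spider might look roughly like the sketch below. It is only an illustration: the spider name, start URL and XPath are placeholders, the callbacks are the ones from the question, and the imports use the same pre-1.0 Scrapy API as SgmlLinkExtractor (in current Scrapy, LinkExtractor from scrapy.linkextractors fills the same role).

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import Selector

class ExampleSpider(CrawlSpider):
    name = "example"
    # no allowed_domains attribute; the rule's allow_domains handles the filtering
    start_urls = ["http://www.example.com/"]

    rules = (
        # follow links anywhere under example.com, including every subdomain
        Rule(SgmlLinkExtractor(allow_domains=("example.com",)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        sel = Selector(response)
        item_links = sel.xpath("XXXXXXXXX").extract()  # placeholder XPath
        for item_link in item_links:
            yield Request(url=item_link, callback=self.parse_item)

    def parse_item(self, response):
        sel = Selector(response)
        # build and yield the item here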
