scrapy allow all subdomains

Problem Description

I want to use Scrapy to crawl a website whose pages are spread across a lot of subdomains. I know I need a CrawlSpider with a Rule, but I need the Rule to simply allow all subdomains and let the parsers handle the responses according to the data (meaning, in the example below, the item_links point to different subdomains).

Code example:

from scrapy.http import Request
from scrapy.selector import Selector

# These callbacks sit inside the CrawlSpider subclass.
def parse_page(self, response):
    sel = Selector(response)
    # "XXXXXXXXX" is the placeholder XPath from the original question
    item_links = sel.xpath("XXXXXXXXX").extract()
    for item_link in item_links:
        item_request = Request(url=item_link,
                               callback=self.parse_item)
        yield item_request

def parse_item(self, response):
    sel = Selector(response)

** EDIT ** Just to make the question clear: I want the ability to crawl all of *.example.com, meaning I do not want to get Filtered offsite request to 'foo.example.com'.

** ANOTHER EDIT ** Following @agstudy's answer, make sure you don't forget to delete allowed_domains = ["www.example.com"], otherwise the offsite filter keeps rejecting requests to the other subdomains.
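
To illustrate that point, here is a minimal sketch of the spider header with that line removed. The class name and start URL are placeholders, not taken from the original question, and the import path matches the pre-1.0 Scrapy API that SgmlLinkExtractor belongs to:

from scrapy.contrib.spiders import CrawlSpider

class ExampleSpider(CrawlSpider):
    name = "example"
    # allowed_domains = ["www.example.com"]  # deleted: with it in place, requests
    #                                        # to foo.example.com are filtered as offsite
    start_urls = ["http://www.example.com/"]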

Recommended Answer

You can set an allow_domains list for the rule:

rules = (
    Rule(SgmlLinkExtractor(allow_domains=('domain1', 'domain2'))),
)

For example:

rules = (
    Rule(SgmlLinkExtractor(allow_domains=('example.com', 'example1.com'))),
)

This will allow URLs such as:

www.example.com/blaa/bla/
www.example1.com/blaa/bla/
www.something.example.com/blaa/bla/
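
Putting the question and the answer together, a complete spider might look roughly like the sketch below. It is only an illustration: the spider name, start URL and XPath are placeholders, the callbacks are the ones from the question, and the imports use the same pre-1.0 Scrapy API as SgmlLinkExtractor (in current Scrapy, LinkExtractor from scrapy.linkextractors fills the same role).

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import Selector

class ExampleSpider(CrawlSpider):
    name = "example"
    # no allowed_domains attribute; the rule's allow_domains handles the filtering
    start_urls = ["http://www.example.com/"]

    rules = (
        # follow links anywhere under example.com, including every subdomain
        Rule(SgmlLinkExtractor(allow_domains=("example.com",)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        sel = Selector(response)
        item_links = sel.xpath("XXXXXXXXX").extract()  # placeholder XPath
        for item_link in item_links:
            yield Request(url=item_link, callback=self.parse_item)

    def parse_item(self, response):
        sel = Selector(response)
        # build and yield the item here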
