scrapy allow all subdomains
Question
I want to use Scrapy to crawl a website whose pages are divided across many subdomains. I know I need a CrawlSpider with a Rule, but I need the Rule to simply "allow all subdomains and let the parsers handle themselves according to the data" (meaning, in the example below, the item_links point to different subdomains).
Code example:
from scrapy.http import Request          # imports needed by this fragment
from scrapy.selector import Selector

def parse_page(self, response):
    sel = Selector(response)
    # collect the item links; the XPath is elided in the original question
    item_links = sel.xpath("XXXXXXXXX").extract()
    for item_link in item_links:
        item_request = Request(url=item_link,
                               callback=self.parse_item)
        yield item_request

def parse_item(self, response):
    sel = Selector(response)
** EDIT **
Just to make the question clear, I want the ability to crawl all of *.example.com -> meaning not to get "Filtered offsite request to 'foo.example.com'".
** ANOTHER EDIT **
Following @agstudy's answer, make sure you don't forget to delete allowed_domains = ["www.example.com"] from the spider.
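(Side note, not from the original post: an alternative to deleting allowed_domains entirely is to list only the bare registered domain, since Scrapy's offsite filtering also accepts subdomains of the listed domains. A minimal sketch, assuming the site is example.com; the spider name and start URL are placeholders.)

from scrapy.spiders import CrawlSpider

class ExampleSpider(CrawlSpider):
    name = "example"                              # placeholder spider name
    # Listing the bare registered domain (rather than "www.example.com") lets
    # the offsite middleware accept requests to any *.example.com subdomain.
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]      # placeholder start URL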
Answer
You can set an allow_domains list for the rule:
rules = (
    Rule(SgmlLinkExtractor(allow_domains=('domain1', 'domain2'))),
)
For example:
rules = (
    Rule(SgmlLinkExtractor(allow_domains=('example.com', 'example1.com'))),
)
This will allow URLs such as:
www.example.com/blaa/bla/
www.example1.com/blaa/bla/
www.something.example.com/blaa/bla/
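(For context: SgmlLinkExtractor comes from older Scrapy releases and was later deprecated in favour of LinkExtractor. Below is a minimal sketch of how such a rule might sit inside a CrawlSpider on a current Scrapy version, assuming the target site is example.com; the spider name, start URL, and extracted field are placeholders rather than part of the answer.)

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SubdomainSpider(CrawlSpider):
    name = "subdomains"                           # placeholder spider name
    start_urls = ["http://www.example.com/"]      # placeholder start URL
    # No allowed_domains here, so the offsite middleware filters nothing;
    # the link extractor's allow_domains does the restricting instead.
    rules = (
        # allow_domains=('example.com',) matches example.com and every
        # subdomain such as www.example.com or foo.example.com.
        Rule(LinkExtractor(allow_domains=('example.com',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # placeholder: extract whatever data the page actually holds
        yield {"url": response.url}

Here the offsite filtering is left to the link extractor's allow_domains, which matches the edits above about removing allowed_domains = ["www.example.com"].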