How to crawl an entire website with Scrapy?

Question

I'm unable to crawl a whole website; Scrapy just crawls the surface, and I want to crawl deeper. I've been googling for the last 5-6 hours with no help. My code is below:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log

class ExampleSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
    ]
    def parse_item(self,response):
        self.log('A response from %s just arrived!' % response.url)

Answer

Rules short-circuit: the first rule a link satisfies is the one that gets applied, so your second Rule (the one with the callback) will never be called.

Change your rules to:

rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]
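
For reference, here is a minimal sketch of the full corrected spider, assuming Scrapy 1.0 or later, where the scrapy.contrib modules were removed, SgmlLinkExtractor was replaced by scrapy.linkextractors.LinkExtractor, and self.log gave way to self.logger. The name, domain, and parse_item are carried over from the question:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example.com"
    # allowed_domains keeps the crawl on this site instead of the whole web
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    # A single rule that both follows links (follow=True) and hands every
    # matched page to parse_item, so no rule is shadowed by an earlier one.
    rules = [Rule(LinkExtractor(), callback="parse_item", follow=True)]

    def parse_item(self, response):
        # self.logger replaces the old self.log helper
        self.logger.info("A response from %s just arrived!" % response.url)

Running scrapy crawl example.com with this spider should then visit every internal link it discovers, not just the start page.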
