Scrapy crawl all sitemap links


Question

I want to crawl all the links present in the sitemap.xml of a fixed site. I've come across Scrapy's SitemapSpider. So far I've extracted all the URLs in the sitemap. Now I want to crawl through each link of the sitemap. Any help would be highly useful. The code so far is:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        print(response.url)

Answer

You need to add sitemap_rules to process the data in the crawled URLs, and you can create as many rules as you want. For instance, say you have a page named http://www.xyz.nl//x/ for which you want to create a rule:

from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    name = 'xyz'
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # list of (regex, callback-name) tuples - this example matches one page pattern
    sitemap_rules = [('/x/', 'parse_x')]

    def parse_x(self, response):
        # extract the text of every <p> element on the matched page
        paragraphs = response.xpath('//p').extract()
        yield {'paragraphs': paragraphs}
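
For context, here is a minimal sketch of how multiple rules fit together, assuming a recent Scrapy (2.x); the spider name xyz_items and the parse_other callback are made up for illustration, not part of the original answer. Note that once you define custom sitemap_rules, only URLs matching at least one rule are requested, and the first matching rule wins, so put the most specific pattern first.

from scrapy.spiders import SitemapSpider

class XyzSpider(SitemapSpider):
    # hypothetical spider name for this sketch
    name = 'xyz_items'
    allowed_domains = ['xyz.nl']
    sitemap_urls = ['http://www.xyz.nl/sitemap.xml']
    # first matching rule wins; '' is a catch-all regex that matches every URL
    sitemap_rules = [
        ('/x/', 'parse_x'),
        ('', 'parse_other'),
    ]

    def parse_x(self, response):
        # pages under /x/ get richer extraction
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').get(),
        }

    def parse_other(self, response):
        # every other sitemap URL just records its address
        yield {'url': response.url}

From inside a Scrapy project you would then run something like scrapy crawl xyz_items -o items.json to crawl every sitemap link and save the yielded items.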
