Scrapy Crawler only pulls 19 of 680+ urls


Problem description

I'm trying to scrape this page: https://coinmarketcap.com/currencies/views/all/

In td[2] of every row there is a link. I am trying to get Scrapy to go to each link in that td and scrape the page that link represents. Below is my code:

NOTE: another person has been great in helping me get this far.

class ToScrapeSpiderXPath(CrawlSpider):
    name = 'coinmarketcap'
    start_urls = [
        'https://coinmarketcap.com/currencies/views/all/'
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//td[2]/a',)), callback="parse", follow=True),
    )

    def parse(self, response):
        BTC = BTCItem()
        BTC['source'] = str(response.request.url).split("/")[2]
        BTC['asset'] = str(response.request.url).split("/")[4],
        BTC['asset_price'] = response.xpath('//*[@id="quote_price"]/text()').extract(),
        BTC['asset_price_change'] = response.xpath(
            '/html/body/div[2]/div/div[1]/div[3]/div[2]/span[2]/text()').extract(),
        BTC['BTC_price'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[1]/text()').extract(),
        BTC['Prct_change'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[2]/text()').extract()
        yield (BTC)

Even though the table exceeds 600+ links/pages, when I run scrapy crawl coinmarketcap I only get 19 records, which means only 19 pages out of this list of 600+. I'm failing to see what is stopping the scrape. Any help would be greatly appreciated.
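
(For reference, the number of per-currency links the table actually contains can be checked from the Scrapy shell; a minimal sketch, assuming the //td[2]/a layout described above:)

$ scrapy shell 'https://coinmarketcap.com/currencies/views/all/'
>>> # count the links found in the second column of the table
>>> len(response.xpath('//td[2]/a/@href').extract())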

Thanks

Recommended answer

Your spider goes too deep: with that rule it also finds and follows links inside the individual coin pages. You can roughly fix the problem by adding DEPTH_LIMIT = 1, but you can surely find a more elegant solution. Here is the code that works for me (there are other minor adjustments too):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# adjust this import to wherever BTCItem is defined in your project
from myproject.items import BTCItem


class ToScrapeSpiderXPath(CrawlSpider):
    name = 'coinmarketcap'
    start_urls = [
        'https://coinmarketcap.com/currencies/views/all/'
    ]
    custom_settings = {
        'DEPTH_LIMIT': '1',
    }

    rules = (
        # follow only the links found in the second column of the table
        Rule(LinkExtractor(restrict_xpaths=('//td[2]',)), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        BTC = BTCItem()
        BTC['source'] = str(response.request.url).split("/")[2]
        BTC['asset'] = str(response.request.url).split("/")[4]
        BTC['asset_price'] = response.xpath('//*[@id="quote_price"]/text()').extract()
        BTC['asset_price_change'] = response.xpath(
            '/html/body/div[2]/div/div[1]/div[3]/div[2]/span[2]/text()').extract()
        BTC['BTC_price'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[1]/text()').extract()
        BTC['Prct_change'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[2]/text()').extract()
        yield BTC
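
Note that the callback is named parse_item rather than parse: the Scrapy documentation warns against using parse as a rule callback on a CrawlSpider, because CrawlSpider uses that method internally to apply its rules. For completeness, here is a minimal sketch of what the BTCItem class referenced above could look like; the field names are taken from the spider, but the actual definition in the asker's project may differ:

import scrapy

class BTCItem(scrapy.Item):
    # one field per value stored by parse_item above
    source = scrapy.Field()
    asset = scrapy.Field()
    asset_price = scrapy.Field()
    asset_price_change = scrapy.Field()
    BTC_price = scrapy.Field()
    Prct_change = scrapy.Field()

The spider can then be run with scrapy crawl coinmarketcap -o coins.json to write the collected items to a file.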

