Scrapy spider: difference between Crawled pages and Scraped items


Question

I'm writing a Scrapy CrawlSpider that reads a list of ads on the first page, takes some info such as the listing thumbnails and the ad URLs, and then yields a request for each of those ad URLs to scrape their details.

It was working and paginating apparently fine in a test environment, but today, trying a complete run, I noticed this in the log:

Crawled 3852 pages (at 228 pages/min), scraped 256 items (at 15 items/min)

I don't understand the reason for this big difference between crawled pages and scraped items. Can anybody help me figure out where those items are getting lost?

My spider code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

# myItem and myItemLoader are assumed to be defined in the project's items module
from myproject.items import myItem, myItemLoader


class MySpider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["myspider.com", "myspider.co"]
    start_urls = [
        "http://www.myspider.com/offers/myCity/typeOfAd/?search=fast",
    ]

    # Pagination
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_start_url', follow=True),
    )

    # 1st page: build one item per listing, then follow each ad URL for its details
    def parse_start_url(self, response):

        hxs = HtmlXPathSelector(response)

        next_page = hxs.select("//a[@class='pagNext']/@href").extract()
        offers = hxs.select("//div[@class='hlist']")

        for offer in offers:
            item = myItem()

            item['url'] = offer.select('.//span[@class="location"]/a/@href').extract()[0]
            item['thumb'] = offer.select('.//div[@class="itemFoto"]/div/a/img/@src').extract()[0]

            request = Request(item['url'], callback=self.second_page)
            request.meta['item'] = item

            yield request

        if next_page:
            yield Request(next_page[0], callback=self.parse_start_url)

    # Detail page: fill in the remaining fields and emit the item
    def second_page(self, response):
        item = response.meta['item']

        loader = myItemLoader(item=item, response=response)

        loader.add_xpath('address', '//span[@itemprop="streetAddress"]/text()')

        return loader.load_item()

Answer

Let's say you go to your first start_urls (you actually only have one) and on that page there is only a single anchor link (<a>). So your spider crawls the href URL in that link, and you get control in your callback, parse_start_url. Inside that page you have 5000 divs with an hlist class. And let's suppose all 5000 of those subsequent URLs returned 404, not found.

In this case you would have:

  • Crawled pages: 5001
  • Scraped items: 0
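If that first case is what is happening, one way to confirm it is to let non-200 detail responses reach your callback instead of being silently dropped by the HttpError middleware. A minimal sketch, meant to go inside the spider class from the question (handle_httpstatus_list is a standard spider attribute and self.log a standard spider method; the rest of the method mirrors the question's second_page):

    # Hedged sketch: surface 404 detail pages in the log instead of losing them.
    # handle_httpstatus_list tells the HttpError middleware to pass these
    # statuses through to the spider callbacks.
    handle_httpstatus_list = [404]

    def second_page(self, response):
        if response.status != 200:
            self.log("Dropped detail page %s (HTTP %d)" % (response.url, response.status))
            return

        item = response.meta['item']
        loader = myItemLoader(item=item, response=response)
        loader.add_xpath('address', '//span[@itemprop="streetAddress"]/text()')
        return loader.load_item()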

Let's take another example: on your start URL page you have 5000 anchors, but none (as in zero) of those pages have any divs with a class attribute of hlist.

In this case, again, you would have:

  • Crawled pages: 5001
  • Scraped items: 0
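That second case is easy to detect from inside parse_start_url itself. A small, hedged addition using the question's own selector, to place right after offers is extracted, so listing pages that contribute zero items show up explicitly in the log:

        # Hedged addition for parse_start_url: if a crawled page contributes
        # no items, log it so the gap between pages and items is explained.
        offers = hxs.select("//div[@class='hlist']")
        if not offers:
            self.log("No div.hlist offers found on %s" % response.url)
            return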

Your answer lies in the DEBUG log output.
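To see which of the two cases you are in, run the crawl with DEBUG logging written to a file (for example via the LOG_LEVEL and LOG_FILE settings) and count the relevant lines. A rough, hypothetical helper for that; the function name, the run.log path, and the string patterns (based on Scrapy's usual "DEBUG: Crawled (status)" and "DEBUG: Scraped from" messages) are assumptions, not part of the answer:

# Hypothetical helper: summarize a Scrapy DEBUG log to compare
# crawled responses (and 404s among them) against scraped items.
def summarize_log(path="run.log"):
    crawled = crawled_404 = scraped = 0
    with open(path) as f:
        for line in f:
            if "DEBUG: Crawled (" in line:
                crawled += 1
                if "Crawled (404)" in line:
                    crawled_404 += 1
            elif "DEBUG: Scraped from" in line:
                scraped += 1
    print("crawled: %d (404s: %d), scraped items: %d" % (crawled, crawled_404, scraped))

summarize_log()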

