Cause of slow Scrapy scraper


Question

I have created a new Scrapy spider that is extremely slow. It only scrapes around two pages per second, whereas the other Scrapy crawlers that I have created have been crawling a lot faster.

I was wondering what could cause this issue, and how to possibly fix it. The code is not very different from the other spiders and I am not sure if it is related to the issue, but I'll add it if you think it may be involved.

In fact, I have the impression that the requests are not asynchronous. I have never run into this kind of problem, and I am fairly new to Scrapy.

Edit

Here is the spider:

import scrapy

# Item is the project's scrapy.Item subclass (its import is omitted in the original snippet)


class DatamineSpider(scrapy.Spider):
    name = "Datamine"
    allowed_domains = ["domain.com"]
    start_urls = (
        'http://www.example.com/en/search/results/smth/smth/r101/m2108m',
    )

    def parse(self, response):
        # Follow every listing link found on the index page
        for href in response.css('.searchListing_details .search_listing_title .searchListing_title a::attr("href")'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_stuff)
        # Follow the pagination link (raises IndexError on the last page)
        next_page = response.css('.pagination .next a::attr("href")')
        next_url = response.urljoin(next_page.extract()[0])
        yield scrapy.Request(next_url, callback=self.parse)

    def parse_stuff(self, response):
        item = Item()
        item['value'] = float(response.xpath('//*[text()="Price" and not(@class)]/../../div[2]/span/text()').extract()[0].split(' ')[1].replace(',', ''))
        item['size'] = float(response.xpath('//*[text()="Area" and not(@class)]/../../div[2]/text()').extract()[0].split(' ')[0].replace(',', '.'))
        try:
            item['yep'] = float(response.xpath('//*[text()="yep" and not(@class)]/../../div[2]/text()').extract()[0])
        except IndexError:
            print("NO YEP")
        else:
            yield item

Answer

There are only two potential reasons, given that your spiders indicate that you're quite careful/experienced.

  1. Your target site's response time is very high, i.e. the server responds slowly (a quick way to check this is sketched right after this list), or
  2. Every index page contains only 1-2 links to listing pages (the ones that you parse with parse_stuff()).
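A quick way to check which of the two applies is to log, for each index page, the download latency that Scrapy stores in response.meta['download_latency'] together with the number of listing links your selector actually finds. A minimal sketch, dropped into your existing parse() (the selector is the one from your spider; everything else is standard Scrapy):

def parse(self, response):
    # download_latency is filled in by Scrapy's downloader for every fetched response
    latency = response.meta.get('download_latency', 0.0)
    hrefs = response.css('.searchListing_details .search_listing_title .searchListing_title a::attr("href")')
    self.logger.info("index page %s: latency=%.2fs, listing links found=%d",
                     response.url, latency, len(hrefs))
    # ... then yield the listing and pagination requests exactly as before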

Highly likely the latter is the reason. It's reasonable for the site to have a response time of around half a second. This means that by following the pagination (next) link, you will effectively be crawling 2 index pages per second. Since you're browsing - I guess - a single domain, your maximum concurrency will be ~ min(CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN). This is typically 8 for the default settings. But you won't be able to utilise this concurrency because you don't create listing URLs fast enough. If your .searchListing_details .search_listing_title .searchListing_title a::attr("href") expression creates just a single URL, the rate at which you create listing URLs is just 2/second, whereas to fully utilise your downloader at a concurrency level of 8 you should be creating at least 7 URLs per index page.
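To make the arithmetic explicit, here is the same reasoning as a back-of-the-envelope calculation (the 0.5 s response time and the default concurrency are the assumptions used in the paragraph above):

# Rough throughput model; all numbers are illustrative assumptions
RESPONSE_TIME = 0.5                      # seconds per page (assumed)
CONCURRENCY = 8                          # min(CONCURRENT_REQUESTS=16, CONCURRENT_REQUESTS_PER_DOMAIN=8)

index_pages_per_sec = 1 / RESPONSE_TIME             # pagination is followed serially -> ~2 index pages/s
downloader_capacity = CONCURRENCY / RESPONSE_TIME   # ~16 pages/s the downloader could sustain in total

# One concurrency slot is permanently busy walking the pagination chain, so the remaining
# 7 slots must be fed with listing URLs: you need ~7 listing links per index page to keep
# the downloader saturated.
listing_links_needed_per_index_page = CONCURRENCY - 1   # ~7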

The only good solution is to "shard" the index and start crawling e.g. multiple categories in parallel by setting many non-overlapping start_urls. E.g. you might want to crawl TVs, washing machines, stereos or whatever other categories in parallel, as in the sketch below. If you have 4 such categories and Scrapy "clicks" their 'next' button 2 times a second for each one of them, you will be creating 8 index pages/second and, roughly speaking, you would utilise your downloader much better.
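A minimal sketch of that idea; the category URLs below are purely hypothetical placeholders for whatever non-overlapping categories the site actually exposes:

start_urls = (
    # one entry point per non-overlapping category (hypothetical URLs)
    'http://www.example.com/en/search/results/tvs/r101/m2108m',
    'http://www.example.com/en/search/results/washing-machines/r101/m2108m',
    'http://www.example.com/en/search/results/stereos/r101/m2108m',
    'http://www.example.com/en/search/results/fridges/r101/m2108m',
)
# Each category's pagination chain then advances in parallel, so the spider produces
# roughly 4x more index pages (and therefore listing URLs) per second.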

P.S. next_page.extract()[0] == next_page.extract_first()
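The practical difference is that extract_first() returns None (or a default you pass) instead of raising IndexError when nothing matches, which is exactly what happens on the last index page. A small sketch of the safer pagination step:

next_url = response.css('.pagination .next a::attr("href")').extract_first()
if next_url is not None:      # None on the last page, instead of an IndexError
    yield scrapy.Request(response.urljoin(next_url), callback=self.parse)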

Update after discussing this offline: Yes... I don't see anything extra weird about this website apart from the fact that it's slow (either due to throttling or due to their server capacity). Some specific tricks to go faster: hit the indices 4x as fast by setting 4 start_urls instead of 1.

start_urls = (
    'http://www.domain.com/en/search/results/smth/sale/r176/m3685m',
    'http://www.domain.com/en/search/results/smth/smth/r176/m3685m/offset_200',
    'http://www.domain.com/en/search/results/smth/smth/r176/m3685m/offset_400',
    'http://www.domain.com/en/search/results/smth/smth/r176/m3685m/offset_600'
)

Then use higher concurrency to allow more URLs to be loaded in parallel. Essentially "deactivate" CONCURRENT_REQUESTS_PER_DOMAIN by setting it to a large value, e.g. 1000, and then tune your concurrency by setting CONCURRENT_REQUESTS to 30. By default your concurrency is limited by CONCURRENT_REQUESTS_PER_DOMAIN to 8, which, in your case where the response time for listing pages is >1.2 sec, means a maximum crawling speed of about 6 listing pages per second. So call your spider like this:

scrapy crawl MySpider -s CONCURRENT_REQUESTS_PER_DOMAIN=1000 -s CONCURRENT_REQUESTS=30

It should do much better.
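If you prefer not to pass these on the command line every run, the same values can be pinned inside the spider via the standard custom_settings attribute (the numbers are simply the ones suggested above):

class DatamineSpider(scrapy.Spider):
    name = "Datamine"
    custom_settings = {
        'CONCURRENT_REQUESTS': 30,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1000,  # effectively removes the per-domain cap
    }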

One more thing. I observe from your target site that you can get all the information you need, including Price, Area and yep, from the index pages themselves, without having to "hit" any listing pages. This would instantly 10x your crawling speed since you don't need to download all these listing pages with the for href... loop. Just parse the listings from the index page.
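A hedged sketch of what that could look like; the row selector and the per-field selectors below are assumptions about the index page's markup and would need to be adapted to the real HTML:

def parse(self, response):
    # Parse each listing directly from the index page (selectors are hypothetical)
    for row in response.css('.searchListing_details'):
        item = Item()
        item['value'] = row.css('.price::text').extract_first()
        item['size'] = row.css('.area::text').extract_first()
        item['yep'] = row.css('.yep::text').extract_first()
        yield item
    # Still follow the pagination
    next_url = response.css('.pagination .next a::attr("href")').extract_first()
    if next_url is not None:
        yield scrapy.Request(response.urljoin(next_url), callback=self.parse)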
