Speed up web scraper


Question

I am scraping 23770 webpages with a pretty simple web scraper using Scrapy. I am quite new to Scrapy and even Python, but managed to write a spider that does the job. It is, however, really slow (it takes approx. 28 hours to crawl the 23770 pages).

I have looked at the Scrapy website, the mailing lists and Stack Overflow, but I can't seem to find generic recommendations for writing fast crawlers that are understandable for beginners. Maybe my problem is not the spider itself, but the way I run it. All suggestions welcome!

I have listed my code below, if it's needed.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re

class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()

class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["http://boliga.dk/"]
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' %n for n in xrange(1, 23770, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("id('searchresult')/tr")  # one <tr> per sale in the result table
        items = []      
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            # The sale-type cell is wrapped in newlines/tabs; pull out the inner text
            Temp = site.select("td[4]/text()").extract()
            Temp = Temp[0]
            m = re.search('\r\n\t\t\t\t\t(.+?)\r\n\t\t\t\t', Temp)
            if m:
                found = m.group(1)
                item['SalgsType'] = found
            else:
                item['SalgsType'] = Temp
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items

Thanks!

Answer

Here's a collection of things to try:

  • use the latest Scrapy version (if not using it already)
  • check whether non-standard middlewares are being used
  • try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs) - see the settings sketch after this list
  • turn off logging with LOG_ENABLED = False (docs) - also covered in the settings sketch
  • try yielding each item inside the loop instead of collecting items into the items list and returning them all at once (see the second sketch after this list)
  • use a local caching DNS (see this thread)
  • check whether the site uses a download threshold and limits your download speed (see this thread)
  • log CPU and memory usage during the spider run - see if there are any problems there
  • try running the same spider under the scrapyd service
  • see if grequests + lxml would perform better (ask if you need any help implementing this solution) - a rough sketch follows below
  • try running Scrapy on PyPy, see Running Scrapy on PyPy
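
For the concurrency and logging points, here is a minimal sketch of what the relevant lines in the project's settings.py could look like. The concrete values are illustrative assumptions, not figures from the answer, and would need to be tuned against what boliga.dk tolerates:

# settings.py - illustrative values only, tune them for the target site
CONCURRENT_REQUESTS = 32              # total parallel requests (Scrapy's default is 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16   # parallel requests per domain (default is 8)
LOG_ENABLED = False                   # skip per-request log output entirely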
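
The yield-instead-of-collect point only changes the shape of parse(); the field extraction stays exactly as in the question. A sketch, assuming the same old BaseSpider/HtmlXPathSelector API used above:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select("id('searchresult')/tr"):
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            # ... extract the remaining fields exactly as in the original parse() ...
            yield item  # hand each item to the item pipeline as soon as it is ready

This does not speed up downloads by itself, but it keeps per-page memory flat and lets the engine process items as they are produced instead of waiting for the whole list.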
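
If you want to experiment with the grequests + lxml suggestion, here is a rough, self-contained sketch. grequests is a separate install (it builds on gevent and requests), this drops all of Scrapy's item/pipeline machinery, and the pool size of 20, the shortened search URL (the original with its empty query parameters removed), and the table XPath are assumptions you would need to verify against the real page:

import grequests
from lxml import html

# Original search URL with the empty query parameters dropped (assumption)
BASE = ('http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed'
        '&type=R%C3%A6kkehus&minsaledate=1992&maxsaledate=today&p={}')

urls = [BASE.format(n) for n in range(1, 23770)]
pending = (grequests.get(u) for u in urls)

# Run up to 20 requests concurrently; map() returns the responses once they are done.
for response in grequests.map(pending, size=20):
    if response is None:  # request failed
        continue
    tree = html.fromstring(response.content)
    for row in tree.xpath("//table[@id='searchresult']//tr"):
        cells = [cell.text_content().strip() for cell in row.xpath("td")]
        print(cells)  # replace with whatever storage you need

For 23770 pages you would probably want to process the URL list in batches rather than mapping everything at once.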

Hope that helps.
