Does using scrapy-splash significantly affect scraping speed?


Question

So far, I have been using plain Scrapy and writing custom classes to handle websites that use AJAX.

If I switch to scrapy-splash, which as I understand it scrapes the HTML rendered after the JavaScript has run, will the speed of my crawler be significantly affected?

How does the time to scrape a vanilla HTML page with Scrapy compare with the time to scrape JavaScript-rendered HTML with scrapy-splash?

And lastly, how do scrapy-splash and Selenium compare?

Recommended answer

It depends on the amount of JavaScript on the page.

Keep in mind that Splash needs some time to render all the JavaScript, and the Python application proceeds without waiting for the rendering to complete. So sometimes even Splash does not return the fully rendered page.

  • You can add an explicit wait for rendering, since it generally needs some time.
  • Adding some wait is also good practice in general.

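For example, either of the following requests, issued from inside a spider method, renders the page through the render.html endpoint with an explicit wait:

import scrapy
from scrapy_splash import SplashRequest

# Option 1: a plain scrapy.Request with the Splash options passed through meta
# ('wait' is the number of seconds Splash waits after the page has loaded)
yield scrapy.Request(url, callback=self.parse,
        meta={'splash': {'args': {'wait': 25}, 'endpoint': 'render.html'}})

# Option 2: the SplashRequest helper, waiting 5 seconds before returning the HTML
yield SplashRequest(url, self.parse, endpoint='render.html',
        args={'wait': 5, 'html': 1})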
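To get a rough feel for what such a wait does to crawl speed, here is a back-of-envelope sketch. The function name and all numbers are hypothetical: it assumes 0.5 s to download a page and 16 requests in flight (Scrapy's default CONCURRENT_REQUESTS), and it ignores the rendering time itself and any concurrency limits on the Splash side, so treat the result as an optimistic estimate rather than a measurement.

```python
def pages_per_second(fetch_seconds, wait_seconds=0.0, concurrency=1):
    """Rough upper bound on throughput when each page costs fetch + wait seconds."""
    per_page = fetch_seconds + wait_seconds
    if per_page <= 0:
        raise ValueError("per-page time must be positive")
    return concurrency / per_page


# Hypothetical numbers: 0.5 s per download, 16 concurrent requests.
plain = pages_per_second(0.5, concurrency=16)
# Same crawl, but every page also waits 5 s in Splash (as in the SplashRequest above).
rendered = pages_per_second(0.5, wait_seconds=5, concurrency=16)

print(f"plain Scrapy:   {plain:.1f} pages/s")
print(f"with 5 s wait:  {rendered:.1f} pages/s")
print(f"slowdown:       {plain / rendered:.0f}x")
```

Under these assumptions the fixed wait dominates the per-page cost, which is why it is worth keeping the wait as short as the page's JavaScript allows.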

Between Scrapy and Selenium

Selenium is only used to automate web browser interaction, whereas Scrapy is a complete web crawling framework: it downloads the HTML, processes the data, and saves it.

For scraping, I would recommend Scrapy. If the problem is JavaScript:

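  • Scrapy already has its own official JavaScript-rendering project, called scrapy-splash.
  • Alternatively, you can create a new Selenium webdriver instance inside a Scrapy spider, do some work, extract the data, and then close it once all the work is done.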
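For the scrapy-splash route, the project also has to be told where Splash is running and which middlewares to enable. The following is a minimal sketch of the relevant part of settings.py, based on the scrapy-splash README. The Splash address is an assumption (a local instance on port 8050), and newer releases may prefer different duplicate-filter settings, so check the README for the version you have installed.

```python
# settings.py (fragment): enable scrapy-splash in a Scrapy project.

# Assumption: Splash runs locally on port 8050; adjust to your own host and port.
SPLASH_URL = 'http://localhost:8050'

# Downloader middlewares that route requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Avoids storing duplicate copies of Splash arguments with each request.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```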

