Adding a wait-for-element while performing a SplashRequest in Python Scrapy

Question

I am trying to scrape a few dynamic websites using Splash for Scrapy in Python. However, in certain cases Splash does not wait for the complete page to load. A brute-force workaround is to add a large wait time (e.g. 5 seconds in the snippet below). However, this is extremely inefficient and still fails to load certain data (sometimes the content takes longer than 5 seconds to load). Is there some sort of wait-for-element condition that can be passed with these requests?

yield SplashRequest(
    url,
    self.parse,
    args={'wait': 5},  # fixed 5-second wait after the page loads
    headers={
        'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36",
    },
)

Recommended answer

Yes, you can write a Lua script to do that. Something like this:

function main(splash)
  splash:set_user_agent(splash.args.ua)
  assert(splash:go(splash.args.url))

  -- requires Splash 2.3  
  while not splash:select('.my-element') do
    splash:wait(0.1)
  end
  return {html=splash:html()}
end

Before Splash 2.3, you can use splash:evaljs('!document.querySelector(".my-element")') instead of not splash:select('.my-element').

Save this script to a variable (lua_script = """ ... """). Then you can send a request like this:

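yield SplashRequest(
    url,
    self.parse,
    endpoint='execute',  # run the Lua script instead of the default render endpoint
    args={
        'lua_source': lua_script,
        # available in the script as splash.args.ua
        'ua': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36"
    }
)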
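
One caveat with the loop above is that it never exits if the element does not appear. A minimal sketch of a bounded variant follows, assuming the selector and a maximum wait are passed in through args. The names build_wait_args, selector and max_wait are hypothetical and not part of Splash or scrapy-splash; only the splash:* calls already used in the answer are relied on.

```python
# Hypothetical helper: build_wait_args, selector and max_wait are illustrative
# names, not part of Splash or scrapy-splash.

LUA_WAIT_FOR_ELEMENT = """
function main(splash)
  splash:set_user_agent(splash.args.ua)
  assert(splash:go(splash.args.url))

  -- Poll for the element (splash:select requires Splash 2.3),
  -- but give up after max_wait seconds instead of looping forever.
  local waited = 0
  while not splash:select(splash.args.selector) do
    if waited >= splash.args.max_wait then
      error("timed out waiting for " .. splash.args.selector)
    end
    splash:wait(0.1)
    waited = waited + 0.1
  end
  return {html=splash:html()}
end
"""


def build_wait_args(selector, user_agent, max_wait=10.0):
    """Build the `args` dict for a SplashRequest sent to the 'execute' endpoint."""
    if max_wait <= 0:
        raise ValueError("max_wait must be positive")
    return {
        "lua_source": LUA_WAIT_FOR_ELEMENT,
        "selector": selector,  # read in Lua as splash.args.selector
        "ua": user_agent,      # read in Lua as splash.args.ua
        "max_wait": max_wait,  # read in Lua as splash.args.max_wait
    }
```

In a spider this would be used as yield SplashRequest(url, self.parse, endpoint='execute', args=build_wait_args('.my-element', user_agent)). If the element never shows up, the script raises a Lua error, so the request fails instead of hanging. Splash's own request timeout still applies, so max_wait should stay below it.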

See the scripting tutorial and reference for more details on how to write Splash Lua scripts.
