How to scrape pages which have lazy loading


Problem description


Here is the code I used to parse the web page. I ran it in the Rails console, but I'm not getting any output. The site I want to scrape uses lazy loading.

require 'nokogiri'
require 'open-uri'

page = 1
while true
  url = "http://www.justdial.com/functions" +
        "/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits" +
        "&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=#{page}"

  doc = Nokogiri::HTML(open(url))
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)
  d = doc.css(".rslwrp")
  d.each do |t|
    puts t.css(".jrcw").text
    puts t.css("span.jcn").text
    puts t.css(".jaid").text
    puts t.css(".estd").text
    page += 1
  end
end

Solution

You have 2 options here:

  1. Switch from pure HTTP scraping to a tool that supports JavaScript evaluation, such as Capybara (with a proper driver selected). This can be slow, since you're running a headless browser under the hood, plus you'll have to set some timeouts or figure out another way to make sure the blocks of text you're interested in are loaded before you start any scraping.

  2. The second option is to use the Web Developer console to figure out how those blocks of text are loaded (which AJAX calls, their parameters, etc.) and implement those calls in your scraper. This is a more advanced approach, but more performant, since you don't do any of the extra work involved in option 1.

Have a nice day!

UPDATE:

Your code above doesn't work because the response is HTML code wrapped in a JSON object, while you're trying to parse it as raw HTML. It looks like this:

{
  "error": 0,
  "msg": "request successful",
  "paidDocIds": "some ids here",
  "itemStartIndex": 20,
  "lastPageNum": 50,
  "markup": 'LOTS AND LOTS AND LOTS OF MARKUP'
}

What you need to do is unwrap the JSON and then parse the markup as HTML:

require 'json' 

json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
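To make the unwrap step concrete, here is a self-contained sketch with the live HTTP call replaced by a canned response string (the field names come from the sample payload above; the error messages are my own):

```ruby
require 'json'

# Defensive unwrapping of the JSON envelope: check the error field and the
# markup field before handing anything to an HTML parser.
def extract_markup(body)
  json = JSON.parse(body)
  raise "API returned error code #{json['error']}" unless json['error'].to_i.zero?
  markup = json['markup']
  raise 'markup field is missing or empty' if markup.nil? || markup.empty?
  markup
end

sample = '{"error": 0, "msg": "request successful", "lastPageNum": 50,' \
         ' "markup": "<div class=\"rslwrp\">result</div>"}'
puts extract_markup(sample)
# => <div class="rslwrp">result</div>
```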

I'd also advise against using open-uri, since your code may become vulnerable if you use dynamic URLs, because of the way open-uri works (read the linked article for the details). Use good, more feature-rich libraries such as HTTParty or RestClient instead.
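For comparison, building the request explicitly with the stdlib Net::HTTP (which HTTParty and RestClient wrap with friendlier APIs) might look like the sketch below. The query string is abbreviated, and the User-Agent header is my own assumption, since some sites reject requests without one:

```ruby
require 'net/http'
require 'uri'

# Build (but don't send) a GET request explicitly, instead of open-uri's
# implicit open(url). Sending it would be: Net::HTTP.get_response(uri).
def build_request(url)
  uri = URI.parse(url)
  req = Net::HTTP::Get.new(uri)
  req['User-Agent'] = 'Mozilla/5.0' # assumption: site may reject an empty UA
  [uri, req]
end

url = 'http://www.justdial.com/functions/ajxsearch.php?act=pagination&page=2'
uri, req = build_request(url)
puts uri.host  # => www.justdial.com
puts req.path  # => /functions/ajxsearch.php?act=pagination&page=2
```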

UPDATE 2: Minimal working script for me:

require 'json'
require 'open-uri'
require 'nokogiri'

url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'

json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
puts doc.at_css('#newphoto10').attr('title')
# => Dr Raaj Batra Lal Kitab Expert in East Patel Nagar, Delhi
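Since the original goal was paginating through all results, the lastPageNum field in the payload can drive the loop's exit condition (instead of the unbounded while true in the question). A sketch, with the HTTP fetch stubbed out by a hypothetical fetch_json helper so it runs standalone:

```ruby
require 'json'

# Stub standing in for the real HTTP call; a real scraper would fetch and
# JSON.parse the ajxsearch.php response for the given page number.
def fetch_json(page)
  { 'error' => 0, 'lastPageNum' => 3, 'markup' => "<div>page #{page}</div>" }
end

markups = []
page = 1
loop do
  json = fetch_json(page)
  break unless json['error'].to_i.zero?     # stop on API-reported errors
  markups << json['markup']
  break if page >= json['lastPageNum'].to_i # lastPageNum bounds the loop
  page += 1
end
puts markups.size
# => 3
```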
