HTML is read before fully loaded using open-uri and nokogiri

Problem description

I'm using open-uri and nokogiri with Ruby to do some simple web crawling. The problem is that sometimes the HTML is read before the page is fully loaded. In those cases, I cannot fetch any content other than the loading icon and the nav bar. What is the best way to tell open-uri or nokogiri to wait until the page is fully loaded?

Currently my script looks like this:

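require 'nokogiri'
require 'open-uri'
require 'openssl' # makes sure OpenSSL::SSL::VERIFY_NONE is defined

url = "https://www.the-page-i-wanna-crawl.com"
# Kernel#open no longer accepts URLs as of Ruby 3.0, so call URI.open.
# Note that VERIFY_NONE disables certificate verification.
doc = Nokogiri::HTML(URI.open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE))
puts doc.at_css("h2").text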

Recommended answer

What you describe is not possible. The result of open is only passed to HTML after the open method has returned the full response body, so nokogiri never sees a partially downloaded document.
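To illustrate the point: if an element is missing from the parsed document, it was most likely never in the markup the server sent, rather than arriving late. The sketch below uses made-up markup (`RAW_HTML`) and a hypothetical helper (`tag_in_markup?`) to show a page whose `h2` exists only after a browser runs its script. Neither name comes from open-uri or nokogiri, and the regex check is a rough diagnostic, not a substitute for a real parser.

```ruby
# Hypothetical markup, standing in for what a script-driven page might return.
# The <h2> would only exist after a browser executes the script.
RAW_HTML = <<~HTML
  <html>
    <body>
      <nav>Home</nav>
      <div id="app"><img src="loading.gif" alt="Loading"></div>
      <script>
        var heading = document.createElement("h2");
        heading.textContent = "Loaded later";
        document.getElementById("app").appendChild(heading);
      </script>
    </body>
  </html>
HTML

# Hypothetical helper: true when the raw markup contains an opening tag
# with the given name. A crude regex check, not a real HTML parser.
def tag_in_markup?(html, tag)
  html.match?(/<#{Regexp.escape(tag)}[\s>]/i)
end

puts tag_in_markup?(RAW_HTML, "nav") # => true
puts tag_in_markup?(RAW_HTML, "h2")  # => false
```

With markup like this, `doc.at_css("h2")` returns nil, so calling `.text` on it raises a NoMethodError. No amount of waiting on the open-uri side would change that.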

I suspect that the page itself uses AJAX to load its content, as has been suggested in the comments. In that case, you can use Watir to fetch the page through a real browser, which runs the page's JavaScript:

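require 'nokogiri'
require 'watir'

# Drive a real browser so the page's JavaScript gets executed.
browser = Watir::Browser.new
browser.goto 'https://www.the-page-i-wanna-crawl.com'

# browser.html returns the DOM as currently rendered by the browser.
doc = Nokogiri::HTML.parse(browser.html)
browser.close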

Note that this might open a browser window.
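Even in a browser, AJAX content may show up a moment after navigation finishes, so it can help to poll until the element you need exists. The following is a minimal sketch of that idea using only the standard library. `wait_for` is a hypothetical helper, not part of Watir or nokogiri, and the "page" here is simulated with a counter so the example runs without a browser.

```ruby
# Hypothetical helper: call the block repeatedly until it returns a truthy
# value, or raise once `timeout` seconds have elapsed.
def wait_for(timeout: 10, interval: 0.5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Simulated page: the heading "appears" on the third check.
attempts = 0
heading = wait_for(timeout: 2, interval: 0.01) do
  attempts += 1
  attempts >= 3 ? "Loaded later" : nil
end
puts heading
```

In the Watir script, the block could instead be something like `Nokogiri::HTML(browser.html).at_css("h2")`, which stays nil until the heading has been rendered. Watir also ships its own waiting facilities (for example `wait_until` on elements), which are probably the better choice in real code.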
