HTML is read before fully loaded using open-uri and nokogiri
Question
I'm using open-uri and nokogiri with Ruby to do some simple web crawling.
There is one problem: sometimes the HTML is read before the page is fully loaded. In such cases, I cannot fetch any content other than the loading icon and the nav bar.
What is the best way to tell open-uri or nokogiri to wait until the page is fully loaded?
Currently my script looks like this:
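require 'nokogiri'
require 'open-uri'

url = "https://www.the-page-i-wanna-crawl.com"
# URI.open is the open-uri entry point; Kernel#open no longer accepts URLs as of Ruby 3.0.
# Note that VERIFY_NONE disables TLS certificate verification.
doc = Nokogiri::HTML(URI.open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE))
puts doc.at_css("h2").text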
Recommended answer
What you describe is not possible. The result of open is only passed to HTML after the open method has returned the full value.
I suspect that the page itself uses AJAX to load its content, as has been suggested in the comments. In that case, you can use Watir to fetch the page through a browser:
require 'nokogiri'
require 'watir'
browser = Watir::Browser.new
browser.goto 'https://www.the-page-i-wanna-crawl.com'
doc = Nokogiri::HTML.parse(browser.html)
This might open a browser window though.