Java Selenium: how can I get the HTML of a webpage without first loading the page?


Problem description


Using Selenium WebDriver for Java, is it possible to get the HTML of a webpage given a specified URL?

I know that, once a webpage is loaded in a browser, the HTML can be obtained using WebDriver.getPageSource(). However, for improved efficiency, is it possible to obtain the HTML without loading the page in a browser first?

Solution

You can achieve this using a headless browser. The page is still loaded, but no browser window is opened.

A headless browser is a web browser without a graphical user interface. It behaves just like a regular browser but does not display any GUI.
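As a minimal sketch of this idea, the following uses HtmlUnitDriver, the Selenium driver for HtmlUnit, to fetch a page and print its source without opening a window. It assumes the separate htmlunit-driver artifact is on the classpath alongside Selenium; the class name HeadlessPageSource and the default URL are placeholders chosen for illustration.

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class HeadlessPageSource {
    public static void main(String[] args) {
        // Placeholder URL for illustration; pass your own as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com";

        // HtmlUnitDriver runs in-process: no browser window and no installed browser.
        // The no-argument constructor leaves JavaScript disabled;
        // new HtmlUnitDriver(true) would enable it.
        WebDriver driver = new HtmlUnitDriver();
        try {
            driver.get(url);                      // the page is still fetched and parsed
            String html = driver.getPageSource(); // same call as with a GUI browser
            System.out.println(html);
        } finally {
            driver.quit();                        // release the driver's resources
        }
    }
}
```

The rest of the WebDriver API is used exactly as with a GUI browser, so existing code usually only needs the driver construction changed.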

Headless browsers are typically used in the following situations:

  • Your central build tool does not have any browser installed. To run basic sanity tests after every build, you can use a headless browser.

  • You want to write a crawler that goes through different pages and collects data. A headless browser is a good choice here because you do not care about opening a browser window; all you need is access to the web pages.

  • You would like to simulate multiple browser versions on the same machine. In that case you would want a headless browser, because most of them can simulate different browser versions.
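For the crawler case, when the pages do not depend on JavaScript, a plain HTTP request may be enough and avoids a browser altogether. The sketch below is neither a headless browser nor part of Selenium: it uses the JDK's java.net.http.HttpClient (Java 11 or later) to download the raw HTML, plus a deliberately crude regular expression to collect href values. The class name PlainHtmlFetcher and its method names are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlainHtmlFetcher {
    // Crude matcher for quoted href values; a real crawler should use an HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Downloads the response body as sent by the server; no JavaScript is executed. */
    static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** Returns the quoted href attribute values in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java PlainHtmlFetcher.java <url>");
            return;
        }
        String html = fetchHtml(args[0]);
        System.out.println(html.length() + " characters fetched");
        extractLinks(html).forEach(System.out::println);
    }
}
```

The trade-off is that the returned HTML is what the server sent, not the DOM after scripts have run, so pages that build their content with JavaScript still need a headless browser.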

Things to pay attention to before using a headless browser

Headless browsers are simulation programs; they are not your real browsers. Most of them have evolved enough to approximate a real browser closely. Still, you would not want to run all your tests in a headless browser. JavaScript is one area where you need to be really careful. Although JavaScript is standardized, each browser implements it with its own small differences, and the same is true of headless browsers. For example, the HtmlUnit headless browser uses the Rhino JavaScript engine, which is not used by any other browser.

Some examples of headless drivers include:

  • HtmlUnit
  • Ghost
  • PhantomJS
  • ZombieJS
  • Watir-webdriver
